Abstract
Animal pose estimation has received increasing attention in recent years. The main challenge of this task, compared with its human counterpart, is the much greater diversity of animal species. To address this issue, we design a keypoint-interactive Transformer for high-resolution animal pose estimation, namely KITPose. Since a high-resolution network preserves local perception while the self-attention module of the Transformer excels at modelling long-range dependencies, we equip the high-resolution network with a Transformer to enhance the model capacity, achieving keypoint interaction at the decision stage. Besides, to better fit the pose estimation task, we train the model parameters and joint weights simultaneously, automatically adjusting the loss weight of each specific keypoint. The experimental results obtained on the AP-10K and ATRW datasets demonstrate the merits of KITPose, as well as its superior performance over state-of-the-art approaches.
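To make the two ideas in the abstract concrete, the following PyTorch sketch illustrates (i) a Transformer encoder operating on per-keypoint tokens drawn from high-resolution features, so that keypoints interact at the decision stage, and (ii) a heatmap loss with learnable per-joint weights trained jointly with the model. All module names, shapes, and hyper-parameters (the 1x1-conv token projection, 64x64 heatmaps, softmax-normalised joint weights) are our illustrative assumptions, not the authors' released implementation.

```python
# A minimal, hypothetical sketch of KITPose's two ingredients as described in
# the abstract. Everything here is an illustrative assumption, not the paper's code.
import torch
import torch.nn as nn


class KeypointInteractionHead(nn.Module):
    """Turns backbone feature maps into one token per keypoint and lets a
    Transformer encoder model long-range dependencies between keypoints."""

    def __init__(self, num_keypoints: int, embed_dim: int = 256,
                 num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        # Assumed: a 1x1 conv produces one channel (token) per keypoint.
        self.to_tokens = nn.LazyConv2d(num_keypoints, kernel_size=1)
        self.proj = nn.LazyLinear(embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.head = nn.Linear(embed_dim, 64 * 64)  # assumed heatmap size

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) high-resolution features, e.g. from HRNet.
        tokens = self.to_tokens(feats).flatten(2)  # (B, K, H*W)
        tokens = self.proj(tokens)                 # (B, K, D)
        tokens = self.encoder(tokens)              # keypoint interaction
        heatmaps = self.head(tokens)               # (B, K, 64*64)
        return heatmaps.view(feats.size(0), -1, 64, 64)


class AdaptiveJointWeightLoss(nn.Module):
    """Heatmap MSE with a learnable loss weight per keypoint, optimised
    jointly with the model parameters, as the abstract describes."""

    def __init__(self, num_keypoints: int):
        super().__init__()
        self.joint_weights = nn.Parameter(torch.ones(num_keypoints))

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Per-keypoint MSE over batch and spatial dims, then a weighted sum;
        # softmax normalisation of the weights is our assumption.
        per_joint = ((pred - target) ** 2).mean(dim=(0, 2, 3))  # (K,)
        return (self.joint_weights.softmax(dim=0) * per_joint).sum()


# Usage sketch: AP-10K annotates 17 keypoints; feats would come from the backbone.
model = KeypointInteractionHead(num_keypoints=17)
loss_fn = AdaptiveJointWeightLoss(num_keypoints=17)
feats = torch.randn(2, 32, 64, 64)
loss = loss_fn(model(feats), torch.rand(2, 17, 64, 64))
```

Because the joint weights are trainable, gradient descent redistributes the loss emphasis across keypoints instead of relying on hand-tuned per-joint weights; how KITPose regularises this is described in the paper itself.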
References
Yu, H., Xu, Y., Zhang, J., Zhao, W., Guan, Z., Tao, D.: AP-10K: a benchmark for animal pose estimation in the wild. arXiv preprint arXiv:2108.12617 (2021)
Li, S., Li, J., Tang, H., Qian, R., Lin, W.: ATRW: a benchmark for Amur tiger re-identification in the wild. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 2590–2598 (2020)
Pereira, T.D., et al.: SLEAP: multi-animal pose tracking. bioRxiv (2020)
Pereira, T.D., et al.: SLEAP: a deep learning system for multi-animal pose tracking. Nat. Methods 19, 486–495 (2022). https://doi.org/10.1038/s41592-022-01426-1
Lauer, J., et al.: Multi-animal pose estimation, identification and tracking with DeepLabCut. Nat. Methods 19, 496–504 (2022). https://doi.org/10.1038/s41592-022-01443-0
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2D human pose estimation: new benchmark and state of the art analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3686–3693 (2014)
Wei, S.E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4732 (2016)
Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299 (2017)
Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 483–499. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_29
Fang, H.S., Xie, S., Tai, Y.W., Lu, C.: RMPE: regional multi-person pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2334–2343 (2017)
Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 472–487. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_29
Wang, J., et al.: Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 43(10), 3349–3364 (2020)
Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5693–5703 (2019)
Cheng, B., et al.: HigherHRNet: scale-aware representation learning for bottom-up human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5386–5395 (2020)
MMPose Contributors: OpenMMLab pose estimation toolbox and benchmark. https://github.com/open-mmlab/mmpose (2020)
Li, K., et al.: Pose recognition with cascade transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1944–1953 (2021)
Yang, S., Quan, Z., Nie, M., Yang, W.: TransPose: keypoint localization via transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11802–11812 (2021)
Li, Y., et al.: TokenPose: learning keypoint tokens for human pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11313–11322 (2021)
Yuan, Y., et al.: HRFormer: high-resolution vision transformer for dense prediction. In: Advances in Neural Information Processing Systems 34 (2021)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Mathis, A., et al.: Pretraining boosts out-of-domain robustness for pose estimation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1859–1868 (2021)
Graving, J.M., Chae, D., et al.: DeepPoseKit, a software toolkit for fast and robust animal pose estimation using deep learning. eLife 8, e47994 (2019). https://doi.org/10.7554/eLife.47994
Cao, J., Tang, H., Fang, H.S., Shen, X., Lu, C., Tai, Y.W.: Cross-domain adaptation for animal pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9498–9507 (2019)
Li, C., Lee, G.H.: From synthetic to real: unsupervised domain adaptation for animal pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1482–1491 (2021)
Labuguen, R., et al.: MacaquePose: a novel “in the wild” macaque monkey pose dataset for markerless motion capture. bioRxiv (2020)
Pereira, T.D., et al.: Fast animal pose estimation using deep neural networks. Nat. Methods 16(1), 117–125 (2019)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Newell, A., Huang, Z., Deng, J.: Associative embedding: end-to-end learning for joint detection and grouping. In: Advances in Neural Information Processing Systems 30 (2017)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems 30 (2017)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Mu, J., Qiu, W., Hager, G.D., Yuille, A.L.: Learning from synthetic animals. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12386–12395 (2020)
Zhang, F., Zhu, X., Dai, H., Ye, M., Zhu, C.: Distribution-aware coordinate representation for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7093–7102 (2020)
Geng, Z., et al.: Bottom-up human pose estimation via disentangled keypoint regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14676–14686 (2021)
Luo, Z., et al.: Rethinking the heatmap regression for bottom-up human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13264–13273 (2021)
Jin, L., et al.: Grouping by center: predicting centripetal offsets for the bottom-up human pose estimation. IEEE Trans. Multimedia (2022)
Harding, E.J., Paul, E.S., Mendl, M.: Cognitive bias and affective state. Nature 427(6972), 312 (2004)
Touvron, H., et al.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021)
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China (U1836218, 61876072, 61902153, 62106089).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Rao, J., Xu, T., Song, X., Feng, ZH., Wu, XJ. (2022). KITPose: Keypoint-Interactive Transformer for Animal Pose Estimation. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13534. Springer, Cham. https://doi.org/10.1007/978-3-031-18907-4_51
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18906-7
Online ISBN: 978-3-031-18907-4
eBook Packages: Computer Science, Computer Science (R0)