Self-attentive 3D human pose and shape estimation from videos

Published: 01 December 2021

Abstract

We consider the task of estimating 3D human pose and shape from videos. While existing frame-based approaches have made significant progress, they process each frame independently and thus often produce temporally inconsistent predictions. In this work, we present a video-based learning algorithm for 3D human pose and shape estimation. The key insights of our method are two-fold. First, to address the temporal inconsistency issue, we exploit temporal information in videos and propose a self-attention module that jointly considers short-range and long-range dependencies across frames, resulting in temporally coherent estimates. Second, we model human motion with a forecasting module so that transitions between adjacent frames are smooth. We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets. Extensive experimental results show that our algorithm performs favorably against state-of-the-art methods.
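The abstract's core mechanism, self-attention over per-frame features to mix short- and long-range temporal dependencies, can be illustrated with a minimal sketch. This is not the authors' SPS-Net implementation; the function name, feature dimensions, and random projection matrices below are illustrative assumptions, showing only the generic scaled dot-product attention that such a module builds on.

```python
# Hypothetical sketch of temporal self-attention over per-frame features.
# NOT the paper's SPS-Net code: names, shapes, and weights are illustrative.
import numpy as np

def temporal_self_attention(frame_feats, w_q, w_k, w_v):
    """Mix information across all frames of a video feature sequence.

    frame_feats: (T, D) array, one D-dim feature vector per frame.
    w_q, w_k, w_v: (D, D) projection matrices (learned in practice).
    Returns a (T, D) array where each frame's feature aggregates
    both nearby and distant frames, weighted by learned affinity.
    """
    q = frame_feats @ w_q                          # queries, one per frame
    k = frame_feats @ w_k                          # keys
    v = frame_feats @ w_v                          # values
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (T, T) frame-to-frame affinities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over frames
    return attn @ v                                # weighted mix of all frames

rng = np.random.default_rng(0)
T, D = 8, 16                                       # 8 frames, 16-dim features
feats = rng.normal(size=(T, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
out = temporal_self_attention(feats, w_q, w_k, w_v)
print(out.shape)  # (8, 16)
```

Because the (T, T) affinity matrix connects every frame pair, a frame can borrow evidence from distant frames (long-range) as well as its neighbors (short-range), which is what yields the temporally coherent estimates described above.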

Highlights

We propose the SPS-Net for estimating 3D human pose and shape from videos.
We propose a consistency loss to address the lack of camera parameter annotations.
We develop a self-supervised learning scheme to address the occlusion issues.
The proposed method performs favorably against the state-of-the-art methods.



Published In

Computer Vision and Image Understanding, Volume 213, Issue C, December 2021, 94 pages

Publisher: Elsevier Science Inc., United States


            Author Tags

            1. 3D human pose and shape estimation
            2. Self-supervised learning
            3. Occlusion handling

            Qualifiers

            • Research-article
