Self-attentive 3D human pose and shape estimation from videos

Published: 01 December 2021

Abstract

We consider the task of estimating 3D human pose and shape from videos. While existing frame-based approaches have made significant progress, they process each frame independently and thus often produce temporally inconsistent predictions. In this work, we present a video-based learning algorithm for 3D human pose and shape estimation. The key insights of our method are two-fold. First, to address the temporal inconsistency issue, we exploit temporal information in videos and propose a self-attention module that jointly considers short-range and long-range dependencies across frames, resulting in temporally coherent estimates. Second, we model human motion with a forecasting module so that transitions between adjacent frames are smooth. We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets. Extensive experimental results show that our algorithm performs favorably against state-of-the-art methods.
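The abstract's core mechanism, self-attention over per-frame features to mix short- and long-range temporal dependencies, can be illustrated with a minimal sketch. This is not the authors' SPS-Net implementation; the function name, feature dimensions, and random projection matrices below are illustrative assumptions, showing only the generic scaled dot-product attention that such a module builds on.

```python
# Hypothetical sketch of temporal self-attention over per-frame features.
# NOT the paper's SPS-Net code: names, shapes, and weights are illustrative.
import numpy as np

def temporal_self_attention(frame_feats, w_q, w_k, w_v):
    """Mix information across all frames of a video feature sequence.

    frame_feats: (T, D) array, one D-dim feature vector per frame.
    w_q, w_k, w_v: (D, D) projection matrices (learned in practice).
    Returns a (T, D) array where each frame's feature aggregates
    both nearby and distant frames, weighted by learned affinity.
    """
    q = frame_feats @ w_q                          # queries, one per frame
    k = frame_feats @ w_k                          # keys
    v = frame_feats @ w_v                          # values
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (T, T) frame-to-frame affinities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over frames
    return attn @ v                                # weighted mix of all frames

rng = np.random.default_rng(0)
T, D = 8, 16                                       # 8 frames, 16-dim features
feats = rng.normal(size=(T, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
out = temporal_self_attention(feats, w_q, w_k, w_v)
print(out.shape)  # (8, 16)
```

Because the (T, T) affinity matrix connects every frame pair, a frame can borrow evidence from distant frames (long-range) as well as its neighbors (short-range), which is what yields the temporally coherent estimates described above.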

Highlights

We propose the SPS-Net for estimating 3D human pose and shape from videos.
We propose a consistency loss to address the lack of camera parameter annotations.
We develop a self-supervised learning scheme to address the occlusion issues.
The proposed method performs favorably against the state-of-the-art methods.



Published In

Computer Vision and Image Understanding, Volume 213, Issue C, December 2021, 94 pages

Publisher: Elsevier Science Inc., United States


            Author Tags

            1. 3D human pose and shape estimation
            2. Self-supervised learning
            3. Occlusion handling

            Qualifiers

            • Research-article
