Self-attentive 3D human pose and shape estimation from videos

Published: 01 December 2021


We consider the task of estimating 3D human pose and shape from videos. While existing frame-based approaches have made significant progress, these methods are independently applied to each image, thereby often leading to inconsistent predictions. In this work, we present a video-based learning algorithm for 3D human pose and shape estimation. The key insights of our method are two-fold. First, to address the inconsistent temporal prediction issue, we exploit temporal information in videos and propose a self-attention module that jointly considers short-range and long-range dependencies across frames, resulting in temporally coherent estimations. Second, we model human motion with a forecasting module that allows the transition between adjacent frames to be smooth. We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets. Extensive experimental results show that our algorithm performs favorably against the state-of-the-art methods.
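As a rough illustration of how self-attention across frames can relate both nearby and distant frames, the following minimal sketch applies scaled dot-product attention to a sequence of per-frame feature vectors. It is not the SPS-Net implementation: the function name `temporal_self_attention` is hypothetical, and using the raw features directly as queries, keys, and values (with no learned projections) is a simplifying assumption made only for this example.

```python
import math


def temporal_self_attention(frames):
    """Scaled dot-product self-attention over per-frame feature vectors.

    frames: list of T feature vectors (lists of floats), one per video frame.
    Returns T vectors, each a weighted average of all frame features.
    Assumption for this sketch: queries, keys, and values are the raw
    features themselves, with no learned projection matrices.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in frames:
        # Similarity of this frame to every frame, whether near or far in time.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
        # Numerically stable softmax over the temporal axis.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Attention-weighted sum of the frame features.
        outputs.append(
            [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
        )
    return outputs
```

Because every output is a convex combination of all frame features, each frame's representation can draw on the whole clip, which is the general property that lets attention encourage temporally coherent estimates.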


We propose the SPS-Net for estimating 3D human pose and shape from videos.
We propose a consistency loss to address the lack of camera parameter annotations.
We develop a self-supervised learning scheme to address the occlusion issues.
The proposed method performs favorably against the state-of-the-art methods.




            Published In

            Computer Vision and Image Understanding  Volume 213, Issue C
            Dec 2021
            94 pages


            Author Tags

            1. 41A05
            2. 41A10
            3. 65D05
            4. 65D17

            Author Tags

            1. 3D human pose and shape estimation
            2. Self-supervised learning
            3. Occlusion handling


            • Research-article


