Article
Self-Attention Network for Human Pose Estimation
Hailun Xia 1,2,3, * and Tianyang Zhang 1,2,3
1 Beijing Key Laboratory of Network System Architecture and Convergence, Beijing University of Posts and
Telecommunications, Beijing 100876, China; zhangtianyang@bupt.edu.cn
2 Beijing Laboratory of Advanced Information Networks, Beijing University of Posts and Telecommunications,
Beijing 100876, China
3 School of Information and Communication Engineering, Beijing University of Posts and Telecommunications,
Beijing 100876, China
* Correspondence: xiahailun@bupt.edu.cn
Abstract: Estimating the positions of human joints from monocular single RGB images has been a
challenging task in recent years. Despite great progress in human pose estimation with convolutional
neural networks (CNNs), a central problem still exists: the relationships and constraints, such as
symmetric relations of human structures, are not well exploited in previous CNN-based methods.
Considering the effectiveness of combining local and nonlocal consistencies, we propose an end-to-
end self-attention network (SAN) to alleviate this issue. In SANs, attention-driven and long-range
dependency modeling are adopted between joints to compensate for local content and mine details
from all feature locations. To enable an SAN for both 2D and 3D pose estimations, we also design a
compatible, effective and general joint learning framework to mix the usage of data of different
dimensions. We evaluate the proposed network on challenging benchmark datasets. The experimental
results show that our method achieves competitive results on the Human3.6M, MPII
and COCO datasets.
Keywords: human pose estimation; self-attention network; joint learning framework; local and
nonlocal consistencies; end-to-end training
high-level semantic information, which can lead to the optimization failing to achieve
better performance. Although a larger kernel size and more convolution blocks can alleviate
this problem, they also increase computational cost and reduce efficiency.
To deal with this problem simply, effectively and efficiently, we introduce a self-
attention mechanism [8] into our model, which we call the Self-Attention Network (SAN).
It models nonlocal relationships between feature maps and combines long-distance
information into the original feature maps, which can significantly increase the performance
and efficiency of the model. The module is fully integrated into the network, so the training
procedure is end-to-end. SANs produce attention masks to reweight original features. These
masks make the model focus more on nonlocal information, complementing convolution
operators and learning contextual features and multilevel semantic information across feature maps.
This approach is general and can be used for both 2D and 3D pose estimation
without distinction. We propose a joint learning framework to enable the mixed usage of
2D and 3D data: richly annotated 2D data can complement the small-scale 3D data. This
preserves end-to-end training and improves the generalization of the model.
The main contributions of this work are three-fold:
(1) We propose a simple yet surprisingly effective self-attention approach (SAN) which
exploits long-range dependency between feature maps. It can increase representation
power and performance of convolution operators.
(2) We design a joint learning framework to enable usage of mixed 2D and 3D data such
that the model can output both 2D and 3D poses and enhance generalization. As a
by-product, our approach generates high quality 3D poses for images in the wild.
(3) In experiments, SAN achieves competitive results on the 3D Human3.6M dataset [9],
reaching 48.6 mm mean per joint position error (MPJPE). On the 2D datasets, SAN
achieves competitive results: 91.7% (PCKh@0.5) on MPII [10] and 71.8 AP on COCO [11].
2. Related Work
Human pose estimation has been a widely discussed topic in the past. It is divided
into 2D and 3D human pose estimations. In this section, we focus on recent learning-based
methods that are most relevant to our work. We will also discuss related works on visual
attention mechanisms for complementing convolution operators.
2D human pose estimation: In recent years, significant progress has been made in 2D
pose estimation due to the development of deep learning and rich annotated datasets. The
authors of [12] stacked bottom-up and top-down processing with intermediate supervision
to improve the performance. These methods are used by many researchers for 2D detection
in 3D pose estimation tasks. The authors of [13] incorporated a stacked hourglass model
with a multicontext attention mechanism to refine the prediction. The authors of [14]
learned to focus on specific regions of different input features by combining a novel
attention model. Different from these, our approach adopts a self-attention mechanism to
increase the receptive field in an efficient way to learn more semantic information.
3D human pose estimation: There are two main ways to recover 3D skeleton informa-
tion. The first one divides the 3D pose estimation task into two sub-tasks: 2D pose estimation
and inference of a 3D pose from a 2D pose [15,16]. This method combines a 2D pose
detector [12] and a depth regression step to estimate 3D poses. In this method, the 2D/3D
poses are separated so as to generalize 3D pose estimation to in-the-wild images. The second one
directly infers the 3D pose from RGB images [7,17,18]. In this way, the training procedure
is end-to-end. We adopted this method for our approach. The authors of [17] proposed a
volumetric representation for 3D poses and adopted a coarse-to-fine strategy to refine the
prediction. The authors of [7] combined the benefits of regression and heat maps. We also
adopted this method to make the training process differentiable and to reduce quantization
error to improve network efficiency.
Mixed 2D/3D Data training: Although the end-to-end training process is concise, there
is a disadvantage: the small scale of annotated in-the-wild 3D data limits the performance
and the accuracy under domain shift. For this problem, the authors of [19] proposed a network
architecture that comprises a hidden space to encode 2D/3D features. The authors of [20]
proposed a weakly supervised approach to make use of large scale 2D data. The authors
of [7] used soft argmax to regress 2D/3D poses directly from images. The authors of [21]
proposed a method that combines 2D/3D data to compensate for the lack of 3D data. In
our approach, we propose a joint learning method that separates x, y, z heat maps to mix
2D and 3D data.
Self-Attention Mechanism: When humans look at global images, they pay more
attention to important areas and suppress other unnecessary information. Attention
mechanisms simulate human vision and have achieved great success in computer vision—
for example, scene segmentation, style transfer, image classification and action recognition.
In particular, self-attention has been proposed [8] to calculate the response at a position
in a sequence by attending to all positions within the same sequence. The authors of [22]
proposed a cross-modal self-attention module that captures the information between
linguistic and visual features. The authors of [23] assembled a self-attention mechanism
into a style-agnostic framework to catch salient characteristics within images. We propose a
network that extends the self-attention mechanism to the human pose estimation task, operating
on feature maps to learn anatomical relationships and constraints for better recognition in nonlocal regions.
3. Model Architecture
In this section, we first describe the problem formally and give an overview of our
approach. Then, we introduce the basic idea of our approach.
3.1. Overview
3D human pose estimation is a problem where, given a single RGB image or a series
of RGB images I = {I1 , I2 , . . . , Ii }, the human pose estimation process aims to localize 2D
(or 3D) human body joints in Euclidean space, denoted as Y = {y1 , y2 , . . . , yk }, yk ∈ R^2 (or R^3),
where k is the number of key-points.
As mentioned before, the input contains occlusions, including occlusion of the body by
objects in the scene and self-occlusion, as well as ambiguities such as appearance diversity and
lighting conditions. During the pose estimation stage, these weaknesses will
severely limit prediction ability. To solve this problem in an effective way, we adopt a new
architecture, as shown in Figure 1. This is an end-to-end framework including a backbone
block, self-attention network and upsampling block. Chief among them is the self-attention
network which picks up efficient features from input images and generates self-attention
masks to reweight the original feature maps to learn the long-range dependency between
global features. A self-attention layer can capture the relatedness between feature maps and
simulate long-distance multilevel associations across joints. The design of self-attention
networks will be explained in Section 3.3. Backbone blocks are mainly used to extract
features from image batches and upsampling blocks are used to regress feature maps to
higher resolutions to refine joint locations. Backbone and upsampling block designs will be
explained in Section 3.2.
Driven by the problem of lacking a 3D annotated dataset, we adopt a joint learning
framework which separates x, y, z location regression in the training process—explained in
Section 3.4. This method enables using mixed 2D and 3D data. It also increases the module
generalization to real-world scenarios and refines the performance.
Figure 1. Framework: Illustration of the proposed approach. The basic structure contains three parts. (a) ResNet backbone
which is used to extract features from images. (b) Self-Attention Network (SAN), which learns long-range dependency to
compensate the lost features from original images and reweight obtained feature maps. With increasing δ, the model will
depend more on nonlocal information than local content. (c) An upsampling block to regress the feature maps to higher
resolutions to refine joint locations.
3.2. Backbone and Upsampling Block Design
In the backbone block, we adopted ResNet [24] to extract features from input images.
ResNet replaces the traditional convolution + pooling layer of the deep neural network that
sweeps both horizontal and vertical directions across the image. It adds a skip connection
to ensure that higher layers perform at least as well as lower layers. Our model preserves conv1,
conv2_x, conv3_x, conv4_x and conv5_x and removes the Fully Connected (FC) layer in
ResNet. Because we use ResNet-50 and ResNet-152 as our backbone, the kernel size and
strides differ based on network depth. In the upsampling block, we implemented
deconvolution layers to regress the obtained feature maps to a higher resolution. This block
will refine the joint locations.
3.3. Self-Attention Network Design
When observing batches of input images including humans, we find that the relationships
and constraints between joints produce more useful information. Many human
pose estimation methods use convolution neural networks (CNNs), the performance of
which is limited by the valid receptive field, such that they only capture adjacent content
in feature maps and cannot process long-range relations or grasp high-level semantic
information. To compensate for this drawback, we propose a nonlocal approach called a
Self-Attention Network (SAN). SANs not only receive efficient features in a local region,
but also perceive contextual information over a wide range. The details are shown in
Figure 2.
Feature maps from the previous hidden layer X ∈ R^(C×B×H×W) (C is the channel number, B
is the batch size, H × W is the pixel number) are first transformed to three feature spaces, where:

W_q^t = β_q x_t , W_k^t = β_k x_t , W_v^t = β_v x_t (1)

t indicates the target feature map index. All three space vectors come from the
same input. W_q^t is a query space vector. W_k^t is a key space vector. W_q^t and W_k^t are used
to calculate weights which represent the similarity features between feature maps. W_v^t
is a value space vector and is an output from the original feature maps. Reweighting the
long-term information on W_v^t enables the network to capture joint relationships easily. β_q is
a weight matrix of the query space vector, which maps the input matrix of B × C × W × H
dimensions to B × C/8 × W × H dimensions. β_q needs to be transposed for the following
operations. β_k is a weight matrix of the key space vector, which maps the input matrix
of B × C × W × H dimensions to B × C/8 × W × H dimensions. β_v is a weight matrix of
the value space vector, which maps the input matrix of B × C × W × H dimensions to
B × C × W × H dimensions. β_q , β_k and β_v are all trainable weight matrices that transform
feature maps to corresponding vector spaces and were implemented as 1 × 1 convolutions
in our experiment.
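To make Equation (1) concrete, the following is a minimal PyTorch sketch of the three projections, assuming a standard NCHW tensor layout and the 1 × 1 convolution implementation with the C/8 channel reduction described above; the module and variable names are our own illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class SANProjections(nn.Module):
    """Query/key/value projections of Equation (1), implemented as 1x1 convolutions."""
    def __init__(self, channels: int):
        super().__init__()
        # beta_q and beta_k reduce C to C/8; beta_v keeps all C channels.
        self.beta_q = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.beta_k = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.beta_v = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor):
        b, c, h, w = x.shape
        # Flatten spatial positions so every location can attend to every other one.
        q = self.beta_q(x).view(b, -1, h * w)   # B x C/8 x N
        k = self.beta_k(x).view(b, -1, h * w)   # B x C/8 x N
        v = self.beta_v(x).view(b, -1, h * w)   # B x C   x N
        # Similarity between all pairs of locations (query transposed, as noted above).
        attn = torch.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)  # B x N x N
        return v, attn
```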
Figure 2. An illustration of a self-attention network. (a) SAN's inputs are three space vectors from
feature maps. (b) Detail of self-attention map building process. The output of the SAN is a mask
which will reweight the original feature maps.
We added these self-attention masks to the original feature maps with a trainable variable
δ, where

A_t = W_v^t ⊙ P_t (2)
refine the prediction. Inspired by [24], this skip connection also receives more information
and mitigates the problem of a vanishing gradient.
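Continuing the sketch above, the attention map reweights the value features and the result is added back to the input through the trainable scalar δ, acting as a skip connection; this reflects our reading of the description around Equation (2), not a verbatim reproduction of the authors' implementation, and the zero initialization of δ is an assumption borrowed from common self-attention implementations.

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """Reweight value features with the attention map and merge them into the
    original feature map through a delta-scaled skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = SANProjections(channels)      # projections from the previous sketch
        # delta starts at 0 so the block initially behaves like an identity mapping (assumption).
        self.delta = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        v, attn = self.proj(x)                    # B x C x N and B x N x N
        # Attention-weighted mixture of value features at every location.
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        # Skip connection with trainable delta: larger delta -> more nonlocal content.
        return x + self.delta * out
```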
We added batch normalization and ReLU after almost every convolution layer to speed
up the training process. We used the L1 loss (mean absolute error) as the criterion.
J_k^x = ∑_{H_k(x)=1}^{W} I_k^x (6)

Following this step, J_k^y and J_k^z can be inferred. In this way, the locations of x, y, z are
separated, so we can output 2D and 3D pose estimation results systematically.
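As an illustration of how the x, y and z locations can be read out separately from heat maps, the following sketch follows the integral (soft-argmax) regression idea of [7]; the exact form of Equation (6) may differ, and dropping the depth term for 2D-only samples is our assumption about how mixed 2D/3D batches could be handled.

```python
import torch

def joints_from_heatmaps(hm: torch.Tensor):
    """hm: B x K x D x H x W volumetric heat maps.
    Normalize once over the whole volume, then take per-axis expectations."""
    b, k, d, h, w = hm.shape
    prob = torch.softmax(hm.view(b, k, -1), dim=-1).view(b, k, d, h, w)
    xs = torch.arange(w, dtype=prob.dtype, device=prob.device)
    ys = torch.arange(h, dtype=prob.dtype, device=prob.device)
    zs = torch.arange(d, dtype=prob.dtype, device=prob.device)
    jx = (prob.sum(dim=(2, 3)) * xs).sum(dim=-1)   # expectation along width  -> x
    jy = (prob.sum(dim=(2, 4)) * ys).sum(dim=-1)   # expectation along height -> y
    jz = (prob.sum(dim=(3, 4)) * zs).sum(dim=-1)   # expectation along depth  -> z
    return jx, jy, jz

# L1 loss on the separated coordinates; for 2D-only samples one could simply
# omit the z term (our assumption for mixed 2D/3D training):
# loss = (jx - gx).abs().mean() + (jy - gy).abs().mean() + has_3d * (jz - gz).abs().mean()
```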
4. Experiment
In this section, we show our experimental results. We evaluated our model on the 3D
Human3.6M dataset [9] and the 2D MPII [10] and COCO [11] datasets.
position error (MPJPE), in millimeters, between the ground truth and the prediction across
all cameras and joints after aligning the depth of the root joints.
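For reference, MPJPE as used here can be computed as in the following sketch, which assumes full root-joint alignment (joint index 0 taken as the root for illustration) before averaging the per-joint Euclidean errors.

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray, root: int = 0) -> float:
    """Mean per joint position error in millimetres.
    pred, gt: (N, K, 3) arrays of predicted / ground-truth joints in mm."""
    # Align both skeletons to their root joints before measuring the error.
    pred_aligned = pred - pred[:, root:root + 1, :]
    gt_aligned = gt - gt[:, root:root + 1, :]
    return float(np.linalg.norm(pred_aligned - gt_aligned, axis=-1).mean())
```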
COCO: The COCO dataset [11] presents imagery data with various human poses,
different body scales and occlusion patterns. The training, validation and test sets contain
more than 200,000 images and 250,000 labeled in-the-wild person instances. In total, 150,000
instances are publicly available for training and validation.
Table 1. Ablative study on Human3.6M. Mean per joint position error (MPJPE) numbers are in mm.
Table 2. Ablative study on computation complexity of models with and without SAN.
Figure 3. Results of joint relationship over human joints with the different values of δ. The first lines are confusion
maps with the change of δ. Human joints show the index of joints. The confusion map represents the feature gain of the
corresponding joint with other joints, and it indicates the constraints and similarities between joints. The second lines are
manual connections with joints based on confusion maps.
Joint learning framework: MPII and COCO datasets provide large-scale 2D key-point
in-the-wild data. Comparing methods {a,b} and {c,d} in Table 1, training with both 2D
and 3D data provides a significant performance gain: MPJPE dropped 5.3 mm when adding
the MPII dataset and 4.8 mm when adopting the COCO dataset. This verifies the effectiveness of
the joint learning framework in our training process.
Network depth: From methods {e,f} in Table 1, the performance is enhanced by a deeper
network. Changing the ResNet network depth, MPJPE can drop by 3.3 mm from ResNet-50
to ResNet-152.
Computation complexity: Table 2 compares the model with and without SAN in terms
of parameter numbers and flops (floating-point operations per second). The parameters of
the original method are 34 M and those of our method are 39 M. The flops of the original method are
14.10 G and those of our method are 14.43 G. The parameters increase by 5 M and the flops increase by 0.33 G
when adding SAN, which leads to MPJPE dropping by 4.3 mm. This verifies that the SAN
model with low computation complexity will achieve a better performance.
Table 3. Comparison of mean per joint position error (mm) in Human3.6M between the estimated
pose and the ground truth. Lower values are better, with the best in bold, and the second best
underlined. (a) Protocol#1: reconstruction error (MPJPE).
Protocol#1: Direct. Discuss Eating Greet Phone Pose Purch. Sitting SittingD. Smoke Photo Wait. Walk WalkD. WalkT. Avg.
CoarseToFine [17]: 67.4 71.9 66.7 69.1 72.0 65.0 68.3 83.7 96.5 71.7 77.0 65.8 59.1 74.9 63.2 71.9
Figure 4. Qualitative results of 3D human pose on Human3.6M dataset. Predicted poses are rotated
and zoomed for the consistency of perspective with original image.
Table 4. Comparison of PCKh@0.5 (%) on MPII. It reports the percentage of detections that fall within a normalized distance
of ground truth. Higher values are better, with the best being indicated by bold font, and the second best being underlined.
Table 5. Comparison results on COCO test-dev. Higher values are better, with the best being
indicated by bold font, and the second best being underlined.
References
1. Li, B.; Dai, Y.; Cheng, X.; Chen, H.; Lin, Y.; He, M. Skeleton based action recognition using translation-scale invariant image
mapping and multi-scale deep CNN. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops
(ICMEW), Hong Kong, China, 10–14 July 2017; pp. 601–604.
2. Luvizon, D.C.; Picard, D.; Tabia, H. 2D/3D Pose Estimation and Action Recognition Using Multitask Deep Learning. In Proceed-
ings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018;
pp. 5137–5146.
3. Li, B.; He, M.; Dai, Y.; Cheng, X.; Chen, Y. 3D skeleton based action recognition by video-domain translation-scale invariant
mapping and multi-scale dilated CNN. Multimed. Tools Appl. 2018, 77, 22901–22921. [CrossRef]
4. Li, B.; Chen, H.; Chen, Y.; Dai, Y.; He, M. Skeleton boxes: Solving skeleton based action detection with a single deep convolutional
neural network. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong
Kong, China, 10–14 July 2017; pp. 613–616.
5. Insafutdinov, E.; Andriluka, M.; Pishchulin, L.; Tang, S.; Levinkov, E.; Andres, B.; Schiele, B. ArtTrack: Articulated Multi-Person
Tracking in the Wild. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu,
HI, USA, 21–26 July 2017; pp. 1293–1301.
6. Chen, Y.; Tian, Y.; He, M. Monocular human pose estimation: A survey of deep learning-based methods. Comput. Vis. Image
Underst. 2020, 192, 102897. [CrossRef]
7. Sun, X.; Xiao, B.; Wei, F.; Liang, S.; Wei, Y. Integral Human Pose Regression. In Proceedings of the European Conference on
Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 536–553.
8. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need.
In Proceedings of the Conference and Workshop on Neural Information Processing Systems (NIPS), Long Beach, CA, USA,
4–9 December 2017.
9. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human
Sensing in Natural Environments. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1325–1339. [CrossRef] [PubMed]
10. Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2D Human Pose Estimation: New Benchmark and State of the Art Analysis.
In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014;
pp. 3686–3693.
11. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects
in context. In Computer Vision—ECCV 2014, Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2014;
pp. 740–755.
12. Newell, A.; Yang, K.; Deng, J. Stacked Hourglass Networks for Human Pose Estimation. In Proceedings of the European
Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 483–499.
13. Chu, X.; Yang, W.; Ouyang, W.; Ma, C.; Yuille, A.L.; Wang, X. Multi-context Attention for Human Pose Estimation. In Proceedings
of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017;
pp. 5669–5678.
14. Sun, G.; Ye, C.; Wang, K. Focus on What’s Important: Self-Attention Model for Human Pose Estimation. arXiv 2018,
arXiv:1809.08371.
15. Sun, X.; Shang, J.; Liang, S.; Wei, Y. Compositional Human Pose Regression. In Proceedings of the 2017 IEEE International
Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2621–2630.
16. Zhao, L.; Peng, X.; Tian, Y.; Kapadia, M.; Metaxas, D.N. Semantic Graph Convolutional Networks for 3D Human Pose Regression.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20
June 2019; pp. 3425–3435.
17. Pavlakos, G.; Zhou, X.; Derpanis, K.G.; Daniilidis, K. Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA,
21–26 July 2017; pp. 1263–1272.
18. Ci, H.; Wang, C.; Ma, X.; Wang, Y. Optimizing Network Structure for 3D Human Pose Estimation. In Proceedings of the 2019
IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 2262–2271.
19. Habibie, I.; Xu, W.; Mehta, D.; Pons-Moll, G.; Theobalt, C. In the Wild Human Pose Estimation Using Explicit 2D Features and
Intermediate 3D Representations. In Proceedings of the International Conference on Computer Vision and Pattern Recognition
(CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10905–10914.
20. Zhou, X.; Huang, Q.X.; Sun, X.; Xue, X.; Wei, Y. Towards 3D Human Pose Estimation in the Wild: A Weakly-Supervised Approach.
In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 398–407.
21. Dabral, R.; Mundhada, A.; Kusupati, U.; Afaque, S.; Sharma, A.; Jain, A. Learning 3D Human Pose from Structure and Motion.
In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 679–696.
22. Ye, L.; Rochan, M.; Liu, Z.; Wang, Y. Cross-Modal Self-Attention Network for Referring Image Segmentation. In Proceedings of
the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019;
pp. 10494–10503.
23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. arXiv 2015, arXiv:1512.03385.
24. Li, S.; Chan, A.B. 3D human pose estimation from monocular images with deep convolutional neural network. In Proceedings
of the Asian Conference on Computer Vision (ACCV); Springer: Berlin/Heidelberg, Germany, 2014; pp. 332–347.
25. Yao, Y.; Ren, J.; Xie, X.; Liu, W.; Liu, Y.-J.; Wang, J. Attention-Aware Multi-Stroke Style Transfer. In Proceedings of the
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019;
pp. 1467–1475.
26. Fang, H.; Xu, Y.; Wang, W.; Liu, X.; Zhu, S.C. Learning Pose Grammar to Encode Human Body Configuration for 3D Human Pose
Estimation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, 2–7 February 2018;
Volume 32. No. 1.
27. Chu, W.T.; Pan, Z.W. Semi-Supervised 3D Human Pose Estimation by Jointly Considering Temporal and Multiview Information.
IEEE Access 2020, 8, 226974–226981. [CrossRef]
28. Lee, K.; Lee, I.; Lee, S. Propagating LSTM: 3D Pose Estimation Based on Joint Interdependency. In Proceedings of the European
Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 123–141.
29. Li, C.; Lee, G.H. Generating Multiple Hypotheses for 3D Human Pose Estimation with Mixture Density Network. In Proceedings
of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019;
pp. 9879–9887.
30. Pavllo, D.; Feichtenhofer, C.; Grangier, D.; Auli, M. 3D Human Pose Estimation in Video with Temporal Convolutions and
Semi-Supervised Training. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7745–7754.
31. Guo, Y.; Chen, Z. Absolute 3D Human Pose Estimation via Weakly-supervised Learning. In Proceedings of the 2020 IEEE
International Conference on Visual Communications and Image Processing (VCIP), Macau, China, 1–4 December 2020;
pp. 273–276.
32. Wandt, B.; Rosenhahn, B. RepNet: Weakly Supervised Training of an Adversarial Reprojection Network for 3D Human Pose
Estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach,
CA, USA, 15–20 June 2019; pp. 7774–7783.
33. Hossain, M.R.I.; Little, J.J. Exploiting Temporal Information for 3D Human Pose Estimation. In Proceedings of the European
Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 69–86.
34. Pavlakos, G.; Zhou, X.; Daniilidis, K. Ordinal Depth Supervision for 3D Human Pose Estimation. In Proceedings of the 2018
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7307–7316.
35. Xiao, B.; Wu, H.; Wei, Y. Simple Baselines for Human Pose Estimation and Tracking. In Proceedings of the European Conference
on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 472–487.
36. Tang, W.; Yu, P.; Wu, Y. Deeply Learned Compositional Models for Human Pose Estimation. In Proceedings of the European
Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 197–214.
37. Yang, S.; Yang, W.; Cui, Z. Pose Neural Fabrics Search. arXiv 2019, arXiv:1909.07068.
38. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of
the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019;
pp. 5686–5696.
39. Huo, Z.; Jin, H.; Qiao, Y.; Luo, F. Deep High-resolution Network with Double Attention Residual Blocks for Human Pose
Estimation. IEEE Access 2020, 8, 1. [CrossRef]
40. Jun, J.; Lee, J.H.; Kim, C.S. Human Pose Estimation Using Skeletal Heatmaps. In Proceedings of the 2020 Asia-Pacific Signal and
Information Processing Association Annual Summit and Conference (APSIPA ASC), Auckland, New Zealand, 7–10 December
2020; pp. 1287–1292.
41. Tang, Z.; Peng, X.; Geng, S.; Zhu, Y.; Metaxas, D.N. CU-Net: Coupled U-Nets. In Proceedings of the British Machine Vision
Conference (BMVC), Newcastle, UK, 3–6 September 2018.
42. Tang, Z.; Peng, X.; Geng, S.; Wu, L.; Zhang, S.; Metaxas, D. Quantized Densely Connected U-Nets for Efficient Landmark
Localization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018;
pp. 348–364.
43. Ning, G.; Zhang, Z.; He, Z. Knowledge-Guided Deep Fractal Neural Networks for Human Pose Estimation. IEEE Trans.
Multimedia 2018, 20, 1246–1259. [CrossRef]
44. Zhang, W.; Fang, J.; Wang, X.; Liu, W. Efficientpose: Efficient human pose estimation with neural architecture search. arXiv 2020,
arXiv:2012.07086.