Multi-Stage Attention-Enhanced Sparse Graph Convolutional Network for Skeleton-Based Action Recognition
Abstract
1. Introduction
- We propose a new neighborhood partition strategy that constructs four subgraphs, each with a different pattern of joint connections.
- We introduce a part attention module that learns an activation parameter for each body part and performs weighted feature fusion (a schematic sketch follows this list).
- We propose a new network architecture that integrates streams from different stages at specific layers of the network.
- Our final model achieves state-of-the-art performance on two large-scale skeleton-based action recognition datasets.
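To illustrate the second contribution, the following is a minimal sketch of a part attention module, assuming an SE-style scheme: features are pooled within each body-part group, a small MLP produces a scalar activation per part, and the joints of that part are reweighted before fusion. The class name `PartAttention`, the shared MLP, and the grouping of joints into parts are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class PartAttention(nn.Module):
    """Illustrative part attention: pool features within each body part,
    learn a scalar activation per part, and reweight that part's joints."""
    def __init__(self, channels, parts):
        super().__init__()
        # parts: list of joint-index lists, e.g. torso, two arms, two legs
        self.parts = parts
        hidden = max(channels // 4, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (N, C, T, V) skeleton features
        out = x.clone()
        for joints in self.parts:
            part = x[:, :, :, joints]               # (N, C, T, |part|)
            pooled = part.mean(dim=(2, 3))          # global pooling over time and joints
            alpha = self.sigmoid(self.fc(pooled))   # (N, 1) activation for this part
            out[:, :, :, joints] = part * alpha.view(-1, 1, 1, 1)
        return out
```

A typical call would be `PartAttention(64, parts=[torso, left_arm, right_arm, left_leg, right_leg])`, with each part given as a list of joint indices for the dataset's skeleton.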
2. Related Work
2.1. Skeleton-Based Action Recognition
2.2. Graph Convolutional Neural Network
3. Background
3.1. Skeleton Graph Construction
3.2. Graph-Based Convolution
3.3. Attention-Enhanced Adaptive GCNs
4. Method
4.1. Construction of Spatial Graph
4.2. Part Attention Module
4.3. Graph Convolutional Block
4.4. Network Architecture
5. Experiments
5.1. Datasets
5.2. Training Details
5.3. Ablation Studies
5.3.1. Partition Strategy
5.3.2. Part Attention Module
5.3.3. Multi-Stage Streams Network
5.3.4. Bone Information
5.4. Comparison with the State-of-the-Arts
6. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Rasouli, A.; Yau, T.; Lakner, P.; Malekmohammadi, S.; Rohani, M.; Luo, J. PePScenes: A Novel Dataset and Baseline for Pedestrian Action Prediction in 3D. arXiv 2020, arXiv:2012.07773.
- Kong, Y.; Fu, Y. Max-Margin Action Prediction Machine. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 1844–1858.
- Song, Z.; Yin, Z.; Yuan, Z.; Zhang, C.; Zhang, S. Attention-Oriented Action Recognition for Real-Time Human-Robot Interaction. arXiv 2020, arXiv:2007.01065.
- Koppula, H.S.; Saxena, A. Anticipating Human Activities Using Object Affordances for Reactive Robotic Response. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 14–29.
- Zhao, Z.; Chen, G.; Chen, C.; Li, X.; Su, F. Instance-Based Video Search via Multi-Task Retrieval and Re-Ranking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshop, Seoul, Korea, 27 October–2 November 2019.
- Ciptadi, A.; Goodwin, M.S.; Rehg, J.M. Movement Pattern Histogram for Action Recognition and Retrieval. In The European Conference on Computer Vision (ECCV); Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 695–710.
- Singh, S.; Velastin, S.A.; Ragheb, H. MuHAVi: A Multicamera Human Action Video Dataset for the Evaluation of Action Recognition Methods. In Proceedings of the 2010 7th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Boston, MA, USA, 29 August–1 September 2010; pp. 48–55.
- Gul, M.A.; Yousaf, M.H.; Nawaz, S.; Ur Rehman, Z.; Kim, H. Patient Monitoring by Abnormal Human Activity Recognition Based on CNN Architecture. Electronics 2020, 9, 1993.
- Du, Y.; Wang, W.; Wang, L. Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
- Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
- Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition. In The European Conference on Computer Vision (ECCV); Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 816–833.
- Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View Adaptive Recurrent Neural Networks for High Performance Human Action Recognition From Skeleton Data. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
- Zheng, W.; Li, L.; Zhang, Z.; Huang, Y.; Wang, L. Relational Network for Skeleton-Based Action Recognition. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 826–831.
- Kim, T.S.; Reiter, A. Interpretable 3D Human Action Analysis with Temporal Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Honolulu, HI, USA, 21–26 July 2017.
- Liu, M.; Liu, H.; Chen, C. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognit. 2017, 68, 346–362.
- Li, B.; Dai, Y.; Cheng, X.; Chen, H.; Lin, Y.; He, M. Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME) Workshops, Hong Kong, China, 10–14 July 2017; pp. 601–604.
- Yan, S.; Xiong, Y.; Lin, D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. arXiv 2018, arXiv:1801.07455.
- Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The Kinetics Human Action Video Dataset. arXiv 2017, arXiv:1705.06950.
- Vemulapalli, R.; Arrate, F.; Chellappa, R. Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 588–595.
- Wang, J.; Liu, Z.; Wu, Y.; Yuan, J. Mining actionlet ensemble for action recognition with depth cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 1290–1297.
- Xia, L.; Chen, C.C.; Aggarwal, J.K. View invariant human action recognition using histograms of 3D joints. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Providence, RI, USA, 16–21 June 2012; pp. 20–27.
- Anvarov, F.; Kim, D.H.; Song, B.C. Action Recognition Using Deep 3D CNNs with Sequential Feature Aggregation and Attention. Electronics 2020, 9, 147.
- Duvenaud, D.K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; Adams, R.P. Convolutional Networks on Graphs for Learning Molecular Fingerprints. In Advances in Neural Information Processing Systems 28; Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; pp. 2224–2232.
- Niepert, M.; Ahmed, M.; Kutzkov, K. Learning Convolutional Neural Networks for Graphs. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016.
- Hamilton, W.L.; Ying, R.; Leskovec, J. Inductive Representation Learning on Large Graphs. arXiv 2017, arXiv:1706.02216.
- Monti, F.; Boscaini, D.; Masci, J.; Rodola, E.; Svoboda, J.; Bronstein, M.M. Geometric Deep Learning on Graphs and Manifolds Using Mixture Model CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Kipf, T.; Fetaya, E.; Wang, K.C.; Welling, M.; Zemel, R. Neural Relational Inference for Interacting Systems. arXiv 2018, arXiv:1802.04687.
- Shuman, D.I.; Narang, S.K.; Frossard, P.; Ortega, A.; Vandergheynst, P. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Process. Mag. 2013, 30, 83–98.
- Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral Networks and Locally Connected Networks on Graphs. arXiv 2013, arXiv:1312.6203.
- Henaff, M.; Bruna, J.; LeCun, Y. Deep Convolutional Networks on Graph-Structured Data. arXiv 2015, arXiv:1506.05163.
- Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907.
- Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Advances in Neural Information Processing Systems 29; Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; pp. 3844–3852.
- Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019.
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019.
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-Based Action Recognition with Multi-Stream Adaptive Graph Convolutional Networks. arXiv 2019, arXiv:1912.06971.
- Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; Bengio, Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; Bach, F., Blei, D., Eds.; PMLR: Lille, France, 2015; pp. 2048–2057.
- Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual Attention Network for Image Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
- Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Vedaldi, A. Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks. arXiv 2018, arXiv:1810.12348.
- Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. In Proceedings of the Conference and Workshop on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 3–9 December 2017.
- Tang, Y.; Tian, Y.; Lu, J.; Li, P.; Zhou, J. Deep Progressive Reinforcement Learning for Skeleton-Based Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
- Li, C.; Zhong, Q.; Xie, D.; Pu, S. Co-Occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), Stockholm, Sweden, 13–19 July 2018.
- Li, B.; Li, X.; Zhang, Z.; Wu, F. Spatio-Temporal Graph Routing for Skeleton-Based Action Recognition. Proc. AAAI Conf. Artif. Intell. 2019, 33, 8561–8568.
- Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-Based Action Recognition with Shift Graph Convolutional Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
- Fernando, B.; Gavves, E.; Oramas, J.M.; Ghodrati, A.; Tuytelaars, T. Modeling Video Evolution for Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
Category | Method | Pros | Cons
---|---|---|---
Manual | Actionlet Ensemble [20] | Uses depth data. | Sensitive to noise.
Manual | HOJ3D [21] | Uses histograms of 3D joints. | Poor portability.
Manual | Lie Group [19] | Models body parts more finely. | Limited to small datasets.
RNNs | HBRNN [9] | Models temporal evolution. | Prone to overfitting.
RNNs | ST-LSTM [11] | Performs spatiotemporal analysis. | Low recognition accuracy.
RNNs | ARRN-LSTM [13] | Achieves higher performance. | Complex network structure.
CNNs | TCN [14] | Re-designs the TCN. | Limited temporal information.
CNNs | Synthesized CNN [15] | Enhances visualization. | Not flexible enough.
CNNs | 3scale ResNet152 [16] | Can use pre-trained CNNs. | Heavy computational cost.
GCNs | ST-GCN [17] | Models actions as graphs. | Ignores long-range links.
GCNs | AS-GCN [33] | Explores actional-structural links. | Complex network.
GCNs | 2s-AGCN [34] | Increases model flexibility. | Simple temporal-domain modeling.
Methods | X-Sub (%) | X-View (%) |
---|---|---|
baseline (AAGCN) [35] | 88.0 | 95.1 |
baseline (our reproduction) [35] | 87.75 | 94.83
SGCN-T | 88.55 | 95.11 |
SGCN-E | 88.47 | 95.26 |
SGCN-T+E | 88.68 | 95.39 |
Method | X-Sub (%) | X-View (%) |
---|---|---|
baseline (AAGCN) [35] | 88.0 | 95.1 |
baseline (our reproduction) [35] | 87.75 | 94.83
AGCN (2&5&8-th) | 88.25 | 95.13 |
AGCN (3&6&9-th) | 88.31 | 95.17 |
AGCN (4&7&10-th) | 88.19 | 95.10 |
AGCN (3&8-th) | 88.44 | 95.31 |
AGCN (4&9-th) | 88.29 | 95.25 |
AGCN (5&10-th) | 88.26 | 95.12 |
Method | X-Sub (%) | X-View (%) |
---|---|---|
new baseline (ASGCN) | 88.93 | 95.50 |
MS-ASGCN (T = 3) | 89.47 | 95.83 |
MS-ASGCN (T = 4) | 89.69 | 95.87 |
MS-ASGCN (T = 5) | 89.80 | 95.94 |
MS-ASGCN (T = 6) | 89.87 | 95.97 |
MS-ASGCN (T = 7) | 89.58 | 95.82 |
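The stage count T in the table above suggests that classification streams are tapped at T points along the network. The following is a rough sketch of that idea, assuming each stage stream is a globally pooled feature taken after a chosen graph convolutional block and passed through its own classifier head; the class name, block boundaries, and head design are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiStageHeads(nn.Module):
    """Illustrative multi-stage streams: tap features after chosen GCN blocks,
    pool them, and give each stage its own classifier head."""
    def __init__(self, blocks, stage_ids, channels, num_classes):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)   # stacked graph convolutional blocks
        self.stage_ids = set(stage_ids)       # indices of blocks that end a stage
        self.heads = nn.ModuleDict({
            str(i): nn.Linear(channels[i], num_classes) for i in stage_ids
        })

    def forward(self, x):
        # x: (N, C, T, V) skeleton features; returns one score tensor per stage stream
        scores = []
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i in self.stage_ids:
                pooled = x.mean(dim=(2, 3))           # global average over time and joints
                scores.append(self.heads[str(i)](pooled))
        return scores                                  # list of (N, num_classes)
```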
w1 | w2 | w3 | w4 | w5 | X-Sub (%) | X-View (%)
---|---|---|---|---|---|---
0 | 0 | 0 | 0 | 1.0 | 89.29 | 95.65
0.1 | 0.1 | 0.1 | 0.1 | 0.6 | 89.73 | 95.79
0.1 | 0.1 | 0.1 | 0.2 | 0.5 | 89.80 | 95.94
0.1 | 0.1 | 0.15 | 0.2 | 0.45 | 89.68 | 95.87
0.1 | 0.15 | 0.2 | 0.25 | 0.3 | 89.51 | 95.80
0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 89.33 | 95.72
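The columns w1–w5 above appear to be the fusion weights of the five stage streams (T = 5). Under that reading, the fusion step can be sketched as a weighted sum of per-stage class scores; the softmax normalization and the function below are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def fuse_stage_scores(stage_scores, weights):
    """Weighted score-level fusion of multi-stage streams (illustrative).

    stage_scores: list of (N, num_classes) logit tensors, one per stage stream.
    weights: list of floats, e.g. [0.1, 0.1, 0.1, 0.2, 0.5] as in the best row above.
    """
    assert len(stage_scores) == len(weights)
    # Convert each stream's logits to probabilities, then take the weighted sum.
    fused = sum(w * s.softmax(dim=1) for w, s in zip(weights, stage_scores))
    return fused  # argmax over dim 1 gives the predicted class per sample
```

For the best row above, `weights = [0.1, 0.1, 0.1, 0.2, 0.5]`, i.e., later stages contribute more to the final score.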
Methods | X-Sub (%) | X-View (%) |
---|---|---|
MS-ASGCN (Js) | 89.80 | 95.94 |
MS-ASGCN (Bs) | 89.92 | 95.61 |
MS-ASGCN | 90.87 | 96.53 |
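Js and Bs above presumably denote the joint and bone streams. In the cited two-stream line of work (2s-AGCN [34]), the bone input is the vector difference between each joint and its adjacent parent joint in the skeleton; the sketch below follows that convention, with the parent list left as a dataset-specific, illustrative parameter.

```python
import torch

def joints_to_bones(joints, parents):
    """Compute bone vectors as differences between each joint and its parent.

    joints:  (N, C, T, V) joint coordinates (C = 3 for x, y, z).
    parents: length-V list giving the parent joint index of each joint
             (the root joint can point to itself, yielding a zero bone).
    """
    bones = torch.zeros_like(joints)
    for v, p in enumerate(parents):
        bones[:, :, :, v] = joints[:, :, :, v] - joints[:, :, :, p]
    return bones
```

The joint-stream and bone-stream scores are then combined (for example, summed or averaged) to produce the final MS-ASGCN result in the last row.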
Methods | X-Sub (%) | X-View (%) |
---|---|---|
Lie Group [19] | 50.1 | 82.8 |
HBRNN [9] | 59.1 | 64.0 |
ST-LSTM [11] | 69.2 | 77.7 |
VA-LSTM [12] | 79.2 | 87.7 |
ARRN-LSTM [13] | 80.7 | 88.8 |
TCN [14] | 74.3 | 83.1 |
Synthesized CNN [15] | 80.0 | 87.2 |
3scale ResNet152 [16] | 85.0 | 92.3 |
ST-GCN [17] | 81.5 | 88.3 |
DPRL [41] | 83.5 | 89.8 |
HCN [42] | 86.5 | 91.1 |
STGR-GCN [43] | 86.9 | 92.3 |
AS-GCN [33] | 86.8 | 94.2 |
2s-AGCN [34] | 88.5 | 95.1 |
MS-AAGCN [35] | 90.0 | 96.2 |
Shift-GCN [44] | 90.7 | 96.5 |
MS-ASGCN (Ours) | 90.9 | 96.5 |