
SCN: Dilated silhouette convolutional network for video action recognition

Published: 01 February 2021

Abstract

Human action is a spatio-temporal motion sequence in which strong inter-dependencies exist between the spatial geometry and the temporal dynamics of motion. However, the existing literature on human action recognition from video lacks a synergistic treatment of spatial geometry and temporal dynamics in a joint representation and embedding space. In this paper, we propose a dilated Silhouette Convolutional Network (SCN) for action recognition from a monocular video. We model the spatial geometric information of the moving human subject using silhouette boundary curves extracted from each frame of the motion video. The silhouette curves are stacked along the time axis to form a 3D curve volume, which is resampled to a 3D point cloud as a unified spatio-temporal representation of the video action. With the dilated silhouette convolution, the SCN learns co-occurrence features jointly from low-level geometric shape boundaries and their temporal dynamics, constructing a unified convolutional embedding space in which spatial and temporal properties are integrated effectively. The geometry-based SCN significantly improves the discrimination of features learned from the shape motions. Experimental results on the JHMDB, HMDB, and UCF101 datasets demonstrate the effectiveness and superiority of the proposed representation and deep learning method.
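
The representation the abstract describes, per-frame silhouette boundary extraction, stacking along the time axis, and resampling to a fixed-size point cloud, can be illustrated with a short sketch. The Python code below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes binary person masks are already available (e.g., from a segmentation network), it uses OpenCV (version 4 or later) contour tracing for the silhouette curves, and it uses a simple greedy farthest point sampling as one common choice for the resampling step; all function names are hypothetical.

    import numpy as np
    import cv2  # assumed dependency; OpenCV >= 4 API for findContours

    def silhouette_points(mask, t):
        # Trace the outer boundary curve of one binary silhouette mask and
        # lift it to 3D by appending the frame index t as a third coordinate.
        contours, _ = cv2.findContours(mask.astype(np.uint8),
                                       cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
        xy = np.concatenate([c.reshape(-1, 2) for c in contours], axis=0)
        t_col = np.full((len(xy), 1), t, dtype=np.float32)
        return np.hstack([xy.astype(np.float32), t_col])

    def farthest_point_sampling(points, n_samples):
        # Greedy FPS: repeatedly pick the point farthest from those chosen so
        # far, giving even coverage of the curve volume with a fixed budget.
        chosen = [0]
        dists = np.linalg.norm(points - points[0], axis=1)
        for _ in range(n_samples - 1):
            idx = int(np.argmax(dists))
            chosen.append(idx)
            dists = np.minimum(dists, np.linalg.norm(points - points[idx], axis=1))
        return points[chosen]

    def video_to_point_cloud(masks, n_samples=2048):
        # Stack the per-frame silhouette curves along the time axis into a
        # 3D curve volume, then resample it to a fixed-size (x, y, t) cloud.
        curves = [silhouette_points(m, t) for t, m in enumerate(masks)]
        volume = np.concatenate(curves, axis=0)
        return farthest_point_sampling(volume, n_samples)

The resulting n_samples x 3 array of (x, y, t) points is the kind of unified spatio-temporal input that a point-based convolutional network such as the SCN would then consume; the paper's dilated silhouette convolution itself is not sketched here.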

    Published In

    Computer Aided Geometric Design, Volume 85, Issue C
    February 2021
    127 pages

    Publisher

    Elsevier Science Publishers B. V.

    Netherlands

    Author Tags

    1. Silhouette convolutional network (SCN)
    2. Spatio-temporal representation
    3. Geometric computing
    4. Video action recognition
    5. Deep learning
    6. Artificial intelligence

    Qualifiers

    • Research-article
