
SCN: Dilated silhouette convolutional network for video action recognition

Published: 01 February 2021

Abstract

Human action is a spatio-temporal motion sequence in which strong inter-dependencies exist between the spatial geometry and the temporal dynamics of motion. However, the existing literature on human action recognition from video lacks a synergistic treatment of spatial geometry and temporal dynamics in a joint representation and embedding space. In this paper, we propose a dilated Silhouette Convolutional Network (SCN) for action recognition from a monocular video. We model the spatial geometric information of the moving human subject using silhouette boundary curves extracted from each frame of the motion video. The silhouette curves are stacked along the time axis to form a 3D curve volume, which is resampled to a 3D point cloud as a unified spatio-temporal representation of the video action. With the dilated silhouette convolution, the SCN learns co-occurrence features jointly from low-level geometric shape boundaries and their temporal dynamics, constructing a unified convolutional embedding space in which spatial and temporal properties are integrated effectively. The geometry-based SCN significantly improves the discrimination of features learned from the shape motions. Experimental results on the JHMDB, HMDB, and UCF101 datasets demonstrate the effectiveness and superiority of the proposed representation and deep learning method.
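
The representation the abstract describes, per-frame silhouette boundary extraction, stacking along the time axis, and resampling to a fixed-size point cloud, can be illustrated with a short sketch. The Python code below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes binary person masks are already available (e.g., from a segmentation network), it uses OpenCV (version 4 or later) contour tracing for the silhouette curves, and it uses a simple greedy farthest point sampling as one common choice for the resampling step; all function names are hypothetical.

    import numpy as np
    import cv2  # assumed dependency; OpenCV >= 4 API for findContours

    def silhouette_points(mask, t):
        # Trace the outer boundary curve of one binary silhouette mask and
        # lift it to 3D by appending the frame index t as a third coordinate.
        contours, _ = cv2.findContours(mask.astype(np.uint8),
                                       cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
        xy = np.concatenate([c.reshape(-1, 2) for c in contours], axis=0)
        t_col = np.full((len(xy), 1), t, dtype=np.float32)
        return np.hstack([xy.astype(np.float32), t_col])

    def farthest_point_sampling(points, n_samples):
        # Greedy FPS: repeatedly pick the point farthest from those chosen so
        # far, giving even coverage of the curve volume with a fixed budget.
        chosen = [0]
        dists = np.linalg.norm(points - points[0], axis=1)
        for _ in range(n_samples - 1):
            idx = int(np.argmax(dists))
            chosen.append(idx)
            dists = np.minimum(dists, np.linalg.norm(points - points[idx], axis=1))
        return points[chosen]

    def video_to_point_cloud(masks, n_samples=2048):
        # Stack the per-frame silhouette curves along the time axis into a
        # 3D curve volume, then resample it to a fixed-size (x, y, t) cloud.
        curves = [silhouette_points(m, t) for t, m in enumerate(masks)]
        volume = np.concatenate(curves, axis=0)
        return farthest_point_sampling(volume, n_samples)

The resulting n_samples x 3 array of (x, y, t) points is the kind of unified spatio-temporal input that a point-based convolutional network such as the SCN would then consume; the paper's dilated silhouette convolution itself is not sketched here.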

    Published In

    Computer Aided Geometric Design, Volume 85, Issue C
    February 2021
    127 pages

    Publisher

    Elsevier Science Publishers B. V.

    Netherlands

    Author Tags

    1. Silhouette convolutional network (SCN)
    2. Spatio-temporal representation
    3. Geometric computing
    4. Video action recognition
    5. Deep learning
    6. Artificial intelligence

    Qualifiers

    • Research-article
