DOI: 10.1145/3132734.3132739

PKU-MMD: A Large Scale Benchmark for Skeleton-Based Human Action Understanding

Published: 23 October 2017

Abstract

Although many 3D human activity benchmarks have been proposed, most existing action datasets focus on action recognition in pre-segmented videos. Standard large-scale benchmarks are lacking, especially for today's data-hungry deep-learning methods. In this paper, we introduce a new large-scale benchmark (PKU-MMD) for continuous skeleton-based human action understanding; it covers a wide range of complex human activities with well-annotated information. PKU-MMD contains 1,076 long video sequences in 51 action categories, performed by 66 subjects in three camera views. It contains almost 20,000 action instances and 5.4 million frames in total. Our dataset also provides multi-modality data sources, including RGB, depth, infrared radiation, and skeleton. To the best of our knowledge, it is the largest skeleton-based detection dataset so far. We conduct extensive experiments and evaluate different methods on this dataset. We believe this large-scale dataset will benefit future research on action detection in the community.
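For readers who want to work with the benchmark, the sketch below shows one plausible way to pair a continuous skeleton sequence with its temporal annotations for the detection task. It is a minimal sketch, assuming the commonly described release format: one text line of 150 floats per frame (two subjects × 25 joints × 3 coordinates, zero-padded when a subject is absent) and comma-separated label lines of action id, start frame, end frame, and confidence. The file names are hypothetical, and this is not an official loader.

```python
import numpy as np

# Hypothetical paths; the actual layout depends on how the release is unpacked.
SKELETON_FILE = "PKU_Skeleton_Renew/0002-L.txt"
LABEL_FILE = "Train_Label_PKU_final/0002-L.txt"

def load_skeleton(path):
    """Load one continuous sequence as (num_frames, 2, 25, 3).

    Assumes each line holds 150 floats: 3D coordinates of 25 joints
    for up to two subjects, zero-padded when only one is present.
    """
    data = np.atleast_2d(np.loadtxt(path))    # (num_frames, 150)
    return data.reshape(len(data), 2, 25, 3)

def load_intervals(path):
    """Parse temporal annotations as (action_id, start_frame, end_frame)."""
    intervals = []
    with open(path) as f:
        for line in f:
            fields = line.strip().split(",")
            if len(fields) >= 3:               # skip blank/malformed lines
                action, start, end = (int(float(v)) for v in fields[:3])
                intervals.append((action, start, end))
    return intervals

skeleton = load_skeleton(SKELETON_FILE)
for action, start, end in load_intervals(LABEL_FILE):
    clip = skeleton[start:end]   # one labeled action instance from the long sequence
    print(f"action {action}: {clip.shape[0]} frames")
```

Cropping each annotated interval this way turns the continuous detection data back into segmented recognition clips, which is convenient for sanity-checking a model before tackling the full untrimmed-detection setting.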




    Published In

    VSCC '17: Proceedings of the Workshop on Visual Analysis in Smart and Connected Communities
    October 2017
    58 pages
    ISBN:9781450355063
    DOI:10.1145/3132734

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 23 October 2017


    Author Tags

    1. action detection
    2. skeleton-based action understanding
    3. video analysis
    4. video benchmark

    Qualifiers

    • Research-article

    Funding Sources

    • CCF-Tencent Open Research Fund
    • Microsoft Research Asia

    Conference

    MM '17: ACM Multimedia Conference
    October 23, 2017
    Mountain View, California, USA

    Acceptance Rates

    VSCC '17 Paper Acceptance Rate: 6 of 12 submissions (50%)
    Overall Acceptance Rate: 6 of 12 submissions (50%)
