FT-HID: a large-scale RGB-D dataset for first- and third-person human interaction analysis

Published: 07 October 2022

Abstract

Human interaction analysis is an important topic within human motion analysis. It has been studied using either first-person vision (FPV) or third-person vision (TPV); however, joint learning across both types of vision has so far attracted little attention. One reason is the lack of suitable datasets that cover both FPV and TPV. In addition, existing benchmark datasets for either FPV or TPV have several limitations, including limited numbers of samples, participant subjects, interaction categories, and modalities. In this work, we contribute a large-scale human interaction dataset, the FT-HID dataset. FT-HID contains pair-aligned samples of first-person and third-person vision. The dataset was collected from 109 distinct subjects and comprises more than 90K samples across three modalities. It has been validated using several existing action recognition methods. In addition, we introduce a novel multi-view interaction mechanism for skeleton sequences and a joint learning multi-stream framework for first-person and third-person vision. Both methods yield promising results on the FT-HID dataset. We expect that this vision-aligned, large-scale dataset will promote the development of both FPV and TPV methods, and of techniques for their joint learning, in human action analysis.
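
The abstract names a joint learning multi-stream framework for FPV and TPV but does not describe it here, so the following is only a minimal, hypothetical sketch of what joint learning over pair-aligned first- and third-person clips can look like: one video backbone per view, with the two feature vectors concatenated and fed to a shared classifier. The module names, the toy 3D-conv backbone, concatenation-based fusion, and all shapes and class counts are illustrative assumptions, not the paper's actual architecture.

```python
# Illustrative sketch only: FT-HID's real framework is not detailed in
# the abstract, so backbone, fusion, and shapes below are assumptions.
import torch
import torch.nn as nn


class JointFPVTPVNet(nn.Module):
    """Two-stream network jointly trained on pair-aligned
    first-person (FPV) and third-person (TPV) RGB clips."""

    def __init__(self, num_classes: int, feat_dim: int = 512):
        super().__init__()

        # One lightweight 3D-conv backbone per view; weights are not
        # shared, since FPV and TPV footage differ strongly in viewpoint.
        def backbone() -> nn.Sequential:
            return nn.Sequential(
                nn.Conv3d(3, 64, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool3d(1),   # global spatio-temporal pooling
                nn.Flatten(),              # (B, 64, 1, 1, 1) -> (B, 64)
                nn.Linear(64, feat_dim),
            )

        self.fpv_stream = backbone()
        self.tpv_stream = backbone()
        # Joint head consumes the concatenated per-view features.
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, fpv_clip: torch.Tensor, tpv_clip: torch.Tensor) -> torch.Tensor:
        # Each clip: (batch, channels=3, frames, height, width)
        f = self.fpv_stream(fpv_clip)
        t = self.tpv_stream(tpv_clip)
        return self.classifier(torch.cat([f, t], dim=1))


# Smoke test on random tensors shaped like short RGB clips.
model = JointFPVTPVNet(num_classes=10)
fpv = torch.randn(2, 3, 8, 112, 112)
tpv = torch.randn(2, 3, 8, 112, 112)
print(model(fpv, tpv).shape)  # torch.Size([2, 10])
```

Separate per-view backbones are used in this sketch because first- and third-person recordings have very different visual statistics; the paper's actual fusion strategy, modalities, and backbones may well differ.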

Cited By

  • (2024) Human action recognition using multi-stream attention-based deep networks with heterogeneous data from overlapping sub-actions. Neural Computing and Applications 36(18):10681–10697. https://doi.org/10.1007/s00521-024-09630-0. Online publication date: 1 June 2024.


Published In

Neural Computing and Applications, Volume 35, Issue 2 (January 2023), 938 pages.
ISSN: 0941-0643
EISSN: 1433-3058

Publisher

Springer-Verlag, Berlin, Heidelberg

Publication History

Received: 28 February 2022
Accepted: 06 September 2022
Published: 07 October 2022

Author Tags

1. Human interaction
2. First-person vision
3. Third-person vision
4. Large-scale dataset

Qualifiers

• Research-article
