This is a postprint version of the following conference paper:
Ramirez, H., et al. (2021, March). Fall Detection using Human Skeleton Features. In: 11th International Conference on Pattern Recognition Systems (ICPRS-21), 17-19 March 2021, Curicó, Chile. URL: http://www.icprs.org/
© Institution of Engineering and Technology, 2021. When the final version is published, the copy of record will be available at the IET Digital Library.
Fall Detection using Human Skeleton Features
Heilym Ramirez 1, Sergio A. Velastin 2,3, Ernesto Fabregas 4, Ignacio Meza 1, Dimitrios Makris 5, and Gonzalo Farias 1

1 Escuela de Ingeniería Eléctrica, Pontificia Universidad Católica de Valparaíso, Av. Brasil 2147, 2362804 Valparaíso, Chile
2 School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4NS, UK
3 Department of Computer Science and Engineering, Universidad Carlos III de Madrid, 28911 Leganés, Madrid, Spain
4 Departamento de Informática y Automática, Universidad Nacional de Educación a Distancia, Juan del Rosal 16, 28040 Madrid, Spain
5 Faculty of Science, Engineering and Computing, Kingston University, London SW15 3DW, UK
Keywords: fall detection, human skeleton, pose estimation,
computer vision.
Abstract
Falls are a leading cause of death and serious injury, especially among older people. Fall accidents also have a direct economic cost to healthcare systems and an indirect impact on society's productivity. Among the most significant problems in fall detection systems are privacy concerns, the limitations of operating devices, and the comparison of machine learning techniques for detection. This article presents a camera-vision fall detection system based on a k-Nearest Neighbors (KNN) classifier that uses pose detection of the human skeleton for feature extraction. The proposed method is evaluated on the UP-FALL dataset, surpassing the results of other fall detection systems that use the same database. The method achieves 98.84% accuracy and an F1-Score of 97.41%.
1 Introduction
The World Health Organization (WHO) suggests that each year approximately 37.3 million falls are serious enough to require immediate medical attention, and that falls are one of the leading causes of fatal injuries, making them a major global public health problem [1].
Although older people are most at risk of serious and fatal fall injuries, children and infants are also a high-risk group when they suffer fall injuries [1]. Detecting falls has therefore become a field of interest for many researchers. Falls involve a significant direct economic cost to healthcare systems, in terms of both hospital and long-term care costs, besides the resulting indirect costs [2]. This field of research has given rise to the concept of intelligent environments, which help to create assistance and monitoring systems in relevant settings.
Fall detection systems have mainly focused on wearable sensors, environmental sensors or vision devices [3]. Context-sensitive systems include all systems that use sensor technology implemented in the environment, such as cameras, motion capture devices and Kinect. These have the advantage of not being invasive, allowing a more natural context, and it is easier to implement a system with cameras to identify falls in a given population than to put a sensor or body device on each individual. In addition, the use of cameras is generally cheaper. The important challenges and issues identified by most authors include concerns about privacy, intrusion and operating device limitations, as well as the difficulty of comparing between techniques.
Nowadays, computer vision research focused on fall detection is a popular topic [4–8], where issues such as improving accuracy and reducing computational complexity are among the main challenges. Vision-based systems apply image processing techniques to video frames or images captured from cameras. Machine learning (ML) models can be applied on top of image processing techniques to permit more precise fall detection. Fall detection systems use the best known supervised learning techniques, namely: MLP (multi-layer perceptrons) [9], SVM (support vector machines) [10], HMM (hidden Markov models), decision trees (e.g. random forests), KNN (k-Nearest Neighbors) [11] and CNN (convolutional neural networks) [12]. Zerrouki et al. [13] present a comparative study of ML models for fall detection, using as input video sequences of different falls and daily activities. Peng et al. [14] study the signals from a Kinect camera and process point cloud images to detect falls and reduce false positives.
Recently, fall detection applying deep learning techniques has become an active area of research. Lu et al. [7] apply a CNN and an LSTM to extract features from video sequences of environmental data. In [15], convolutional neural networks (CNNs) are trained on different sets of optical flow image data, which helps the network detect different actions.
This article presents a method that detects falls using only images from a standard video camera, without the need for environmental or depth sensors, and that significantly outperforms the results obtained with the best ML model (KNN) in [16]. Fall detection is carried out using pose estimation of the human skeleton for feature extraction. A main contribution of this method is that the human skeleton can be detected by any suitable pose estimation algorithm. The proposed approach has been validated with the multi-modal public dataset presented in [16], and a machine learning algorithm has been tested. The results exceed those obtained by other authors
with the same data set by a significant margin [16, 17]. The
article is organized as follows. In Section 2, a description of the UP-FALL fall detection dataset is presented. Section 3 details the proposed fall detection approach and explains AlphaPose, the human skeleton pose detector used as a feature extraction method to recognize activities. Section 4 shows the experimental results and a comparison with previous results on the same dataset. Finally, Section 5 summarizes the conclusions and future research work.
2 Dataset
The work presented here uses UP-FALL as an experimental dataset to evaluate results and compare them with the original work [16], which uses the same dataset.
The dataset includes 12 activities performed by 17 healthy young human subjects. Subjects performed 5 different types of falls (forward falls using the hands, forward falls using the knees, backward falls, falls sitting on an empty chair, and sideways falls) and 7 daily human activities (walking, standing, picking up an object, sitting, jumping, laying down and kneeling). Table 1 shows the 12 activities.
Activity ID   Description
1             Forward falls using the hands
2             Forward falls using the knees
3             Backward falls
4             Sideways falls
5             Sitting falls on an empty chair
6             Walk
7             Stand
8             Sit
9             Picking up an object
10            Jump
11            Lay down
20            Unknown activity

Table 1. UP-FALL activities.
The data distribution between daily activities and falls is 77.55% and 22.45%, respectively. The instances of non-fall activities outnumber falls so as to make the dataset more representative of the sporadic nature of fall events. The dataset is multimodal, with data captured using five Mbientlab Meta Sensor wearable sensors (IMU), one NeuroSky MindWave electroencephalograph (EEG) headset and six infrared sensors (IR). It also includes data obtained from two Microsoft LifeCam Cinema cameras (CAM) placed 1.82 m above the ground, one for a side view and the other for a front view.
One of the challenges presented by UP-FALL is that of fall detection as a binary classification problem, i.e. to distinguish between a fall (any of classes 1 to 5) and a non-fall (any of the remaining classes).
The work in [16] evaluates four different machine learning (ML) methods for the fall detection and activity recognition problems: Random Forest (RF), Support Vector Machines (SVM), Multi-Layer Perceptron (MLP) and k-Nearest Neighbors (KNN). The KNN model delivered the best performance using video-only (CAM) data, with an F1-Score of 15.19%. Nevertheless, as shown in Table 3, these were poor results, which were improved significantly by using a CNN (convolutional neural network), reaching an F1-Score of 71.20%.
3 Method
This work focuses on improving the performance of fall detection using only the video data since, in practical applications such as assisted living and public space monitoring, the use of wearable and other sensor modalities is not realistic. The main hypothesis is that the originally poor results can be improved significantly by using articulated bodies (skeletons) extracted from the video, even when using the same ML methods used in [16]. The aim is therefore to implement a fall detection method using information from camera 1 in the dataset, and to compare its performance with that of KNN on optical flow, as initially reported.
Figure 1. Workflow for fall detection: video frames are processed by AlphaPose (human skeleton detection); the resulting feature vector (keypoint coordinates and scores, 51 values per frame) is fed to a KNN classification model, which outputs the label "Fall" or "Not-Fall".
The method developed for this study is illustrated in Fig. 1. It consists of data collection, feature extraction using human skeleton estimation, skeleton filtering and, finally, a classification model for fall recognition. All the steps have been implemented using Python 3.6.
3.1 Feature extraction and selection
The images are located at https://sites.google.com/up.edu.mx/har-up/. The files are organized in 17 folders, one for each subject. Within each folder there are 11 subfolders, one for each activity, and within these there are three further subfolders, one for each trial. In every trial subfolder there is a CSV file that points to a ZIP file with the images recorded by camera 1.
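To make this organization concrete, the following minimal Python sketch (ours, not part of the original paper) walks the subject/activity/trial hierarchy and collects the CSV files; the local root path is hypothetical.

    import os

    DATASET_ROOT = "UP-FALL"  # hypothetical path to a local copy of the dataset

    # Walk the subject/activity/trial hierarchy described above and collect
    # the CSV files that point to the camera-1 image ZIP files.
    csv_files = []
    for subject in sorted(os.listdir(DATASET_ROOT)):          # 17 subject folders
        for activity in sorted(os.listdir(os.path.join(DATASET_ROOT, subject))):
            for trial in sorted(os.listdir(os.path.join(DATASET_ROOT, subject, activity))):
                trial_dir = os.path.join(DATASET_ROOT, subject, activity, trial)
                csv_files += [os.path.join(trial_dir, name)
                              for name in os.listdir(trial_dir)
                              if name.endswith(".csv")]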
3.1.1 Human skeleton detection
Using AlphaPose [18], human pose detection obtains 17 keypoints or joints with coordinates (x, y) which, when joined, form a skeleton of the human pose, together with a detection score for each keypoint. These 51 (17 × (2+1)) attributes are the features used to train a classifier to detect falls.
AlphaPose is an open-source, accurate multi-person pose estimator [18], available at https://www.mvig.org/research/alphapose.html. It takes RGB images as input, performs pose detection with a pre-trained model (COCO dataset) and outputs a JSON file with the (x, y) location of the 17 keypoints, which together form a skeleton with the pose(s) of one or more people.
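As an illustration of how these features can be assembled, the sketch below (ours, not the authors' code) loads an AlphaPose results file and groups the detections by frame. It assumes the standard AlphaPose JSON output, where each entry has an "image_id", a flat "keypoints" list of 17 × (x, y, score) values, and an overall person "score"; the file name is hypothetical.

    import json
    from collections import defaultdict

    # Load the AlphaPose output: one JSON list entry per detected person per frame.
    with open("alphapose-results.json") as f:
        detections = json.load(f)

    # Group skeletons by frame; each flat "keypoints" list already holds the
    # 51 attributes (17 joints x (x, y, score)) used here as the feature vector.
    skeletons_per_frame = defaultdict(list)
    for det in detections:
        skeletons_per_frame[det["image_id"]].append(
            {"features": det["keypoints"], "score": det["score"]})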
Figure 2. Falling woman skeleton detection using AlphaPose.
Feature extraction begins by taking all the image sequences from UP-FALL. These images are then processed with AlphaPose to generate, for each frame, a set of skeleton joint coordinates, which can be visualised in images such as the one shown in Figure 2.
3.1.2 Skeleton filtering
After converting RGB images to skeletons, a clean-up process is needed. First, more than one skeleton (person) may be detected in the same frame. Second, only the UP-FALL images supplied by camera 1 are used, because fall detection is more accurate with a side view than with a front view. Finally, it is important to note that some frames in the database do not contain people, so these images are eliminated to give greater consistency between the skeletonization and the labels.
Figure 3. Image with more than one skeleton.
The detection of more than one skeleton in the same frame presents a problem for the system, generating incorrect classifications. Figure 3 shows an example of the detection of several skeletons in one image. Note that only one skeleton is associated with the label, while the other skeletons are not part of the action (label). Different metrics were designed to find the correct skeleton associated with the label, and the AlphaPose skeletonization confidence score was identified as the best indicator. Finally, a filter was designed that selects the skeleton with the highest score in each frame, eliminating the rest of the skeletons in that frame.
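Continuing the parsing sketch above, the highest-score rule can be expressed in a few lines; this is a minimal illustration, not the authors' exact implementation.

    # Keep only the skeleton with the highest AlphaPose confidence score in
    # each frame; frames without any detected person were already discarded.
    filtered_features = {
        frame: max(skeletons, key=lambda s: s["score"])["features"]
        for frame, skeletons in skeletons_per_frame.items()
    }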
3.2 Classification model

This work seeks to improve upon the results of the best ML model (KNN) in [16] with the CAM modality. Therefore, k-Nearest Neighbors (KNN) has been chosen for direct comparison. This is an instance-based method that compares an input with the k nearest training points and determines the output as the most frequent class observed among those k neighbors.

Model   Parameters
KNN     neighbors = 5
        leaf size = 30
        metric = Euclidean

Table 2. Parameter settings for the KNN model.

Table 2 summarizes the parameter settings for the classifier model. These are the same as those used in [16].
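This classifier maps directly onto scikit-learn; the following sketch applies the Table 2 settings to placeholder data, since the paper does not publish its training code.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Placeholder data standing in for the real UP-FALL features:
    # one 51-value skeleton vector per frame and a binary label.
    X = np.random.rand(1000, 51)
    y = np.random.choice(["Fall", "Not-Fall"], size=1000)
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)

    # KNN classifier with the Table 2 settings (also scikit-learn's defaults).
    knn = KNeighborsClassifier(n_neighbors=5, leaf_size=30, metric="euclidean")
    knn.fit(X_train, y_train)
    print(knn.score(X_test, y_test))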
4 Evaluation
The fall detection method is evaluated by means of the classifier model with the features extracted from the 2D skeleton coordinates obtained with AlphaPose, with each frame of the dataset processed independently. Experiments were performed using 70% of the dataset (154,462 samples) for training and the remaining 30% (66,198 samples) for testing. The dataset contains 220,660 images, of which 49,544 correspond to falls and 171,116 to non-falls.
For the detection of falls, a binary classifier based on the twelve UP-FALL activities is implemented, for which a re-labeling process is carried out. The 5 fall activities (forward falls using the hands, forward falls using the knees, backward falls, falls sitting on an empty chair, and sideways falls) are labeled "Fall" and the 7 daily human activities (walk, stand, picking up an object, sit, jump, lay down and kneel) are labeled "Not Fall".
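The re-labeling step amounts to a simple mapping from activity IDs to binary labels; a minimal sketch with an illustrative list of per-frame IDs:

    # UP-FALL activity IDs 1-5 are falls; 6-11 and 20 are non-falls.
    FALL_IDS = {1, 2, 3, 4, 5}
    activity_ids = [1, 6, 20, 3, 9]  # illustrative per-frame activity IDs
    binary_labels = ["Fall" if a in FALL_IDS else "Not Fall"
                     for a in activity_ids]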
Using the same experimental methodology described in [16], performance evaluation is carried out using ten rounds (k-fold = 10) of cross-validation with random 90:10 partitions of the whole dataset for each classification method. In UP-FALL, the actors involved in training are the same actors used for testing, but their training and test actions are not mixed.
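This protocol can be reproduced with scikit-learn's cross-validation utilities; a sketch assuming X and y are the feature matrix and binary labels from the earlier snippets:

    from sklearn.model_selection import KFold, cross_validate
    from sklearn.neighbors import KNeighborsClassifier

    # Ten rounds of cross-validation on random 90:10 partitions (k-fold = 10).
    knn = KNeighborsClassifier(n_neighbors=5, leaf_size=30, metric="euclidean")
    folds = KFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_validate(knn, X, y, cv=folds,
                            scoring=("accuracy", "f1_macro"))
    print(scores["test_accuracy"].mean(), scores["test_f1_macro"].mean())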
4.1 Evaluation metrics
For a direct comparison, this work uses the same performance metrics as [16]: accuracy, precision, recall (sensitivity), specificity and F1-Score, where:
• True positives (TP): ”Fall” detected as ”Fall”.
• False positives (FP): ”Not Fall” detected as ”Fall”.
• True negatives (TN): ”Not Fall” detected as ”Not Fall”.
• False negatives (FN): ”Fall” detected as ”Not Fall”.
To calculate the accuracy, divide the average number of TP and TN by the total number of cases examined, as Equation (1) shows.

accuracy = (TP + TN) / (TP + FP + TN + FN)    (1)

As per Equation (2), precision is the average number of TP divided by the sum of TP and FP. Recall is the average number of TP across all activities and falls divided by the sum of TP and FN, as Equation (3) shows.

precision = TP / (TP + FP)    (2)

recall = TP / (TP + FN)    (3)

To calculate the specificity, divide the average number of TN by the sum of TN and FP, as Equation (4) shows.

specificity = TN / (TN + FP)    (4)

Finally, the F1-Score is calculated as shown in Equation (5), and is used to evaluate the proposed model as a single figure of merit that considers both precision and recall.

F1-Score = 2 × (precision × recall) / (precision + recall)    (5)

4.2 Results
The presented approach successfully recognizes falls using a KNN classifier on human skeleton poses. The classifier delivered a high accuracy of 98.84%, a precision and recall of 97.53% and 97.30% respectively, a specificity of 99.29% and an F1-Score of 97.41%.
                   Predicted Fall       Predicted Not Fall
Actual Fall        4784 (97.69%) TP     113 (2.31%) FN
Actual Not Fall    128 (0.75%) FP       17041 (99.25%) TN

Figure 4. Confusion matrix for the fall detection method.
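As a sanity check, Equations (1)-(5) can be applied directly to the Figure 4 counts; note that Figure 4 shows the single best cross-validation fold, so the resulting values differ slightly from the averaged metrics reported above.

    # Equations (1)-(5) applied to the Figure 4 confusion matrix counts.
    TP, FN, FP, TN = 4784, 113, 128, 17041

    accuracy = (TP + TN) / (TP + FP + TN + FN)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    specificity = TN / (TN + FP)
    f1 = 2 * precision * recall / (precision + recall)

    print(f"acc={accuracy:.4f} prec={precision:.4f} rec={recall:.4f} "
          f"spec={specificity:.4f} f1={f1:.4f}")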
Model         KNN in [16]   CNN in [16]   KNN in [17]   CNN in [17]   This Work
Accuracy      34.03         95.10         27.30         82.26         98.84
Precision     15.32         71.80         16.32         74.25         97.53
Recall        15.54         71.30         14.35         71.67         97.30
Specificity   93.09         99.50         90.96         77.48         99.29
F1-Score      15.19         71.20         15.27         72.94         97.41

Table 3. Comparison between our proposal and the best models of other camera-vision-based fall detection systems that use the UP-FALL dataset.
Figure 4 shows the best confusion matrix from cross-validation for the KNN classifier, based on accuracy. It can be observed that, of all the fall data, only 2.31% are not recognized as falls.
Therefore, it can be concluded that the original hypothesis
is demonstrated, namely that it is possible to detect falls with
the k-nearest neighbor classifier proposed in [16] when using only a single modality (vision from a camera), through the use of human skeleton pose estimation features obtained via deep neural models, which considerably improves the model performance.
Table 3 compares the best-performing models of camera-vision-based fall detection systems that use the UP-FALL dataset.
The method proposed here detects falls using only one camera, unlike other vision-based fall detection works that use eight cameras, such as [19] and [15], which makes our method simpler and cheaper to implement. The structure of this method (Figure 1) is also very simple compared to other methods: for example, [19] employs PCA for feature extraction and an SVM for classification, and [15] presents a VGG-16 architecture modified to receive optical flow inputs and a CNN for classification. On the other hand, Espinosa et al. [17] use a multi-camera approach on the UP-FALL dataset, obtaining results similar to [19] and [15] while using only two cameras. As Table 3 shows, they get their best performance with a CNN.
Table 3 shows that the proposed human skeleton features method meets the main objective of this work, exceeding by more than 80 percentage points the performance of the models (RF, SVM, MLP and KNN) given in [16] with their windowing method. The use of human skeleton features not only makes it possible to detect falls with high performance, but also makes our KNN model outperform the best camera-vision fall detection models on UP-FALL reported in the state of the art.
5 Conclusion
This paper presents a camera-vision fall detection system with a KNN classification model. A feature extraction method based on human skeleton pose estimation is proposed. The method is evaluated on the publicly available UP-FALL dataset and uses the KNN model of [16] for direct comparison. The method outperforms [16] and other fall detection systems that use the same dataset.
The proposed method demonstrated good results using human skeletonization for feature extraction, despite the detection of more than one skeleton in a single frame. It is possible that the use of other algorithms, such as an LSTM, could eliminate this confusion problem by carrying out an analysis over time, managing to identify the skeleton of interest from a sequence of frames.
For future work, and as part of our ongoing project, a system for fall detection and activity recognition is being developed, comparing four classifier models (RF, SVM, MLP and KNN).
Acknowledgements
This work was supported by the Chilean Research and Development Agency (ANID) under the Project FONDECYT
1191188.
References
[1] World Health Organization, Ageing and Life Course Unit, WHO Global Report on Falls Prevention in Older Age. World Health Organization, 2008.
[2] S. R. de Assis Neto, G. L. Santos, E. da Silva Rocha,
M. Bendechache, P. Rosati, T. Lynn, and P. T. Endo,
“Detecting human activities based on a multimodal sensor data set using a bidirectional long short-term memory
model: A case study,” in Challenges and Trends in Multimodal Fall Detection for Healthcare, pp. 31–51, Springer,
2020.
[3] M. Mubashir, L. Shao, and L. Seed, “A survey on fall
detection: Principles and approaches,” Neurocomputing,
vol. 100, pp. 144–152, 2013.
[4] K. Adhikari, H. Bouchachia, and H. Nait-Charif, “Activity recognition for indoor fall detection using convolutional neural network,” in 2017 Fifteenth IAPR International Conference on Machine Vision Applications
(MVA), pp. 81–84, IEEE, 2017.
[5] C.-Y. Lin, S.-M. Wang, J.-W. Hong, L.-W. Kang, and
C.-L. Huang, “Vision-based fall detection through shape
features,” in 2016 IEEE Second International Conference
on Multimedia Big Data (BigMM), pp. 237–240, IEEE,
2016.
[6] H. Liu and Y. Guo, “A vision-based fall detection algorithm of human in indoor environment,” in Second International Conference on Photonics and Optical Engineering, vol. 10256, p. 1025644, International Society for Optics and Photonics, 2017.
[7] N. Lu, Y. Wu, L. Feng, and J. Song, “Deep learning for fall detection: Three-dimensional CNN combined with LSTM on video kinematic data,” IEEE Journal of Biomedical and Health Informatics, vol. 23, no. 1, pp. 314–323, 2018.
[8] Y. Kong, J. Huang, S. Huang, Z. Wei, and S. Wang,
“Learning spatiotemporal representations for human fall
detection in surveillance video,” Journal of Visual Communication and Image Representation, vol. 59, pp. 215–
230, 2019.
[9] A. Jalal, S. Kamal, and D. Kim, “A depth video sensor-based life-logging human activity recognition system for elderly care in smart indoor environments,” Sensors, vol. 14, no. 7, pp. 11735–11759, 2014.
[10] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz, “Training computationally efficient smartphone-based human activity recognition models,” in International Conference on Artificial Neural Networks, pp. 426–433, Springer, 2013.
[11] L. C. Jatoba, U. Grossmann, C. Kunze, J. Ottenbacher, and W. Stork, “Context-aware mobile health monitoring: Evaluation of different pattern recognition methods for classification of physical activity,” in 2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 5250–5253, IEEE, 2008.
[12] S. Münzner, P. Schmidt, A. Reiss, M. Hanselmann, R. Stiefelhagen, and R. Dürichen, “CNN-based sensor fusion techniques for multimodal human activity recognition,” in Proceedings of the 2017 ACM International Symposium on Wearable Computers, pp. 158–165, 2017.
[13] N. Zerrouki, F. Harrou, A. Houacine, and Y. Sun, “Fall
detection using supervised machine learning algorithms:
A comparative study,” in 2016 8th International Conference on Modelling, Identification and Control (ICMIC),
pp. 665–670, IEEE, 2016.
[14] Y. Peng, J. Peng, J. Li, P. Yan, and B. Hu, “Design and development of the fall detection system based on point cloud,” Procedia Computer Science, vol. 147, pp. 271–275, 2019.
[15] A. Núñez-Marcos, G. Azkune, and I. Arganda-Carreras, “Vision-based fall detection with convolutional neural networks,” Wireless Communications and Mobile Computing, vol. 2017, 2017.
[16] L. Martínez-Villaseñor, H. Ponce, J. Brieva, E. Moya-Albor, J. Núñez-Martínez, and C. Peñafort-Asturiano, “UP-Fall detection dataset: A multimodal approach,” Sensors, vol. 19, no. 9, p. 1988, 2019.
[17] R. Espinosa, H. Ponce, S. Gutiérrez, L. Martínez-Villaseñor, J. Brieva, and E. Moya-Albor, “Application of convolutional neural networks for fall detection using multiple cameras,” in Challenges and Trends in Multimodal Fall Detection for Healthcare, pp. 97–120, Springer, 2020.
[18] H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu, “RMPE: Regional multi-person pose estimation,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2334–2343, 2017.
[19] S. Wang, L. Chen, Z. Zhou, X. Sun, and J. Dong, “Human fall detection in surveillance video based on PCANet,” Multimedia Tools and Applications, vol. 75, no. 19, pp. 11603–11613, 2016.