Article

Deep Learning for Skeleton-Based Human Activity Segmentation: An Autoencoder Approach

by Md Amran Hossen 1, Abdul Ghani Naim 2 and Pg Emeroylariffion Abas 1,*
1 Faculty of Integrated Technologies, Universiti Brunei Darussalam, Gadong BE1410, Brunei
2 School of Digital Science, Universiti Brunei Darussalam, Gadong BE1410, Brunei
* Author to whom correspondence should be addressed.
Technologies 2024, 12(7), 96; https://doi.org/10.3390/technologies12070096
Submission received: 2 May 2024 / Revised: 13 June 2024 / Accepted: 19 June 2024 / Published: 27 June 2024
(This article belongs to the Section Information and Communication Technologies)

Abstract

Automatic segmentation is essential for enhancing human activity recognition, especially given the limitations of publicly available datasets that often lack diversity in daily activities. This study introduces a novel segmentation method that utilizes skeleton data for a more accurate and efficient analysis of human actions. By employing an autoencoder, this method extracts representative features and reconstructs the dataset, using the discrepancies between the original and reconstructed data to establish a segmentation threshold. This innovative approach allows for the automatic segmentation of activity datasets into distinct segments. Rigorous evaluations against ground truth across three publicly available datasets demonstrate the method's effectiveness, achieving impressive average annotation error, precision, recall, and F1-score values of 3.6, 90%, 87%, and 88%, respectively. This illustrates the robustness of the proposed method in accurately identifying change points and segmenting continuous skeleton-based activities as compared to two other state-of-the-art techniques: one based on deep learning and another using a classical time-series segmentation algorithm. Additionally, the dynamic thresholding mechanism enhances the adaptability of the segmentation process to different activity dynamics, improving overall segmentation accuracy. This performance highlights the potential of the proposed method to significantly advance the field of human activity recognition by improving the accuracy and efficiency of identifying and categorizing human movements.

1. Introduction

The recognition of human activity plays a pivotal role across diverse domains, including healthcare, computer vision, human–computer interaction, gaming, entertainment, and robotics [1]. Within the healthcare sector, it enables the monitoring and evaluation of patient movements for rehabilitation purposes or the diagnosis of movement disorders [2]. Meanwhile, in robotics, understanding human activities allows for more natural and intuitive human–robot interactions [3]. In fact, the potential applications of human activity recognition (HAR) are vast, extending to surveillance, sports analysis, virtual reality, and augmented reality [4].
Researchers have employed various data acquisition modalities, including Red, Green and Blue (RGB) cameras, depth sensors, accelerometers, and motion capture (Mocap) devices, to acquire more accurate data for modelling human activities. More recently, skeleton-based data have gained popularity for human activity classification [5,6,7,8,9,10,11], thanks to their effectiveness in capturing the three-dimensional spatial coordinates of human joints and body parts. This method, derived from depth data collected with affordable devices like the Kinect sensor [12], offers a cost-effective and versatile markerless alternative to traditional Mocap systems, which necessitate markers on the body for joint capture and are generally more expensive and lab-centric [13]. Additionally, its adaptability to various environments and its affordability make it an invaluable tool for capturing skeleton data, broadening the accessibility and application of human activity analysis.
The process of segmenting human activity from a continuous stream of skeleton data represents the first but necessary step towards the identification and classification of various actions or gestures performed by an individual [13]. This segmentation is pivotal as it unveils the essence of activities, forming the basis for applications including action motion analysis, activity discovery, and activity recognition [14]. By focusing on the relative positions and movements of joints [12], skeleton data provide a concise yet effective representation of human movements, abstracting away unnecessary details whilst preserving the essential characteristics of actions, thereby facilitating the analysis and recognition of human activities in challenging and complex scenarios where other visual cues may be scarce or ambiguous [15].
Several approaches have been developed to tackle the task of recognizing human activities from skeleton data [4,8,9,10,11]. These approaches range from rule-based methods [11] that rely on predefined patterns or rules, to more advanced techniques employing machine learning algorithms [4] and deep neural networks [16]. Particularly, deep neural networks have shown remarkable success in recent years, leveraging their ability to learn complex spatiotemporal patterns, and generalizing effectively to unseen data. Despite these impressive advances in human activity recognition, one significant challenge remains: the majority of existing methods still heavily rely on manually segmented data. This involves the meticulous annotation and labelling of each frame or a range of frames within the skeleton data to segregate the different activities being performed—a labour-intensive process which is prone to many errors and inconsistencies. Consequently, access to large-scale accurately annotated datasets remains limited, hindering the development and evaluation of new methods and more complex activities in this field.
Addressing this limitation to further improve the state of the art in human activity recognition demands a shift away from labour-intensive manual data annotation processes. Approaches such as weakly supervised learning, transfer learning, or innovative data augmentation techniques could harness unsegmented or weakly labelled data more effectively, paving the way for the development of more robust and accurate skeleton-based human activity recognition systems. These advancements hold promise for enhancing applications in human–computer interaction, sports analysis, and healthcare monitoring [17,18].
This paper comprehensively focuses on the automatic segmentation of human activities through skeleton data. The key contributions of the study are as follows:
  • Introduction of an Autoencoder-Based Segmentation Method: A novel approach was proposed that utilizes autoencoders for continuous activity segmentation. This method leverages skeleton frame reconstruction to precisely identify transition points in continuous activities, setting a new standard for activity recognition.
  • Dynamic Thresholding Technique: A dynamic thresholding technique was introduced, which adaptively updates the segmentation threshold based on the activity data. This enhances the robustness and adaptability of the segmentation process, accommodating varying dynamics of different activities.
  • Comprehensive Evaluation and Comparative Performance Analysis: Extensive experiments are conducted to identify the optimum parameters of the proposed approach, demonstrating its superior effectiveness over existing techniques in terms of segmentation accuracy and key performance metrics. Additionally, the performance of this method is rigorously compared against two advanced techniques—one based on deep learning and the other using a traditional time-series segmentation algorithm, further establishing its potential as a leading solution in the field.
This work begins with a review of related works in Section 2 to highlight both advancements and limitations in the field. Section 3 starts with establishing a notation system for activity segmentation, followed by an overview of statistical-based segmentation methods and an introduction to our proposed automatic segmentation approach using an autoencoder. The criteria adopted for assessing segmentation accuracy are also discussed in the section. In Section 4, results and discussions on method performance, strengths, and limitations are presented, leading to a conclusive summary in Section 5. This structured layout ensures a systematic presentation of the proposed methodology, its evaluation, and its implications for human activity discovery and recognition research.

2. Related Works

The utilization of skeleton data, extracted from depth data, such as those acquired with the Kinect sensor, has garnered substantial attention in human activity recognition [8,9,10,11,12]. However, despite significant progress in human activity recognition, challenges persist in effectively recognizing human activities based on skeleton data [19]. These include managing complex and detailed activities, handling occlusion and noise in the data, addressing inter-class variability, and adapting to real-world scenarios [1,17,20]. Researchers have actively pursued novel algorithms, architectures, and methodologies to bolster automatic segmentation techniques, aiming to enhance accuracy and robustness in addressing these hurdles. Nonetheless, existing datasets in this domain often suffer from limited coverage of activities [2,18], with even the largest skeleton activity datasets containing a limited number of activities and frames. For instance, the large NTU-RGBD [20] dataset contains just over 100 activities, whilst another larger dataset, Human3.6M [21], includes 3.6 million manually annotated frames. These represent a small fraction of the diverse spectrum of human activities [22].
The accurate modelling of human activities necessitates access to more expansive datasets that encompass a more comprehensive set of activities [23]. However, capturing such datasets poses challenges due to the manual efforts required for data collection, activity segmentation, and annotation. Manual segmentation has been proven to be a laborious and time-consuming task, hindering the scalability and availability of large-scale skeleton-based activity datasets. Recognizing this need for larger and more diverse datasets, the research community emphasizes the role of the automatic segmentation of skeleton-based human activities in achieving this goal [7,19]. Recent research has focused on facilitating the segmentation of skeleton-based human activities, aiming to automate this process and thereby eliminate or, at least, reduce manual effort. Automatic segmentation facilitates the creation of larger, more diverse datasets, thereby enriching resources for training and evaluating models in human activity recognition and related tasks.
Various approaches exist for segmenting human activities based on windowing techniques, as depicted in Figure 1, which include fixed window sizes [7,23], windows with overlaps [24], and dense labels whereby each frame is labelled as an activity [25,26]. One prevalent method involves segmenting based on a fixed number of windows or frames with a certain overlapping percentage, allowing for the systematic partitioning of the dataset [22]. Yet, this approach often leads to overlap between different activities within these frames, impacting the accurate modelling of activities. To address this, researchers have explored alternative strategies, such as selecting key frames from fixed windows and employing smaller time gaps between each selected frame [13]. Additionally, clustering methods have been utilized to identify key frames [24]; however, this approach may compromise the chronological order of activity frames. Another avenue explored involves leveraging the joint kinetic energy, derived from the sum over all joints within a frame, across a set of fixed windows [25]. Despite its potential, this method is heavily affected if the same activity is performed in varying positions within the view frame, reducing its effectiveness in accurately segmenting activities. The key-frame selection method based on joint kinetic energy is visualized in Figure 2. The actions of raising the left and right hands are depicted in Figure 2a and Figure 2b, respectively. Within these figures, the joint kinetic energy varies corresponding to the movement of the hand, which is indicated by the arrow. Researchers typically observe changes in joint kinetic energy and often select frames based on local peaks in joint kinetic energy, indicating moments of notable movement intensity within the action being analysed.
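To make the joint kinetic energy criterion concrete, the following is a minimal sketch of a per-frame energy computation and peak-based key-frame selection. It assumes the energy is approximated by the summed squared joint displacement between consecutive frames and uses scipy's find_peaks for the local-maximum search; both choices are illustrative rather than the exact formulation of [25].

import numpy as np
from scipy.signal import find_peaks

def joint_kinetic_energy(frames):
    # frames: array of shape (T, n_joints, 3) of joint coordinates.
    # Returns a length T-1 energy proxy: summed squared joint displacement
    # between consecutive frames (an assumed stand-in for kinetic energy).
    velocities = np.diff(np.asarray(frames), axis=0)
    return np.sum(velocities ** 2, axis=(1, 2))

def select_key_frames(frames, min_gap=15):
    # Key frames at local peaks of the energy curve, with a minimum gap
    # (in frames) enforced between selected peaks.
    energy = joint_kinetic_energy(frames)
    peaks, _ = find_peaks(energy, distance=min_gap)
    return peaks + 1   # energy[i] relates frame i to frame i+1

# Example on random stand-in data: 300 frames of 25 joints in 3D
print(select_key_frames(np.random.rand(300, 25, 3))[:10])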
Several other approaches have also been proposed for segmenting skeleton-based human activities, including rule-based methods [24], unsupervised clustering techniques [25], hidden Markov models (HMMs) [26], dynamic time warping (DTW) [26], recurrent neural networks (RNNs) [27], and convolutional neural networks (CNNs) [28]. These approaches exploit the spatial and temporal relationships encoded in skeleton data to detect transitions and patterns indicative of different activities to automate the segmentation process. However, these methods focus on individually segmented activities rather than on the transitions between activities. Despite the progress in this domain, the segmentation of skeleton-based human activity data remains an open challenge [3,14,19].
There have been a limited number of studies that have utilized automatic feature learning for segmenting human activities. Reference [22] utilized a deep learning-based method to automatically learn features which were then used to segment accelerometer-based activities into respective segments. The dataset was first divided into smaller windows with overlaps, before using an autoencoder to learn the most representative features from each window. The Euclidean distance between consecutive windows was then calculated, and local maxima (Local Max) of the resulting distance curve were treated as breakpoints.
To address the challenges with the segmentation of activities, this paper proposes the leveraging of advancements in deep learning techniques to develop an automated segmentation algorithm. The algorithm identifies transitions and patterns indicative of distinct activities by analysing the temporal dynamics and spatial relationships of joints and body parts in the skeleton data. This research contributes to the advancement of activity recognition by enhancing accuracy and efficiency, with broader implications for motion analysis, human–robot interaction, and other related domains.

3. Proposed Segmentation Method

This research introduces a novel approach for automatic breakpoint detection in human activity recognition by utilizing a deep autoencoder model. Marking a significant evolution from traditional techniques, this innovation is pivotal for improving the accuracy and efficiency of activity recognition systems. Contextualized within the overarching human action recognition pipeline, as illustrated in Figure 3, the process starts with the recording of activities captured as skeleton frames (illustrated in the first row of Figure 3). It then transitions these frames into time-series representations of joint movements (shown in the second row), subsequently segmenting these time-series data into meaningful sections to form the core of the activity dataset. Essential for subsequent steps in activity discovery and recognition, this segmentation process is visualized in the third row of Figure 3.
Contrasting sharply with traditional human activity recognition research that primarily focuses on manual data segmentation—a process both laborious and prone to overlooking critical transitions between activities—this methodology adopts a novel path. It shifts the focus from merely enhancing the accuracy of recognition systems, as indicated by the red dotted line in Figure 3, to the automatic segmentation of human activity, as indicated by the green dashed line.
The utilization of the deep learning capabilities of autoencoders allows for the autonomous extraction of distinct features from skeleton-based frames, conceptualized as time-series data. This strategy deviates from the usual reliance on predefined assumptions about data generation, facilitating a more nuanced understanding of the temporal dynamics within the data. The effectiveness of this method in detecting breakpoints is aligned with specified user requirements, representing a noteworthy advancement in the field. Underpinning this study are two key assumptions: that an activity persists for a substantial duration of time and that activity patterns repeat. Based on these premises, the proposed method automatically segments the dataset into meaningful units, adeptly capturing vital transitions and patterns in human activity over time. Such segmentation not only enables an in-depth analysis of activities, including discovery, recognition, and motion analysis, but also signifies a major advancement by shifting away from the reliance on manual segmentation.

3.1. Activity Segmentation Notation

The dataset $D = \{d_t\}$, $t = 1, \ldots, T$, encompasses a sequence of activities up to time $T$, with the exact number of these activities, represented as $n_K$, remaining unknown. Within this dataset, each instance $d_t$ captures the spatial positions of the various joints constituting the human skeleton at a specific time $t$. Given $n_j$ joints under consideration, each joint $j_i$ characterized by its coordinates $\{x_{j_i}, y_{j_i}, z_{j_i}\}$ in three-dimensional space, the formulation of $d_t$ encapsulates $n_f = 3 \cdot n_j$ features, representing the collection of joints at a specific time $t$ as follows,

$$
d_t = \left[ x_{j_1}(t), y_{j_1}(t), z_{j_1}(t), \ldots, x_{j_i}(t), y_{j_i}(t), z_{j_i}(t), \ldots, x_{j_{n_j}}(t), y_{j_{n_j}}(t), z_{j_{n_j}}(t) \right]^{T} \quad (1)
$$
Equation (1) succinctly encapsulates the spatial dynamics of the skeletal joints across time, laying the groundwork for advanced activity segmentation.
The primary objective is to segment this dataset $D$ into distinct, non-overlapping segments, with each segment representing a distinct activity. This segmentation culminates in the creation of an activity set $K = \{K_1, K_2, \ldots, K_{n_K}\}$, where each element $K_i$ is a subset of $D$ ($K_i \subset D$) corresponding to a distinct activity segment. Notably, every $K_i$ is expressly differentiated from $K_j$ for all $i \neq j$, ensuring the exclusivity of each segment. This meticulous segmentation is crucial as it captures meaningful transitions or patterns in human activity over time, with the effectiveness of the segmentation measured by its ability to clearly delineate these segments, thus facilitating a more nuanced analysis of the activity data.
For each segment $K_i$ in $K$, the index $e_i$ marks the position of its concluding element. These indices form the foundation of the breakpoint set $K_{break} = \{e_1, e_2, \ldots, e_{n_K}\}$, effectively summarizing the segmentation breakpoints. This set plays a pivotal role in mapping out the sequential transition from one activity to the next, thereby providing a structured framework for interpreting the temporal dynamics of the dataset.
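As a concrete illustration of this notation, the short sketch below builds a toy dataset D with an assumed n_j = 25 joints (so n_f = 75 features per frame) and recovers the activity segments K_i from a hypothetical breakpoint set K_break; the numbers are placeholders, not values from the datasets used later.

import numpy as np

n_j = 25              # joints per frame (assumed, e.g., a Kinect v2 skeleton)
n_f = 3 * n_j         # features per instance d_t, as in Equation (1)
T = 1000              # total number of frames in D (placeholder)

# D stacks the per-frame vectors d_t as rows: shape (T, n_f)
D = np.random.rand(T, n_f)

# A segmentation into n_K activity segments is fully described by the
# breakpoint set K_break = {e_1, ..., e_nK} of segment end indices.
K_break = [180, 420, 700, T - 1]
segments = np.split(D, K_break[:-1], axis=0)   # non-overlapping segments K_i
print([s.shape for s in segments])             # (180, 75), (240, 75), (280, 75), (300, 75)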

3.2. Proposed Segmentation of Activities Based on Automatic Features Learning

The proposed method leverages an autoencoder architecture for the automatic segmentation of activities within a skeleton-based human activity dataset D , containing an unspecified number of activities n K . This autoencoder is specifically engineered to minimize the discrepancy between the input data and their reconstructed output, aiming for a high-fidelity replication.
The model is trained on a dataset rich in diverse activities, punctuated by breakpoints characterized by minimal movements. It is postulated that this training strategy will condition the model to exhibit lower reconstruction errors at these critical points of activity transition, or breakpoints, when assessed against new datasets. The identification of these breakpoints is facilitated by establishing a predefined threshold for the reconstruction error and using this threshold as an indication of a breakpoint. The sophisticated nuances of the encoder–decoder architecture, coupled with the strategy for ascertaining the optimal threshold for reconstruction error, are depicted in Figure 4. This approach allows for the precise segmentation of activities by pinpointing their initiation and termination points within the dataset.
Prior to segmentation, the dataset D undergoes a series of preprocessing steps, including data cleaning and normalization, to optimize its quality for the segmentation task. Subsequently, the dataset is divided into distinct sets for training, validation, and testing purposes. The training set is used to train the autoencoder, the validation dataset aids in refining the hyperparameters and enhancing the model architecture, and the testing set evaluates the proficiency of the model for the activity segmentation task. The flowchart of the training and testing processes is given in Figure 5.
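A minimal sketch of this preprocessing and splitting step is given below. The median-based cleaning, min-max normalization, and 70/15/15 chronological split are assumptions for illustration; the paper does not prescribe these exact choices.

import numpy as np

def preprocess_and_split(D, train=0.7, val=0.15):
    # Clean, normalize, and chronologically split the frame matrix D (T x n_f).
    # Median imputation, min-max scaling, and the 70/15/15 ratios are
    # illustrative assumptions, not values reported in the paper.
    D = np.asarray(D, dtype=float)

    # replace missing joint coordinates with per-feature medians
    col_median = np.nanmedian(D, axis=0)
    D = np.where(np.isnan(D), col_median, D)

    # per-feature min-max normalization to [0, 1]
    d_min, d_max = D.min(axis=0), D.max(axis=0)
    D = (D - d_min) / np.maximum(d_max - d_min, 1e-8)

    # chronological split keeps the frame stream in order (no shuffling)
    t1, t2 = int(len(D) * train), int(len(D) * (train + val))
    return D[:t1], D[t1:t2], D[t2:]

D_train, D_val, D_test = preprocess_and_split(np.random.rand(1000, 75))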
The autoencoder was meticulously designed to minimize the reconstruction error, quantitatively assessed through the Mean Square Error loss ($L$), throughout the training phase. The training protocol, outlined as Algorithm 1, initiates with the random assignment of the weight matrices ($W_e$ for the encoder and $W_d$ for the decoder) and bias vectors ($b_e$ and $b_d$, respectively). A technique of using tied weights, denoted as $W' = W^T$, where $W^T$ is the transpose of $W$ [29,30,31,32], is adopted, reflecting a strategy consistent with approaches used in previous works for segmenting accelerometer-based human activity data [33,34].
The optimization of the autoencoder relies on the gradient descent technique to adjust its parameters, seeking to reduce the loss between the original skeleton frames $d(t)$ and their reconstructed outputs $\hat{d}(t)$. This iterative mechanism calculates the gradient of the loss function with respect to each model parameter, thereby guiding the updates to systematically reduce the reconstruction error. The MSE loss function is defined as

$$
L = \frac{1}{n_f} \sum_{i=1}^{n_f} \left( d_i(t) - \hat{d}_i(t) \right)^2 \quad (2)
$$

where $d_i(t)$ represents the actual value of feature $i$ of the input vector and $\hat{d}_i(t)$ represents the reconstructed value of element $i$ of the output vector.
Central to the functionality of the autoencoder is the bottleneck layer, which captures a condensed representation of the input frame d ( t ) , selectively preserving only a critical subset of features. This condensed representation is subsequently decoded to reconstruct the input d ( t ) , with the fidelity of this reconstruction assessed through the loss function. Such a focus ensures the training emphasizes the retention of the integrity of the input data, encapsulating the essential information within a reduced-dimensional space. By incorporating a precise application of the loss function equation and a detailed gradient descent methodology, the training regime effectively enhances the ability of the autoencoder to accurately segment activities, thereby enhancing both data compression and reconstruction capabilities.
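The sketch below illustrates this training scheme with a single-hidden-layer, tied-weight autoencoder in NumPy: a sigmoid encoder, a linear decoder sharing the transposed encoder weights, and plain gradient descent on the MSE loss of Equation (2). The bottleneck size, learning rate, and epoch count are illustrative assumptions rather than the exact configuration used in the experiments.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_autoencoder(D_train, bottleneck=10, lr=0.01, epochs=200):
    # Single-hidden-layer autoencoder with tied weights (W_d = W_e.T),
    # trained by gradient descent on the MSE loss of Equation (2).
    n, n_f = D_train.shape
    W_e = rng.normal(scale=0.1, size=(n_f, bottleneck))
    b_e = np.zeros(bottleneck)
    b_d = np.zeros(n_f)
    for _ in range(epochs):
        encoded = sigmoid(D_train @ W_e + b_e)     # bottleneck representation
        decoded = encoded @ W_e.T + b_d            # tied-weight linear decoder
        err = decoded - D_train
        # gradients of the mean squared reconstruction error
        g_dec = 2.0 * err / err.size
        g_enc = (g_dec @ W_e) * encoded * (1.0 - encoded)
        grad_W = D_train.T @ g_enc + (encoded.T @ g_dec).T
        W_e -= lr * grad_W
        b_e -= lr * g_enc.sum(axis=0)
        b_d -= lr * g_dec.sum(axis=0)
    return W_e, b_e, b_d

def reconstruction_errors(D, W_e, b_e, b_d):
    # Frame-level MSE between each d(t) and its reconstruction (Equation (2)).
    encoded = sigmoid(D @ W_e + b_e)
    decoded = encoded @ W_e.T + b_d
    return np.mean((D - decoded) ** 2, axis=1)

# Example call on a random stand-in for the 75-feature training frames
W_e, b_e, b_d = train_autoencoder(np.random.rand(500, 75))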
Figure 6 illustrates a comparative depiction of selected original skeleton frames alongside their reconstruction errors post-training. This visualization is structured across two distinct rows; the upper row showcases a series of skeleton frames performing varied activities, while the lower row quantifies the reconstruction error between the original and reconstructed frames. Featured activities, including high arm waving, horizontal arm waving, hammering, and high throwing, are visually depicted through detailed skeleton frames. Meanwhile, transitional activities, notably those involving standing positions, are demarcated by black dotted lines. A notable observation from Figure 6 is the fluctuation of reconstruction error across different activities and specifically during transitional moments. Such fluctuations in error rates underscore the sensitivity of the autoencoder to shifts in activity patterns, reflecting its capability to discern subtle movements. This sensitivity is pivotal for the accurate detection of activity transitions, showcasing the effectiveness of the autoencoder in identifying nuances in movement through the analysis of reconstruction errors.
The trained autoencoder was deployed for the task of activity segmentation, with the reconstruction error ( L ) serving as a critical measure. This segmentation process employs a non-overlapping moving window approach, wherein each window of size n ω segments the continuous frame sequence for analysis. A unique reconstruction error threshold ( τ ) was computed for each window, based on the distribution of reconstruction errors ( L ) observed within the window.
The automatic segmentation process for the computation of τ within each window can be described as follows:
  • Reconstruction Error Calculation: for a designated window $\omega$ spanning frames $d(t)$ to $d(t + n_\omega)$, the reconstruction errors $L_i$ of each frame $i \in \omega$ are computed as per Equation (2).
  • Threshold Determination: The threshold $\tau_\omega$ for the datapoints within a given window $\omega$ is set so that a predetermined $x\%$ of errors fall below $\tau_\omega$, while the remaining $(100 - x)\%$ lie above it. This is formally represented as finding $\tau_\omega$ such that $P(L_i \leq \tau_\omega \mid i \in \omega) = x\%$ and $P(L_i > \tau_\omega \mid i \in \omega) = (100 - x)\%$, ensuring a balanced criterion for segment delineation based on the error distribution within the window. In practice, $\tau_\omega$ is obtained by first computing an index as the length of $\omega$ multiplied by the percentile $p$ and taking the error value at that position of the ordered errors; the datapoints within $\omega$ are then compared against $\tau_\omega$, and their signed differences (the sign and $fs\_d$ transformation in Algorithm 2) are used in the subsequent breakpoint search.
  • Breakpoint Identification: Within window $\omega$, breakpoints are identified at locations where the reconstruction error $L_i$ indicates a potential transition, specifically instances $j$ where $L_{j-1} > \tau_\omega$ and $L_j \leq \tau_\omega$, signalling a transition from one activity segment to the next. Once the breakpoints within the current $\omega$ are identified, the procedure repeats for the next window until the sum of the start index and $\omega$ exceeds the length of the reconstruction errors.
By utilizing this method, the autoencoder facilitates the granular segmentation of activities, with the non-overlapping window strategy ensuring a systematic and comprehensive examination of the frame sequence. The mathematical formalization of the threshold setting and breakpoint identification processes allows for a nuanced and precise approach to discerning activity transitions, thereby enhancing the accuracy and reliability of the segmentation outcomes. The overall process for the automatic human activity breakpoint detection method is summarized in Algorithms 1 and 2. Algorithm 1 details the processes involved in training the autoencoder and using the training set for calculating reconstruction errors $L$. Once $L$ is derived, Algorithm 2 denotes the steps to identify potential breakpoints. Three important parameters of the proposed automatic segmentation method are identified: the size of the non-overlapping window $n_\omega$, the proportion value $p\%$ used to determine the threshold $\tau$, and the size of the bottleneck layer of the autoencoder. These parameters are expected to influence the segmentation process. Segmentation results from the proposed method are then compared with two state-of-the-art methods: the Pruning Exact Linear Time (PELT) and Local Max methods [24,34].
Algorithm 1 Train the autoencoder to minimize loss
Input:
- Input skeleton frames (D) divided into D_train, D_validation and D_test; learning rate (α); number of epochs n_e; encoding dimension; decoding dimension; and activation function f
Output:
- Trained autoencoder E_train
1. Initialize W_e, b_e, W_d, and b_d randomly and set tied weights W′ = W^T according to [29,30,31,32], with activation function f and learning rate α
2. For epoch = 1 to n_e, perform the following:
       encoded = f(D_train · W_e + b_e)
       decoded = encoded · W_d + b_d
       loss = mean_squared_error(D_train, decoded)
       δ_d = D_train − decoded
       δ_e = δ_d · W_d^T
       W_d += α · encoded^T · δ_d
       b_d += α · mean(δ_d, axis = 0)
       W_e += α · D_train^T · δ_e
       b_e += α · mean(δ_e, axis = 0)
3. Trained autoencoder:
       E_train = f(D_train · W_e + b_e)
Algorithm 2 Identify breakpoints
Input:
- Trained autoencoder; test data D_test; window size ω; percentile (p)
Output:
- A set of breakpoint indices
1. Initialize breakpoints = [], start_index = 0
2. Reconstruct the test data with the trained autoencoder:
       encoded_test = f(D_test · W_e + b_e)
       decoded_test = encoded_test · W_d + b_d
3. Calculate the reconstruction error (L) for each sample in the test set:
       L[i] = mean((D_test[i] − decoded_test[i])^2)
while (start_index + ω <= len(L)):
       subset = L[start_index : start_index + ω]
       idx = int(p × len(subset))
       τ = sorted(subset)[idx]                  # dynamic per-window threshold
       signed_values = sign(subset − τ)         # the fs_d transformation
       breakpoints ← start_index + positions where signed_values changes from positive to negative (L crosses τ)
       start_index += ω
return breakpoints
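Algorithm 2 can be sketched in a few lines of Python, reusing the reconstruction_errors helper from the earlier autoencoder sketch. The window size of 150 frames and percentile of 0.35 mirror the values later found optimal in Section 4.2 but are passed as parameters; sorting the window errors before indexing is an assumed reading of the percentile step.

import numpy as np

def detect_breakpoints(errors, window=150, p=0.35):
    # Windowed dynamic thresholding over frame-level reconstruction errors.
    # For each non-overlapping window, tau is the p-th percentile of the
    # errors; a breakpoint is flagged where the error crosses from above
    # tau to at-or-below tau (Section 3.2).
    errors = np.asarray(errors)
    breakpoints = []
    start = 0
    while start + window <= len(errors):
        subset = errors[start:start + window]
        idx = int(p * len(subset))
        tau = np.sort(subset)[idx]                 # dynamic per-window threshold
        below = subset <= tau
        crossings = np.where(~below[:-1] & below[1:])[0] + 1
        breakpoints.extend((start + crossings).tolist())
        start += window
    return breakpoints

# Usage with the earlier sketch:
# L = reconstruction_errors(D_test, W_e, b_e, b_d)
# print(detect_breakpoints(L, window=150, p=0.35))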
Several components contribute to the computational complexity of the proposed skeleton-based activity segmentation method. Initially, the use of the autoencoder entails training on $n$ datapoints, each with $f$ features. Training an autoencoder requires $O(n \cdot f \cdot h)$ operations per epoch, where $h$ represents the number of hidden units in the autoencoder. Considering that multiple epochs are typically necessary for convergence, the total complexity reaches $O(E \cdot n \cdot f \cdot h)$, with $E$ denoting the number of epochs. To reconstruct the data, a forward pass through the network is executed, incurring a complexity of $O(n \cdot f \cdot h)$. Additionally, calculating the reconstruction error for all $n$ datapoints requires $O(n \cdot f)$ operations.
On the other hand, during the segmentation phase, the dynamic threshold $\tau$ is updated from the reconstruction errors within each window of size $\omega$, resulting in a complexity of $O(n \log \omega)$. Finally, detecting change points through a single pass over the windowed data based on the threshold necessitates $O(\omega)$ operations per window.
It is evident that the complexity of the proposed method is predominantly governed by the autoencoder training phase. Once the training is complete, the trained autoencoder can be utilized for segmenting activities with minimal additional computational overhead.

3.3. Comparison to Other Segmentation Methods

Two state-of-the-art methods were chosen for comparison against the proposed method.
The Pruning Exact Linear Time (PELT) method is a statistical-based method with a linear cost function, originally introduced in reference [27] and subsequently improved in reference [28] for activity segmentation, with the aim of partitioning a human activity dataset $D$ into an activity set $K$ with an unknown number $n_K$ of segments based on statistical properties, such as normal distribution, mean, standard deviation, and variance. The method is designed to segment not only multiple instances of the same activity but also distinct activities. While statistical segmentation has been predominantly applied in the time-series domain [35,36,37,38], it has also been effectively employed in accelerometer-based human activity segmentation and recognition [22]. The PELT method operates by detecting segmentation boundaries through the optimization of a cost function augmented by a penalty term, based on statistical properties. It identifies change points in the data, a concept central to PELT. The cost function, denoted as $C(\tau)$, is expressed as follows:
$$
C(\tau) = -2 \cdot \log(L(\tau)) + \text{penalty} \cdot |\tau|
$$

where $\tau$ represents the set of change points, $L(\tau)$ is the likelihood of the data given these change points, and the penalty is a term that balances the fit of the model to the data against the complexity of the model. The primary goal of the PELT algorithm is to determine the optimal set of change points, $\hat{\tau}$, that minimizes this cost function:

$$
\hat{\tau} = \underset{\tau}{\arg\min} \left[ -2 \cdot \log(L(\tau)) + \text{penalty} \cdot |\tau| \right]
$$
This formulation encapsulates the essence of the PELT method, striking a balance between model fit and the penalty for the number of change points to derive an optimal solution. Unlike methods that evaluate all possible configurations for change points, PELT prunes the search space by discarding suboptimal solutions. It focuses on configurations that minimize the cost function while adhering to constraints like the specified penalty and minimum segment size. This pruning is crucial to prevent overfitting and ensure reasonable segment sizes, enhancing computational efficiency. The algorithm iteratively generates and evaluates potential change points, pruning those that exceed the penalty threshold or do not meet the minimum segment size criteria, producing final segmentation, K i , which dictates the assignment of each activity frame to a specific segment.
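For reference, PELT is available off the shelf in the ruptures library, and a comparison run can be sketched as follows. The cost model ("rbf"), penalty value, and minimum segment size are illustrative assumptions; the exact configuration used in reference [28] is not specified here.

import numpy as np
import ruptures as rpt  # off-the-shelf PELT implementation

# signal: frames as rows, 75 skeleton features as columns (stand-in data)
signal = np.random.rand(1000, 75)

# Cost model, penalty, and minimum segment size are illustrative assumptions.
algo = rpt.Pelt(model="rbf", min_size=30, jump=1).fit(signal)
breakpoints = algo.predict(pen=10)  # segment end indices; last entry = len(signal)
print(breakpoints)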
Additionally, the proposed method was also compared with a deep learning-based method [22], which uses an autoencoder to learn a compact representation of the input data $x$. The encoder, represented as $E(\cdot)$, maps the input data $x \in \mathbb{R}^{d_x \times 1}$ to $z \in \mathbb{R}^{d_z \times 1}$ hidden units to obtain the features as follows:

$$
z = E(x) = G_e(Wx + b_e)
$$

where $z$ represents the features learnt by the autoencoder, $E(x)$ is the encoding function, $W$ is the weight matrix, $b_e$ is the bias vector, and $G_e(\cdot)$ is a sigmoid activation function defined as

$$
G_e(x) = \frac{1}{1 + e^{-x}}
$$
A weight matrix $W \in \mathbb{R}^{d_z \times d_x}$ and a bias vector $b_e \in \mathbb{R}^{d_z \times 1}$ make up the encoder's parameters. The decoder function is defined as

$$
\tilde{x} = D(z) = G_d(W'z + b_d)
$$

where $\tilde{x}$ is the reconstruction of the input $x$, $D(\cdot)$ is the decoding function, $G_d$ typically takes the same form as $G_e$, $W' \in \mathbb{R}^{d_x \times d_z}$ is the decoder weight matrix, and $b_d \in \mathbb{R}^{d_x \times 1}$ is the bias vector of the decoder. The objective function $J$ of the autoencoder is to minimize the discrepancy between $x$ and $D(E(x))$ with respect to $W$, $b_e$, $b_d$, and $W'$ using the cross-entropy loss. Once $z$ is extracted by minimizing $J$, $z$ is divided into a set of overlapping windows. Then, the distance between the consecutive feature windows $z_t$ and $z_{t-1}$ is calculated as follows:

$$
Dist_t = \frac{\| z_t - z_{t-1} \|_2}{\| z_t \|_2 \times \| z_{t-1} \|_2}
$$
A distance curve is constructed from $Dist_t$ by calculating the distance between each pair of consecutive windows. Finally, the local maxima (Local Max) in the distance curve are identified as breakpoints.
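The distance-curve and local-maximum steps of the Local Max method [22] can be sketched as below, assuming the per-window features z have already been extracted by the autoencoder; the normalization follows the Dist_t expression above, and the peak-picking call from scipy is an assumed implementation detail.

import numpy as np
from scipy.signal import find_peaks

def local_max_breakpoints(z):
    # z: array of shape (n_windows, d_z) of autoencoder features for
    # consecutive overlapping windows. Local maxima of the window-to-window
    # distance curve are treated as breakpoints.
    z = np.asarray(z)
    num = np.linalg.norm(z[1:] - z[:-1], axis=1)
    den = np.linalg.norm(z[1:], axis=1) * np.linalg.norm(z[:-1], axis=1)
    dist = num / np.maximum(den, 1e-12)
    peaks, _ = find_peaks(dist)        # local maxima of the distance curve
    return peaks + 1                   # index of the window where the change occurs

print(local_max_breakpoints(np.random.rand(50, 10)))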

3.4. Performance Evaluation Metrics

The evaluation of the segmentation performance entails a direct comparison between the ground truth set of time-series breakpoints, denoted as $K_{break}$, and the predicted set, denoted as $\hat{K}_{break}$. To gauge the accuracy of a detected breakpoint, a specific margin of error, $m > 0$, is predefined. Within this framework, a breakpoint is deemed accurately identified if it falls within this margin relative to the ground truth. Figure 7 presents a visualization of the reconstruction errors observed across selected activities within the UT-Kinect dataset. A grey shaded area within the figure specifies the acceptable margin of error, setting the bounds within which breakpoint predictions are considered accurate. True breakpoints, representing genuine activity transitions as recorded in the dataset, are distinctly marked with green dashed lines for clear visibility. Conversely, the breakpoints predicted using the autoencoder, based on the analysis of reconstruction errors, are marked with red dashed lines. Predictions made using the autoencoder that extend beyond the grey area, hence surpassing the established margin of error $m$, are categorized as falsely detected. This systematic differentiation is critical for assessing the precision of the autoencoder in identifying activity transitions, facilitating an analytical comparison with the annotated ground truth of the dataset. The visualization in Figure 7 not only highlights the model's performance in recognizing breakpoints but also underscores the importance of error margins in determining the success of such detections.
To assess their performance, several metrics are employed, including annotation error and the F1-score. These metrics provide valuable insights into different aspects of segmentation accuracy.

3.4.1. Annotation Error

Annotation error quantifies the disparity between the actual number of breakpoints, $n(K_{break})$, and the predicted number of change points, $n(\hat{K}_{break})$, and is calculated as follows:

$$
\varepsilon_{annotation} = \left| n(K_{break}) - n(\hat{K}_{break}) \right|
$$

where $n(K_{break})$ and $n(\hat{K}_{break})$ are the actual number and predicted number of segments, respectively. Whilst $\varepsilon_{annotation}$ gives a representation of the difference between the number of predicted segments and actual segments of activity, it does not consider whether the predicted segments accurately segment the activities. As such, other metrics are also considered for the evaluation.

3.4.2. Precision, Recall, and F1-Score

The F1-score, a widely recognized metric, combines precision and recall, evaluating the accuracy of the predicted set $\hat{K}_{break}$ compared to the true set $K_{break}$. It accounts for both the correct identification of breakpoints (precision) and the comprehensive detection of all actual breakpoints (recall). True positives ($T_p$) are defined as actual change points that have a corresponding estimated change point within a margin of $m$. The calculation of $T_p$ between $K_{break}$ and $\hat{K}_{break}$ is given by

$$
T_p(K_{break}, \hat{K}_{break}) = \{ e_i \in K_{break} \mid \exists\, \hat{e}_j \in \hat{K}_{break} \ \text{such that} \ |\hat{e}_j - e_i| < m \}
$$
False positives ($F_p$) are defined as change points that exist in the predicted set but not in the true set. $F_p$ is given by

$$
F_p(K_{break}, \hat{K}_{break}) = \{ \hat{e}_j \in \hat{K}_{break} \mid \nexists\, e_i \in K_{break} \ \text{such that} \ |\hat{e}_j - e_i| < m \}
$$
False negatives ($F_n$) are breakpoints that exist in the true set but do not have a corresponding breakpoint in the predicted set and are given by

$$
F_n(K_{break}, \hat{K}_{break}) = \{ e_i \in K_{break} \mid \nexists\, \hat{e}_j \in \hat{K}_{break} \ \text{such that} \ |\hat{e}_j - e_i| < m \}
$$
A breakpoint that is detected but does not correspond to a true breakpoint is a false positive ( F p ), a point in the time series that is correctly labelled as not being a change point is a true negative ( T n ), and a true change point that was not detected by the algorithm is a false negative ( F n ). Precision and recall can then be determined as follows:
$$
Prec(K_{break}, \hat{K}_{break}) = \frac{|T_p|}{|T_p| + |F_p|}
$$

$$
Rec(K_{break}, \hat{K}_{break}) = \frac{|T_p|}{|T_p| + |F_n|}
$$
The F1-score is the harmonic mean of precision and recall, offering a balanced measure of both, and can be determined as

$$
F1 = \frac{2 \times Prec \times Rec}{Prec + Rec}
$$
Precision, recall, accuracy, and F1-score are computed allowing a fixed margin of ten frames between the predicted and true segments, i.e., m = 10 , as illustrated in Figure 7. The grey area represents the allowed margin, with the true breakpoints denoted by a green dashed line and the predicted breakpoints by a red dashed line. Predicted change points falling outside this margin are considered incorrectly detected.
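The margin-based matching described above translates directly into code; the sketch below computes precision, recall, and F1 for a pair of breakpoint lists with m = 10. The simple any/all matching mirrors the set definitions of T_p, F_p, and F_n rather than enforcing a one-to-one assignment, which is an assumption.

def margin_metrics(true_bkps, pred_bkps, m=10):
    # Precision, recall, and F1 with a +/- m frame tolerance (Section 3.4.2).
    tp = sum(any(abs(p - t) < m for p in pred_bkps) for t in true_bkps)
    fp = sum(all(abs(p - t) >= m for t in true_bkps) for p in pred_bkps)
    fn = len(true_bkps) - tp
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return prec, rec, f1

# Example: two of three true breakpoints matched within 10 frames
print(margin_metrics([100, 250, 400], [95, 255, 500], m=10))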

3.4.3. Area under the Curve

To measure the performance of the proposed method under different window sizes n ω , the receiver operating characteristic (ROC) curve is used. The following defines the true positive rate (TPR) and false positive rate (FPR) in the ROC curve [24,39,40] by allowing a certain margin m.
$$
TPR = \frac{n(T_p)}{n(K_{break})}
$$

$$
FPR = \frac{n(\hat{K}_{break}) - n(T_p)}{n(\hat{K}_{break})}
$$

where $n(T_p)$ is the number of breakpoints that are correctly detected, $n(K_{break})$ is the number of ground truth breakpoints, and $n(\hat{K}_{break})$ is the number of predicted breakpoints.
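A corresponding sketch for the ROC quantities is given below; note that the denominator of the FPR expression above is reconstructed from the surrounding definitions and should be treated as an assumption.

def roc_point(true_bkps, pred_bkps, m=10):
    # One (FPR, TPR) point for the ROC analysis of Section 3.4.3, where
    # TPR = detected true breakpoints / all true breakpoints and
    # FPR = (predictions - detected) / predictions (assumed denominator).
    tp = sum(any(abs(p - t) < m for p in pred_bkps) for t in true_bkps)
    tpr = tp / len(true_bkps) if true_bkps else 0.0
    fpr = (len(pred_bkps) - tp) / len(pred_bkps) if pred_bkps else 0.0
    return fpr, tpr

print(roc_point([100, 250, 400], [95, 255, 500], m=10))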

4. Results

The skeletal joint dataset $D = \{d_t\}$, as given in Equation (1), consists of instances $d(t)$ capturing the spatial positions of various joints, stacked as columns. Each joint is represented in three-dimensional space, so the $n_j$ joints together contribute $n_j \times 3$ features per instance. Thus, for a dataset composed of $n_j = 25$ skeleton joints, each instance $d(t)$ is represented by 75 input features, as illustrated in Figure 4.
The performance of the proposed automatic segmentation method was evaluated across three publicly available skeleton-based human activity datasets. A series of experiments were designed to assess the effectiveness of the proposed automatic skeleton segmentation method in segmenting human activity from skeleton data. The primary objective of these experiments was to segment the dataset D t into an unspecified number of distinct segments K and to identify the segmentation change points K b r e a k .
To gauge the accuracy of the segmentation, the results generated with the proposed segmentation method were compared with ground truth annotations. Three critical parameters—window size $n_\omega$, threshold $x\%$, and the bottleneck layer size—of the proposed method were varied to determine their optimum values. Additionally, the performance of the method was benchmarked against two other segmentation approaches: the Pruning Exact Linear Time (PELT) algorithm, as detailed in reference [28], and a deep learning-based time-series segmentation method known as Local Max, proposed in reference [22].
These experiments were conducted on a standard computing system, equipped with an Intel Core-i7 7th generation processor and SK Hynix 32 GB DDR4 RAM sourced from US-based companies. This setup underscores the applicability of the method, demonstrating that computational optimization does not require specialized hardware. To ensure the robustness and consistency of the results, the experiments were repeated multiple times.

4.1. Datasets

Three distinct datasets were utilized in the research to evaluate the effectiveness of the proposed automatic segmentation method: the UTKinect-Action3D [30], UOW activity [40], and UBD-Kinect [13] datasets. The first dataset, UTKinect-Action3D [30], consists of a sequence of ten (10) activities performed sequentially by ten (10) subjects, with each activity occurring once, whilst the second dataset, the UOW activity dataset [40], encompasses twenty (20) activities performed sequentially by twenty (20) subjects. Additionally, the UBD-Kinect dataset [13] was also employed, featuring seventeen (17) activities executed by three (3) subjects, including multiple instances of each activity. The selection of these diverse datasets intends to test the robustness of the proposed method in capturing activity transitions and handling multiple instances of the same activity, thereby assessing the method's applicability in varied real-world scenarios. Notably, large-scale datasets, particularly the NTU-RGBD 120 [20] and Human3.6M [21] datasets, which contain manually pre-segmented and labelled activities, were not included in this study as the publicly available versions of these datasets do not contain continuous activity sequences. Details of the three datasets used are given in Table 1.
Each dataset was split into training, validation, and test sets. The training set was used to train the autoencoder, while the validation set facilitated hyperparameter tuning, optimizing the network's ability to compress and reconstruct data without overfitting. This division ensured that the autoencoder would generalize effectively to new data. The test sets were exclusively used to evaluate the segmentation performance of the model. This careful separation of the datasets was crucial for a rigorous evaluation, preventing data leakage and enabling the accurate assessment of the model's performance on unseen data.
The fundamental concept behind the proposed automatic segmentation method is based on the observation that the reconstruction error decreases during activity transitions. This is illustrated in Figure 8a–c, showing the skeleton frame reconstruction errors for multiple activities within the UOW, UT-Kinect, and UBD-Kinect activity datasets, respectively. These figures collectively validate the initial hypothesis that the reconstruction error varies between different activities and their transitions.

4.2. Analysis of Variables of the Proposed Segmentation Method

This section presents a comparative analysis of the performance of the proposed automatic segmentation method, focusing on three critical parameters: window size n ω , threshold x % , and the bottleneck layer size. These parameters were adjusted to identify their optimum values for segmenting human activity data.
Figure 9 presents the Receiver Operating Characteristic (ROC) curves for different window sizes n ω across the UOW, UT-Kinect, and UBD-Kinect activity datasets, as illustrated in Figure 9a, Figure 9b, and Figure 9c, respectively. These curves highlight the trade-off between the true positive rate (TPR) and false positive rate (FPR) across different window sizes n ω . The specific AUC values derived from these curves are summarized in Table 2, with AUC values across the three datasets ranging from 0.40 to 0.87. The analysis reveals that window sizes of 150 and 100 frames achieve the best performance, marked by higher AUC values, whereas sizes of 50, 200, 250, and 300 frames demonstrate lower efficacy. This indicates that a window size range of 100–150 frames is optimal for effectively capturing relevant temporal patterns, thereby enhancing breakpoint detection. Table 2 further confirms the model’s superior discriminatory capability, with a window size of 150 frames consistently registering the highest AUC values across all datasets, with the highest AUC values of 0.85, 0.87, and 0.84 across the UOW, UT-Kinect, and UBD-Kinect activity datasets, respectively.
Figure 10a–c illustrate the ROC curves for varying sizes of the bottleneck layer in the autoencoder across the UOW, UT-Kinect, and UBD-Kinect datasets, respectively. In these tests, the original set of n j × 3 input features for each time frame were reduced to 25, 20, 15, and the notably effective size of 10 features before being reconstructed back to their original dimension. Surprisingly, the smallest bottleneck size of 10 features not only efficiently captures the essential information of the n j × 3 -dimensional features but also delivers robust performance in activity segmentation detection tasks across all datasets. This demonstrates the capability of the autoencoder to maintain the accurate representation of the skeleton joint features while reducing computational complexity and resource demands.
The effectiveness of the different bottleneck sizes is quantitatively supported by the data in Table 3, which documents the Area Under the Curve (AUC) values for each dataset corresponding to various bottleneck sizes. Specifically, the AUC values for a bottleneck size of 10 features stand out with exceptional scores of 0.89 for UOW, 0.85 for UT-Kinect, and 0.88 for UBD-Kinect, indicating superior performance. These values highlight the high efficacy of the mode and the role of the optimal bottleneck size in achieving the best model performance across the datasets.
Figure 11a–c illustrates the ROC curves for varying threshold values across the UOW, UT-Kinect, and UBD-Kinect activity datasets, respectively. An analysis of Figure 11 indicates that threshold values between 30 and 40 are particularly effective for the activity segmentation model, achieving a balanced performance in correctly identifying positive instances (TPR) while minimizing false positives (FPR). This range demonstrates robust performance, suggesting it as the optimal threshold for the model. However, when the threshold value exceeds 40%, the model becomes overly conservative, leading to a decrease in the number of activity segments detected (lower TPR). This conservatism may cause the model to overlook transitions between activities, focusing instead only on distinct peaks of activity.
Supporting these findings, Table 4 presents the Area Under the Curve (AUC) values for each dataset at various thresholds. The data show a consistent increase in AUC values as thresholds rise, peaking at 35%. Above this value, the AUC generally declines, indicating that a threshold of 35% optimally balances sensitivity and specificity across all datasets. The specific AUC values at this optimal threshold are notably high, with 0.87 for both UOW and UT-Kinect and 0.86 for UBD-Kinect, confirming the effectiveness of the model in the segmentation task.
The comprehensive analysis conducted on different parameters—window size n ω of 150 frames, threshold limit x % of 35%, and bottleneck size of 10—has conclusively identified these settings as optimal for the proposed automatic segmentation method. These configurations effectively harness the intrinsic characteristics and temporal dynamics of the skeleton data, ensuring that the segmentation process not only exhibits coherent patterns but also accurately captures the transitions between different activities. Consequently, these optimal parameters will be employed for further evaluations throughout the remainder of the study, maintaining a focus on maximizing computational efficiency while preserving the ability of the model to capture essential feature representations effectively.

4.3. Comparison of the Proposed Automatic Segmentation Method against PELT and Local Max Methods

Figure 12a–c illustrate the ROC curves of the UOW, UT-Kinect, and UBD-Kinect activity datasets, comparing the performance of the PELT [28], Local Max [22], and proposed automatic segmentation methods. The ROC graphs clearly demonstrate that the proposed automatic segmentation method outperforms the earlier techniques in the task of human activity segmentation. Table 5 illustrates the Area Under the Curve (AUC) values for these comparisons, with the proposed automatic segmentation method achieving consistently higher AUC scores of 0.90, 0.88, and 0.87 for the UOW, UT-Kinect, and UBD-Kinect activity datasets, respectively. These superior AUC values highlight the enhanced ability of the proposed method to accurately identify the start and end points of activities across a variety of datasets. The consistently high performance across these datasets underscores the robustness and effectiveness of the proposed method in skeleton-based activity segmentation tasks.
Table 6 tabulates the annotation error, precision, recall, and F1-scores across three datasets—the UOW, UT-Kinect, and UBD-Kinect activity datasets—for three segmentation methods: PELT, Local Max, and the proposed automatic segmentation method. Annotation error represents deviations from ground truth segmentation, while precision quantifies the percentage of accurately identified positive segmentations, reflecting the capability of the model to minimize false positives. On the other hand, recall measures the proportion of detected segments against all actual segments in the dataset. The F1-score, a harmonic mean of precision and recall, assesses the balance between these two metrics in terms of segmentation accuracy.
Across the UOW, UT-Kinect, and UBD-Kinect activity datasets, the proposed automatic segmentation method consistently demonstrates the lowest annotation errors and the highest F1-scores compared to the PELT and Local Max methods, with maximum F1-scores reaching 93%, 81%, and 91%, respectively. These scores correspond to annotation errors of 2.2, 5.8, and 2.8. The average F1-score of the proposed method was 88%, considerably higher than the average F1-scores of the PELT and Local Max methods, at 72% and 62%, respectively. These results illustrate the better segmentation performance of the proposed method and showcase its effectiveness in identifying true positives while minimizing false positives and false negatives. This consistency suggests that the proposed method excels in accurately replicating ground truth segmentation and achieving a better trade-off between precision and recall.
Moreover, the proposed method not only outperforms in annotation error and F1-score but also consistently excels across all performance metrics. While the PELT method matches the proposed method’s precision in the UBD-Kinect dataset, it falls short in other measures and datasets. The robust performance of the proposed method across various datasets underscores its reliability and effectiveness in accurately segmenting data compared to the PELT and Local Max methods.
Figure 13a–c illustrate the segmentation accuracy on the UOW, UT-Kinect, and UBD-Kinect activity datasets, respectively. These figures provide a comparative analysis of the segmentation boundaries detected using the proposed method, the PELT method, and the Local Max method against the ground truth. In these comparisons, the Local Max method marks activity breakpoints with green diamonds, identifying these points based on local maxima within the data windows. Meanwhile, the PELT method flags breakpoints where patterns change, highlighted by yellow Xs. In contrast, the proposed method adeptly focuses on transitions between activities, pinpointing most breakpoints with high accuracy, indicated by red dots.
The results underscore the capability of the proposed automatic segmentation method to fully automate the segmentation process of human activities based on skeleton data, effectively eliminating the need for manual annotation. This automation significantly enhances the feasibility of analysing larger datasets. Moreover, the proficiency of the proposed method in adapting to the intrinsic structure of the data not only improves the accuracy but also increases the efficiency of activity recognition and understanding, demonstrating its potential to revolutionize approaches to motion analysis and related fields.

5. Conclusions

The automatic segmentation of human activities is pivotal for advancing applications in various fields such as healthcare, sports science, interactive gaming, and surveillance. The ability to precisely segment and analyse human activities using skeleton data enhances both the accuracy and efficiency of activity recognition systems. This study introduced a novel deep learning-based approach for automatically segmenting skeleton-based human activities into distinct actions, leveraging advancements in unsupervised machine learning.
Utilizing frame-level reconstruction error calculations, the method systematically identifies breakpoints to segment continuous streams of skeleton data. The proposed method employs an autoencoder that effectively discerns transitions between activities by comparing original and reconstructed data, thereby enabling precise segmentation without human intervention. Experiments conducted across three publicly available datasets demonstrated that the proposed method not only enhances the efficiency of activity boundary detection but also significantly outperforms existing methods. It achieved an impressive average F1-score of 88%, substantially higher compared to 62% for the Local Max method and 72% for the PELT method. This superior performance underscores the ability of the proposed method to accurately identify activity segments, achieving an accuracy rate of 88% in detecting correct segments within the datasets tested. The method exhibited robustness in handling variations in activity boundaries, as evidenced by an annotation error as low as 2.2 on its best-performing dataset. This precision is critical for applications requiring a detailed activity analysis. Moreover, the parameters selected for the study—based on the ROC curve analysis—played a pivotal role in optimizing the segmentation process. These parameters were rigorously tested to ensure the best combination for achieving high segmentation accuracy across different activities and datasets.
The significant advancements presented in this work suggest that the proposed automatic segmentation method has the potential to set a new standard in human activity recognition. By reducing the reliance on labour-intensive manual annotations, this method paves the way for more extensive and detailed analyses of human activities, extending its applicability to larger datasets and more complex scenarios. One of the limitations of the proposed segmentation method is its reliance on the assumption that transitions between activities are present. Due to the scarcity of publicly available, continuous human activity datasets that lack transitions, investigating segmentation without explicit transitions was not feasible. Additionally, the choice of parameters significantly influences the overall performance of the segmentation. Future research could address these limitations by developing novel segmentation techniques suited for scenarios where activities occur without clear transitions. Such contexts may include continuous and uninterrupted human activities where traditional segmentation approaches may struggle. Furthermore, exploring methods that minimize or eliminate the need for explicit parameter specification could enhance segmentation accuracy and robustness across diverse real-world settings. This direction would aim to make the segmentation process more adaptive and less dependent on predefined settings, thereby increasing its applicability and effectiveness. The work could revolutionize fields that depend on precise and reliable activity segmentation, such as interactive gaming, advanced surveillance systems, and interactive robotics.

Author Contributions

Conceptualization, M.A.H. and P.E.A.; methodology, M.A.H. and P.E.A.; software, M.A.H.; validation, M.A.H.; formal analysis, M.A.H. and P.E.A.; investigation, M.A.H. and P.E.A.; resources, P.E.A.; data curation, M.A.H.; writing—original draft preparation, M.A.H.; writing—review and editing, P.E.A.; visualization, M.A.H. and P.E.A.; supervision, A.G.N. and P.E.A.; project administration, A.G.N. and P.E.A.; funding acquisition, P.E.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Universiti Brunei Darussalam, grant number: UBD/RSCH/1.3/FICBF(b)/2024/023.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study were derived from the following resources available in the public domain: UTKinect-Action3D, accessible at https://cvrc.ece.utexas.edu/KinectDatasets/HOJ3D.html, accessed on 18 June 2024; UOW activity dataset, accessible at https://sites.google.com/view/wanqingli/data-sets/uow-online-action3d, accessed on 18 June 2024; and UBD-Kinect Dataset, accessible at https://github.com/MicroBugTracker/UBD-Kinect, accessed on 18 June 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, C.; Yan, J. A Comprehensive Survey of RGB-Based and Skeleton-Based Human Action Recognition. IEEE Access 2023, 11, 53880–53898. [Google Scholar] [CrossRef]
  2. Hossen, M.A.; Naim, A.G.; Abas, P.E. Evaluation of 2D and 3D posture for human activity recognition. AIP Conf. Proc. 2023, 2643, 40013. [Google Scholar] [CrossRef]
  3. Hossen, M.A.; Abas, P.E. A comparative study of supervised and unsupervised approaches in human activity analysis based on skeleton data. Int. J. Comput. Digit. Syst. 2023, 14, 10407–10421. [Google Scholar] [CrossRef]
  4. Wang, L.; Huynh, D.Q.; Koniusz, P. A Comparative Review of Recent Kinect-Based Action Recognition Algorithms. IEEE Trans. Image Process. 2020, 29, 15–28. [Google Scholar] [CrossRef]
  5. Sun, Z.; Ke, Q.; Rahmani, H.; Bennamoun, M.; Wang, G.; Liu, J. Human action recognition from various data modalities: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3200–3225. [Google Scholar] [CrossRef] [PubMed]
  6. Feng, M.; Meunier, J. Skeleton graph-neural-network-based human action recognition: A survey. Sensors 2022, 22, 2091. [Google Scholar] [CrossRef]
  7. Duan, H.; Zhao, Y.; Chen, K.; Lin, D.; Dai, B. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 2969–2978. [Google Scholar]
  8. Zhang, H.-B.; Zhang, Y.-X.; Zhong, B.; Lei, Q.; Yang, L.; Du, J.-X.; Chen, D.-S. A Comprehensive Survey of Vision-Based Human Action Recognition Methods. Sensors 2019, 19, 1005. [Google Scholar] [CrossRef] [PubMed]
  9. Presti, L.L.; La Cascia, M. 3D skeleton-based human action classification: A survey. Pattern Recognit. 2016, 53, 130–147. [Google Scholar] [CrossRef]
  10. Pareek, P.; Thakkar, A. A survey on video-based human action recognition: Recent updates, datasets, challenges, and applications. Artif. Intell. Rev. 2021, 54, 2259–2322. [Google Scholar] [CrossRef]
  11. Xing, Y.; Zhu, J. Deep Learning-Based Action Recognition with 3D Skeleton: A Survey; Wiley Online Library: Hoboken, NJ, USA, 2021. [Google Scholar]
  12. Shotton, J.; Sharp, T.; Kipman, A.; Fitzgibbon, A.; Finocchio, M.; Blake, A.; Cook, M.; Moore, R. Real-Time human pose recognition in parts from single depth images. Commun. ACM 2013, 56, 116–124. [Google Scholar] [CrossRef]
  13. Hossen, M.A.; Hong, O.W.; Caesarendra, W. Investigation of the Unsupervised Machine Learning Techniques for Human Activity Discovery. In Proceedings of the 2nd International Conference on Electronics, Biomedical Engineering, and Health Informatics, Surabaya, Indonesia, 3–4 November 2022; pp. 499–514. [Google Scholar]
  14. Kim, E.; Helal, S.; Cook, D. Human Activity Recognition and Pattern Discovery. IEEE Pervasive Comput. 2010, 9, 48–53. [Google Scholar] [CrossRef] [PubMed]
  15. Wang, J.; Chen, Y.; Hao, S.; Peng, X.; Hu, L. Deep learning for sensor-based activity recognition: A survey. Pattern Recognit. Lett. 2019, 119, 3–11. [Google Scholar] [CrossRef]
  16. Herath, S.; Harandi, M.; Porikli, F. Going deeper into action recognition: A survey. Image Vis. Comput. 2017, 60, 4–21. [Google Scholar] [CrossRef]
  17. Dang, L.M.; Min, K.; Wang, H.; Piran, M.J.; Lee, C.H.; Moon, H. Sensor-based and vision-based human activity recognition: A comprehensive survey. Pattern Recognit. 2020, 108, 107561. [Google Scholar] [CrossRef]
  18. Beddiar, D.R.; Nini, B.; Sabokrou, M.; Hadid, A. Vision-based human activity recognition: A survey. Multimed. Tools Appl. 2020, 79, 30509–30555. [Google Scholar] [CrossRef]
  19. Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.-Y.; Kot, A.C. NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2684–2701. [Google Scholar] [CrossRef]
  20. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 1325–1339. [Google Scholar] [CrossRef] [PubMed]
  21. Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View Adaptive Neural Networks for High Performance Skeleton-Based Human Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 1963–1978. [Google Scholar] [CrossRef] [PubMed]
  22. Lee, W.-H.; Ortiz, J.; Ko, B.; Lee, R. Time series segmentation through automatic feature learning. arXiv 2018, arXiv:1801.05394. [Google Scholar]
  23. Singh, R.; Sonawane, A.; Srivastava, R. Recent evolution of modern datasets for human activity recognition: A deep survey. Multimed. Syst. 2020, 26, 83–106. [Google Scholar] [CrossRef]
  24. Gaugel, S.; Reichert, M. PrecTime: A deep learning architecture for precise time series segmentation in industrial manufacturing operations. Eng. Appl. Artif. Intell. 2023, 122, 106078. [Google Scholar] [CrossRef]
  25. Cippitelli, E.; Gasparrini, S.; Gambi, E.; Spinsante, S. A Human Activity Recognition System Using Skeleton Data from RGBD Sensors. Comput. Intell. Neurosci. 2016, 2016, 4351435. [Google Scholar] [CrossRef] [PubMed]
  26. Shan, J.; Akella, S. 3D human action segmentation and recognition using pose kinetic energy. In Proceedings of the 2014 IEEE International Workshop on Advanced Robotics and its Social Impacts, Evanston, IL, USA, 11–13 September 2014; pp. 69–75. [Google Scholar] [CrossRef]
  27. Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 1096–1103. [Google Scholar]
  28. Yang, K.; Liu, Y.; Yu, Z.; Chen, C.L.P. Extracting and composing robust features with broad learning system. IEEE Trans. Knowl. Data Eng. 2021, 35, 3885–3896. [Google Scholar] [CrossRef]
  29. Creswell, A.; Bharath, A.A. Denoising adversarial autoencoders. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 968–984. [Google Scholar] [CrossRef]
  30. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507. [Google Scholar] [CrossRef] [PubMed]
  31. Jackson, B.; Scargle, J.D.; Barnes, D.; Arabhi, S.; Alt, A.; Gioumousis, P.; Gwin, E.; Sangtrakulcharoen, P.; Tan, L.; Tsai, T.T. An algorithm for optimal partitioning of data on an interval. IEEE Signal Process. Lett. 2005, 12, 105–108. [Google Scholar] [CrossRef]
  32. Killick, R.; Fearnhead, P.; Eckley, I.A. Optimal detection of changepoints with a linear computational cost. J. Am. Stat. Assoc. 2012, 107, 1590–1598. [Google Scholar] [CrossRef]
  33. Truong, C.; Oudre, L.; Vayatis, N. Selective review of offline change point detection methods. Signal Process. 2020, 167, 107299. [Google Scholar] [CrossRef]
  34. Aghabozorgi, S.; Shirkhorshidi, A.S.; Wah, T.Y. Time-series clustering—A decade review. Inf. Syst. 2015, 53, 16–38. [Google Scholar] [CrossRef]
  35. Zhu, Z. Change detection using landsat time series: A review of frequencies, preprocessing, algorithms, and applications. ISPRS J. Photogramm. Remote Sens. 2017, 130, 370–384. [Google Scholar] [CrossRef]
  36. Lovrić, M.; Milanović, M.; Stamenković, M. Algoritmic methods for segmentation of time series: An overview. J. Contemp. Econ. Bus. Issues 2014, 1, 31–53. [Google Scholar]
  37. Kawahara, Y.; Sugiyama, M. Change-point detection in time-series data by direct density-ratio estimation. In Proceedings of the 2009 SIAM International Conference on Data Mining, Sparks, NV, USA, 30 April–2 May 2009; pp. 389–400. [Google Scholar]
  38. Liu, S.; Yamada, M.; Collier, N.; Sugiyama, M. Change-point detection in time-series data by relative density-ratio estimation. Neural Netw. 2013, 43, 72–83. [Google Scholar] [CrossRef] [PubMed]
  39. Xia, L.; Chen, C.C.; Aggarwal, J.K. View invariant human action recognition using histograms of 3D joints. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; pp. 20–27. [Google Scholar] [CrossRef]
  40. Tang, C.; Li, W.; Wang, P.; Wang, L. Online human action recognition based on incremental learning of weighted covariance descriptors. Inf. Sci. 2018, 467, 219–237. [Google Scholar] [CrossRef]
Figure 1. Different windowing methods of segmenting activities compared with the ground truth segmentation boundaries.
Figure 2. Visualization of key frame selection method from the literature using the UBD-Kinect dataset; (a) raised left hand and (b) raised right hand.
Figure 3. Overview of Complete Human Activity Recognition/Discovery Framework. The Green Dashed Line Highlights the Focus of this Work. The Red Dotted Line Shows the Focus of Most of the Human Activity Recognition Research.
Figure 4. The Encoder–Decoder Architecture and Scheme to Calculate the Threshold Value of the Proposed Method.
Figure 5. Training and Testing Processes of the Proposed Automatic Segmentation Method.
Figure 6. Reconstruction Error and Transitions Between Different Activities. Upper Row Shows Sample Skeleton Frames and Lower Row Displays Reconstruction Error.
Figure 7. Reconstruction Error Between Some of the Activities from the UT-Kinect Dataset. The Green Dashed Line Represents Ground Truth Change Points and the Red Dashed Line Represents Predicted Change Points, with the Allowed Margin m Shown in the Grey Areas.
Figure 8. Reconstruction Errors of the (a) UOW, (b) UT-Kinect, and (c) UBD-Kinect Activity Datasets.
Figure 9. The ROC curve of the (a) UOW activity dataset, (b) UT-Kinect dataset, and (c) UBD-Kinect dataset under different window sizes.
Figure 10. The ROC curve of the (a) UOW activity dataset, (b) UT-Kinect dataset, and (c) UBD-Kinect dataset with different feature sizes in the bottleneck layer.
Figure 11. The ROC curve of the (a) UOW activity dataset, (b) UT-Kinect dataset, and (c) UBD-Kinect dataset at different threshold values.
Figure 12. The ROC curve of the (a) UOW activity dataset, (b) UT-Kinect dataset, and (c) UBD-Kinect dataset under different methods.
Figure 13. Segmentation accuracy of the different methods across the (a) UOW, (b) UT-Kinect, and (c) UBD-Kinect Activity Datasets.
Table 1. Details of the datasets used for the study. Columns represent the name of the dataset, number of activities, number of subjects, total number of action frames, and number of joints per frame.

Dataset              | Activities | Subjects | Action Frames | Number of Joints/Frame
UT-Kinect-Action3D   | 10         | 10       | 15,860        | 20
UOW Activity Dataset | 20         | 20       | 92,505        | 25
UBD-Kinect Dataset   | 13         | 5        | 221,818       | 20
Table 2. Area Under the Curve (AUC) for all three datasets with different window sizes.

Window Size     | UOW  | UTK  | UBD
Window Size 50  | 0.57 | 0.75 | 0.62
Window Size 100 | 0.74 | 0.83 | 0.75
Window Size 150 | 0.85 | 0.87 | 0.84
Window Size 200 | 0.65 | 0.61 | 0.56

Highest AUC values for every Window Size are bolded.
Table 3. Area Under the Curve (AUC) for all three datasets with different feature sizes.

Feature Size    | UOW  | UTK  | UBD
Feature Size 10 | 0.89 | 0.85 | 0.88
Feature Size 15 | 0.84 | 0.83 | 0.80
Feature Size 20 | 0.85 | 0.83 | 0.82
Feature Size 25 | 0.78 | 0.79 | 0.75

Highest AUC values for each Feature Size are bolded.
Table 4. Area Under the Curve (AUC) for all three datasets with different threshold values.

Threshold    | UOW  | UTK  | UBD
Threshold 15 | 0.62 | 0.66 | 0.60
Threshold 30 | 0.75 | 0.81 | 0.75
Threshold 35 | 0.87 | 0.87 | 0.86
Threshold 40 | 0.67 | 0.73 | 0.68
Threshold 55 | 0.40 | 0.41 | 0.48
Threshold 60 | 0.35 | 0.39 | 0.44

Highest AUC values for each Threshold are bolded.
Table 5. Area Under the Curve (AUC) Values for All Three Methods with Different Datasets.

Method          | UOW  | UTK  | UBD
PELT [32]       | 0.75 | 0.75 | 0.80
Local MAX [22]  | 0.63 | 0.78 | 0.82
Proposed Method | 0.90 | 0.88 | 0.87
Table 6. Annotation error, precision (P), recall (R), and F1-score (F1) of all datasets and methods combined.

            | PELT [32]                   | Local MAX [22]              | Proposed Method
Dataset     | Ann. Err.  P     R     F1   | Ann. Err.  P     R     F1   | Ann. Err.  P     R     F1
UOW         | 3.8        0.79  0.68  0.73 | 4.6        0.73  0.70  0.71 | 2.2        0.92  0.94  0.93
UT-Kinect   | 7.4        0.67  0.57  0.61 | 7.1        0.81  0.63  0.71 | 5.8        0.87  0.76  0.81
UBD-Kinect  | 4.1        0.91  0.73  0.81 | 3.8        0.73  0.54  0.61 | 2.8        0.91  0.92  0.91
Average     | 5.1        0.79  0.66  0.72 | 5.17       0.76  0.62  0.68 | 3.6        0.90  0.87  0.88

Each row represents a dataset, and each column group shows the annotation error (Ae), P, R, and F1-score of the respective method. The lowest annotation error and the highest precision, recall, and F1-score for each dataset are bolded.
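To make the metrics in Table 6 concrete, the short sketch below scores predicted change points against ground truth using an allowed margin, in the spirit of the margin m shown in Figure 7. The greedy matching rule, the margin of 15 frames, and the definition of annotation error as the mean frame offset of matched points are assumptions made for illustration; they are not the exact evaluation procedure or settings reported in this study.

```python
import numpy as np


def evaluate_change_points(pred, truth, margin=15):
    # Match each predicted change point to the nearest unused ground-truth point within
    # the allowed margin; matched pairs count as true positives.
    matched_offsets, used = [], set()
    for p in pred:
        best, best_dist = None, margin + 1
        for i, t in enumerate(truth):
            if i not in used and abs(p - t) <= margin and abs(p - t) < best_dist:
                best, best_dist = i, abs(p - t)
        if best is not None:
            used.add(best)
            matched_offsets.append(best_dist)
    tp = len(matched_offsets)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    ann_error = float(np.mean(matched_offsets)) if matched_offsets else float("nan")
    return ann_error, precision, recall, f1


# Toy example: three predicted boundaries scored against three ground-truth boundaries.
print(evaluate_change_points(pred=[102, 255, 399], truth=[100, 250, 410], margin=15))
```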
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
