1. Introduction
Owing to the increasing demand for intelligent health monitoring, comprehensive traffic management, smart security, and advanced surveillance systems, human activity recognition (HAR) has attracted growing attention. These application scenarios place high requirements on the speed and accuracy of HAR methods. As the two main non-wearable sensing modalities, video and radar can meet these requirements. Compared with the ubiquitous availability of video sensors, radar-based HAR offers a number of fundamental relative merits, including robustness to lighting conditions and visual obstructions. Moreover, a radar sensor can preserve the visual privacy of the identified human subjects.
Radar-based HAR mainly involves analyzing the human activity characteristics embedded in the radar echoes. In addition to the essential range and velocity information that can be extracted from the main reflection of a human target, radar echoes contain the micro-motion information of the different moving parts of the human body. All of these movements can be characterized by the Doppler signals of the human target, and the superposition of these Doppler signals constitutes the micro-Doppler feature (MDF), one of the most promising features for HAR. The offset of the target echo in the Doppler frequency domain can also play an important role in multiple-input multiple-output (MIMO) radar SAR imaging. In [
1], high-resolution imaging is achieved by adopting the Doppler-division multiplexing (DDM) technique for MIMO channel separation and combining it with a single-channel BP algorithm and multi-channel synthesis.
For MDF-based HAR, a coherent integration scheme, such as the short-time Fourier transform (STFT) or the Wigner–Ville distribution, is usually applied to obtain the time-frequency (TF) spectrogram. Thereafter, HAR can be realized by extracting the time-varying features of the micro-Doppler signal in the spectrogram. Traditional methods for MDF categorization are based on manually defined spectral description words, which comprise several statistical features, including the bandwidth of the Doppler signal, the torso Doppler frequency, and the normalized standard deviation of the Doppler signal strength [
2]. Fairchild et al. [
3] used empirical mode decomposition to produce a unique feature vector representing the MDF; HAR is then realized in conjunction with a support vector machine (SVM) classifier. As a supervised learning algorithm, SVM is usually suited to large sample sizes. For limited sample sizes, Sharifi [
4] used a relevance vector machine (RVM) to extract flood maps and achieved a classification accuracy of 89%. However, the accuracy and efficiency of the abovementioned methods are limited by their classification complexity.
Recently, deep learning (DL) has received significant attention for its superior recognition performance, and radar-based HAR has also experienced an influx of DL research. These DL approaches can be roughly divided into three classes: convolutional neural networks (CNNs), recurrent neural networks (RNNs), and hybrid networks. These methods automatically extract sample features using supervised learning, overcoming the deficiencies of traditional models in feature extraction.
Typical CNN and numerous deep variants of CNN have been proposed to enhance radar-based HAR performance. Kim et al. [
5] first applied deep CNN to HAR and achieved impressive classification results. The literature [
6,
7] employed a fusion of multi-dimensional CNN to improve recognition accuracy. Moreover, some researchers have attempted to reduce the computational complexity and improve the inference speed using lightweight CNN. Zhu et al. [
8] proposed a lightweight CNN, which gives high recognition accuracy while requiring only a few parameters. Despite these considerable advantages, interpreting the MDF as a two-dimensional (2D) image focuses on the spatial correlation features between pixels. Thus, the extracted features are highly redundant, which results in a complex network with numerous parameters and high hardware requirements. In fact, the activity features in the TF domain are better reflected in the temporal correlation of the Doppler sequences than in the spatial structure of the 2D image.
A temporal network model employing a recurrent neural network can extract the temporal correlation features between data sequences. The literature [
9,
10,
11] employed RNN, long short-term memory (LSTM), and bi-directional LSTM (Bi-LSTM) to realize HAR, respectively. They can all achieve good recognition results. In addition, Shrestha et al. [
12] combined LSTM with Bi-LSTM to implement HAR. Jiang et al. [
13] used a stacked LSTM to implement human activity classification. Both can achieve over 90% average accuracy, but they have a very large number of network parameters. Such complex network models are therefore difficult to apply in embedded applications limited by hardware resources and computing power.
Hybrid networks, such as CNN-LSTM [
14,
15,
16,
17,
18,
19], can achieve enhanced performance compared to individual networks as they combine the expertise of the constituent networks. The hybrid structure can fully exploit the space–time characteristics of input data and improve the accuracy of recognition. The above network typically feeds the hidden state output from the last time step (
) or the hidden state output from all time steps of the LSTM to the classifier for output. In the former case,
only highlights the feature representation within the current time period and the memory effect is limited for long sequences. In the latter case, irrelevant temporal features may also be introduced into the output. Both of these processing methods can lead to limited recognition performance of the network.
In psychology, attention is the cognitive process of selectively concentrating on important things while ignoring others. Researchers have applied this idea to many tasks, such as semantic analysis [
20] and image segmentation [
21]. The attention module is usually not used alone; it can be added to different neural networks to improve their performance. The literature [
22,
23,
24,
25,
26] introduced the attention mechanism into the residual network (ResNet101), convolutional autoencoder (CAE), CNN, LSTM, and Bi-LSTM, respectively. The networks with the attention mechanism converge faster and achieve higher recognition accuracy than their counterparts without it. In addition, the attention module is quite flexible; it can be added anywhere in the network, as in [
27], where the attention modules are placed before the proposed Multi-RNN. Attention also alleviates the vanishing gradient problem, as it provides direct connections between all the time steps. Moreover, the distribution of attention weights provides intuitive insight into the behavior of the trained model.
Most of the above studies have focused on improving the accuracy of HAR while ignoring the limited computational and storage resources of embedded devices; that is, there is relatively little research on lightweight networks in the field of HAR. Therefore, to achieve an efficient recognition network with a lightweight structure for HAR, researchers need to consider various factors, such as the generation mechanism of the micro-Doppler signals mapped from human activity and the pre-processing of the target echoes. Furthermore, researchers also need to consider the data format at the network input and design a reasonable network structure based on the distribution characteristics of micro-Doppler signals, in order to extend HAR to embedded applications.
In this paper, we propose a solution for HAR.
Figure 1 displays the block diagram of the proposed system. The system consists of three parts: human activity collection; radar signal pre-processing; and the CLA network. In this study, the analysis is performed on experimental data collected by a millimeter-wave radar, which advantageously offers high resolution, high detection accuracy, small size, and low cost. In the radar signal pre-processing stage, we utilized the fast Fourier transform (FFT) to compress the range and angle of the target and implemented the average cancellation algorithm to suppress fixed clutter. Additionally, we used constant false alarm rate (CFAR) detection for target detection, which improved the availability of the data. Finally, the STFT was employed to obtain Doppler sequences of the human activity data. In this study, a hybrid DL model based on the attention mechanism, CLA (CNN–LSTM–Attention), is proposed. The proposed model has a relatively light framework and supports advanced features. A one-dimensional (1D) CNN is adopted herein for spectral feature acquisition from the Doppler sequences. Without waiting for the completion of the 2D micro-Doppler map (MDM) after the STFT sliding window processing, the spectral vector generated by each window can be fed into the 1D CNN. This method effectively avoids the high redundancy of 2D CNN feature extraction and reduces time consumption. Then, the LSTM extracts the temporal dependence of the time-varying frequency features over different time steps, obtaining the global temporal information related to human activities. Finally, the attention module uses the attention values to reassign the weights of the time-varying frequency features and integrates the frequency and time dimension characteristics for recognition, which effectively enhances the feature representation.
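The pre-processing steps described above can be sketched in numpy. This is a minimal illustration, not the authors' implementation: the window length of 112 matches the Doppler sequence length used in the experiments, while the hop size, window function, and array shapes are assumptions of this sketch.

```python
import numpy as np

def average_cancellation(echoes):
    """Suppress fixed clutter by subtracting the mean echo over slow time.

    echoes: complex array of shape (n_pulses, n_range_bins).
    Static reflectors contribute the same value to every pulse, so the
    slow-time mean removes them while moving targets are preserved.
    """
    return echoes - echoes.mean(axis=0, keepdims=True)

def doppler_sequences(signal, win_len=112, hop=32):
    """STFT-style sliding window: each window yields one Doppler spectrum.

    signal: complex slow-time samples from the detected target bin.
    Returns an array of shape (n_windows, win_len) whose rows are the
    Doppler sequences that are later fed to the 1D CNN one by one.
    """
    window = np.hanning(win_len)
    n_windows = 1 + (len(signal) - win_len) // hop
    seqs = np.empty((n_windows, win_len))
    for t in range(n_windows):
        seg = signal[t * hop : t * hop + win_len] * window
        seqs[t] = np.abs(np.fft.fftshift(np.fft.fft(seg)))
    return seqs
```

Because each row of the output is produced as soon as its window closes, a downstream network can consume the Doppler sequences one at a time instead of waiting for the full 2D map.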
This article proposes an efficient and lightweight HAR scheme based on the attention mechanism, which focuses on two aspects: radar signal pre-processing and the CLA network. The introduced average cancellation algorithm in the radar signal processing part can enhance the features of human activity, thus improving the recognition performance of HAR. The proposed CLA adopts a 1D processing network based on the attention mechanism, which fully exploits the Doppler and timing characteristics of the target activities reflected in the radar signal. The experimental results show that our proposed method can not only achieve high classification accuracy but also has a lighter network structure compared to traditional algorithms with similar recognition accuracy. This has great potential for resource- and storage-constrained embedded applications.
The contributions of our work are summarized as follows:
(1) We propose a hybrid network that incorporates the attention module. The network decouples the Doppler and temporal features of human activity in the MDM. By using a 1D CNN to extract the Doppler features of the time sequences, the network can be more lightweight. Meanwhile, the attention-based LSTM extracts the important temporal features between the Doppler feature sequences, enhancing the network's ability to capture key features. The experimental results show that the proposed network achieves the best accuracy with relatively low complexity.
(2) In the raw radar signal processing part, the average cancellation method is utilized to suppress the fixed clutter interference, which has been proven to be more suitable for micro-Doppler analysis than the traditional moving target indicator (MTI). Moreover, it improves the recognition performance of HAR compared to signal processing without suppression.
(3) This study explores the optimal structure of the proposed network on a self-established dataset with five types of human behavior. A comparative analysis was conducted with several state-of-the-art HAR networks using the self-established dataset and a public dataset. The use of two different datasets ensures that the final experimental results are fairer and more reasonable. The results show that the proposed method achieves satisfactory accuracy on both datasets.
This paper is organized as follows.
Section 2 provides the details of the micro-Doppler signal process based on the FMCW radar signal model.
Section 3 illustrates the structure of the CLA hybrid multi-network for HAR.
Section 4 provides the experimental results of real data to verify the superiority of the proposed algorithm. Finally,
Section 5 presents the conclusions.
3. CNN-LSTM-Attention Hybrid Multi-Network
Many studies on HAR have directly used the MDM as a 2D image. However, the MDM comprises multiple 1D Doppler spectra spliced segment-by-segment along the slow time axis. The most significant disparity between the mechanisms of the MDM and a 2D image is that the MDM's feature expression is the change of the Doppler spectrum distribution with time. Therefore, directly processing the MDM as a 2D image not only fails to meet real-time processing requirements but also ignores the MDM's unique continuous expression of human activity characteristics, resulting in unsatisfactory efficiency and accuracy.
This study gives full consideration to the real-time performance and effectiveness of HAR. Three network structures are mixed to benefit from their respective advantages. A CLA hybrid multi-network is proposed to fully extract the MDF of human activity. The overall framework of the proposed network is shown in
Figure 7. The 1D CNN extracts the target Doppler features within each time window through a compact network. Then, the LSTM network obtains the temporal correlation information between the Doppler features corresponding to the different windows. Finally, the weight allocation mechanism of the attention mechanism is applied to highlight the important features from the hidden states of LSTM.
Instead of waiting for the end of the window sliding operation of the STFT, the Doppler distribution of each sliding window can be directly fed into the 1D CNN. Compared to multi-dimensional CNNs, the 1D CNN better processes sequential data with fewer network parameters, reducing the hybrid network complexity. The Doppler sequence output by each sliding window is defined as $s_t$, and the set of sequences generated by the entire STFT is $S = \{s_1, s_2, \ldots, s_T\}$.
The 1D CNN network employed herein comprises several convolutional layers, and the layers maintain the same feature scale using same padding. The specific number of convolutional layers and of convolutional kernels in each layer is discussed in Section 4 based on the accuracy and efficiency of the hybrid network. Furthermore, ReLU is adopted as the activation function in the 1D CNN convolution process. The 1D CNN extracts the Doppler feature of each $s_t$ according to the sequence generation order. The Doppler distribution of $s_t$ under the current time window is converted into a Doppler feature sequence $x_t$ through the convolution modules. The sequence set composed of $x_t$ preserves all the temporal correlation information between the Doppler feature sequences.
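As a concrete illustration of the per-window processing, the sketch below applies a single same-padded, ReLU-activated 1D convolution layer to one Doppler sequence. This is a minimal numpy sketch: the kernel values and the single input channel are illustrative assumptions, not the network's trained weights.

```python
import numpy as np

def conv1d_same_relu(x, kernels):
    """Apply a bank of 1D convolution kernels with 'same' padding and ReLU.

    x: one Doppler sequence of shape (length,).
    kernels: array of shape (n_filters, k) with k odd (the paper fixes k=3).
    Returns feature maps of shape (n_filters, length), so the feature
    scale matches the input scale, as described above.
    """
    n_filters, k = kernels.shape
    pad = k // 2
    xp = np.pad(x, pad)  # zero-pad so output length equals input length
    out = np.empty((n_filters, x.shape[0]))
    for f in range(n_filters):
        for i in range(x.shape[0]):
            out[f, i] = np.dot(xp[i:i + k], kernels[f])
    return np.maximum(out, 0.0)  # ReLU activation
```

Stacking several such layers (with the widths discussed in Section 4) turns each window's Doppler distribution $s_t$ into the feature sequence $x_t$ consumed by the LSTM.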
LSTM is used to process the temporal correlation features between the Doppler feature sequences. LSTM is an improved RNN, which alleviates the RNN gradient vanishing problem and captures long-term dependencies. It controls the transmission of information through a gate mechanism to retain the temporally pertinent information across time steps. The internal structure of an LSTM cell is shown in
Figure 8.
In the proposed CLA network, the input sequence of the LSTM cell is $x_t$. The expressions of the corresponding forget gate $f_t$, the input gate $i_t$, the output gate $o_t$, the memory state $c_t$, and the hidden state $h_t$ are as follows:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ and $\tanh$ are the activation functions, $W$ is the weight matrix, $b$ is the bias vector, and $[h_{t-1}, x_t]$ is the splicing vector between the input $x_t$ at moment $t$ and the hidden state $h_{t-1}$ at the previous moment.
The forget gate $f_t$ of the LSTM is used to control the forget strategy of $c_{t-1}$. The input gate $i_t$ and the memory gate control the memory strategy of $x_t$ and $c_t$, respectively. Each LSTM cell outputs two important states: the memory state ($c_t$) and the hidden state ($h_t$). The information passed down from the previous time step is selected for forgetting and updating through the gate mechanism, thus maintaining the memory state $c_t$ at time $t$. $c_t$ not only contains the information of the input $x_t$ at time $t$ in the cell but also historical information from previous time steps, and it is directly output to the next time step without further processing. For each time step, the changes to $c_t$ are relatively small. The output gate $o_t$ selects a portion of the content from $c_t$ as the output $h_t$ of the cell at time $t$. Compared to $c_t$, $h_t$ can focus more on the historical time information that is most important for the input at time $t$; that is, $h_t$ more prominently highlights the feature representation of the current moment. The differences in the $h_t$ passed to the next time step for inputs at different time steps are therefore more noticeable.
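The gate equations above can be condensed into a single numpy LSTM step. This is a minimal sketch: stacking the four gates' weights into one matrix, and the particular gate ordering, are layout assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM time step implementing the gate equations above.

    x_t: input of shape (D,); h_prev, c_prev: previous states of shape (H,).
    W: shape (4*H, H+D), stacking forget/input/candidate/output weights;
    b: shape (4*H,). z is the splicing vector [h_{t-1}, x_t].
    """
    H = h_prev.shape[0]
    z = np.concatenate([h_prev, x_t])   # splicing vector [h_{t-1}, x_t]
    gates = W @ z + b
    f = sigmoid(gates[0:H])             # forget gate f_t
    i = sigmoid(gates[H:2*H])           # input gate i_t
    g = np.tanh(gates[2*H:3*H])         # candidate memory ~c_t
    o = sigmoid(gates[3*H:4*H])         # output gate o_t
    c = f * c_prev + i * g              # memory state c_t
    h = o * np.tanh(c)                  # hidden state h_t
    return h, c
```

Running this cell over the sequence $x_1, \ldots, x_T$ produces the hidden states $h_1, \ldots, h_T$ that the attention module consumes.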
Each Doppler feature sequence $x_t$ is fed into the LSTM network as the input to the corresponding LSTM cell. Typically, HAR networks use the LSTM outputs in one of two ways: the first uses only the last time step output $h_T$, while the second uses the outputs of all time steps $h_1, h_2, \ldots, h_T$. In the first case, as the hidden state with the longest time step, $h_T$ is believed to contain the temporal correlation information of the entire input sequence. However, since $h_T$ emphasizes the current time step's feature expression, it may not "remember" relatively distant information, such as $h_1$. Therefore, obtaining accurate and reasonable feature expressions solely through $h_T$ is difficult. In the second case, outputting all time steps ensures that the feature information at each time step is fully utilized, but it also introduces features that are irrelevant to the current time. Both methods limit the recognition accuracy of HAR.
The proposed network combines these two methods using the attention mechanism. The attention mechanism takes full advantage of all the intermediate hidden layer outputs of the LSTM network, evaluates the importance of the hidden layer information at every time step in combination with $h_T$, and correlates it with the outputs. The weight allocation mechanism focuses on the important hidden layer temporal information, that is, the more important temporal features in the Doppler feature sequences. Therefore, the recognition accuracy of the hybrid network can be effectively improved by introducing the attention mechanism. The network structure of the attention mechanism is shown in
Figure 9.
First, the weight of each LSTM cell's hidden layer output is calculated: a correlation operation determines the similarity score $e_t$ between the output $h_t$ of each hidden layer and the final output $h_T$. Subsequently, the attention weight $\alpha_t$ of each hidden layer is obtained by normalizing $e_t$ using Softmax:

$$\alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)}$$

Weighting each $h_t$ with its $\alpha_t$ yields the weighting factor $V$:

$$V = \sum_{t=1}^{T} \alpha_t h_t$$

$V$ identifies the importance level of the features at different time steps. The final feature expression is the output $V$, which helps the subsequent classification module make accurate judgments based on feature importance.
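The attention pooling described above can be sketched in numpy. The dot product used here as the similarity score is one common choice of correlation and is an assumption of this sketch; the paper only specifies that a correlation with $h_T$ is performed.

```python
import numpy as np

def attention_pool(H_states):
    """Weight LSTM hidden states by their similarity to the last state h_T.

    H_states: array of shape (T, H) whose rows are h_1, ..., h_T.
    Returns (V, alpha): the weighted feature V and the attention weights.
    """
    h_T = H_states[-1]
    scores = H_states @ h_T                 # similarity scores e_t
    scores = scores - scores.max()          # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # Softmax weights
    V = alpha @ H_states                    # weighted feature V
    return V, alpha
```

The weights `alpha` are exactly the values visualized later as attention heatmaps: time steps whose hidden states correlate strongly with $h_T$ receive larger weights in $V$.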
The final classification requires a transformation from $V$ to the conditional probability distribution over the different activity classes:

$$\hat{y} = \mathrm{Softmax}(W_y V + b_y)$$

where $W_y$ is the weight matrix and $b_y$ is the bias vector. Each element $\hat{y}_i$ of $\hat{y}$ denotes the predicted probability of the $i$-th type of behavior.
During the training phase, cross-entropy is applied as the loss function. The label $y$ of the real activity is one-hot encoded, and its length $K$ is the number of activity categories. The cross-entropy is defined as the difference between the real activity label and the predicted action probability:

$$L = -\sum_{i=1}^{K} y_i \log \hat{y}_i$$
Finally, the network parameters are continuously updated via backpropagation through time to reduce the loss value. Thus, the predicted value of the network converges to the real value.
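The classification and loss computations above can be sketched as follows. This is a minimal numpy illustration; the symbols $W_y$ and $b_y$ stand in for trained parameters and are initialized arbitrarily here.

```python
import numpy as np

def softmax(z):
    """Numerically stable Softmax over a vector of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict(V, W_y, b_y):
    """Map the attention feature V to class probabilities y_hat."""
    return softmax(W_y @ V + b_y)

def cross_entropy(y_onehot, y_hat, eps=1e-12):
    """Cross-entropy between the one-hot label and the predicted probabilities."""
    return -np.sum(y_onehot * np.log(y_hat + eps))
```

During training, the gradient of this loss is propagated back through the attention, LSTM, and convolution modules to update all parameters jointly.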
4. Results and Discussions
4.1. Experimental Platform and Parameter Setting
The experimental data were collected using TI’s radar platform at Guangzhou University in September 2021. The radar platform is composed of AWR1843 and DCA1000EVM. AWR1843 is a multi-channel millimeter-wave radar sensor, while DCA1000EVM is a capture card for interfacing with AWR1843 that enables users to stream digital IF data over the Ethernet to the laptop. The experiments were performed in an indoor environment with the radar platform mounted at a height of 1.5 m, as shown in
Figure 10. The radar parameters are listed in
Table 1.
In each individual activity, the IF signals were processed to generate a Doppler sequence group containing 112 Doppler sequences of length 112. This group was then fed sequentially into the subsequent DL networks for recognition. MATLAB was used for radar signal processing, and TensorFlow v1.4.0 was used as the DL framework. All the networks were trained for 50 epochs using the adaptive moment estimation (Adam) optimizer, with the batch size set to 64 and the learning rate set to 0.0001.
4.2. Dataset
In this study, we explored the optimal structure of the proposed network and completed the ablation experiment through a self-established dataset with five types of human behavior. The self-established dataset selected relatively simple activity acquisition patterns, and we will supplement more diverse and larger-scale activity patterns in our future work. Meanwhile, we conducted a comparative experiment to verify the superiority and adaptability of the proposed algorithm through a self-established dataset and a public dataset [
30].
The self-established dataset records five human activities from 10 participants: (0) walking, (1) running, (2) standing up after squatting down, (3) bending, and (4) turning.
Figure 11 shows the sketch maps of the five human activities. The range of movement for walking and running is about 1 m. Considering that the complete cycle of actions such as standing up after squatting down takes a relatively long time, the duration of each data sample is 5 s. The participants were seven males and three females of different heights, weights, and ages. During the experiment, to increase the diversity of the dataset, each participant acted according to their personal habits, without constraints on the specific activities. Each participant repeated a specific activity 20 times, so the activity collection generated 1000 Doppler sequence groups in total. The public dataset, developed by the University of Glasgow, UK, contains six human activities: walking, sitting, standing, picking up an object, drinking water, and falling down. Among them, the duration of walking is 10 s, and the duration of the other activities is 5 s. The range of movement for walking is about 60 m. The data were recorded with an FMCW radar operating at 5.8 GHz with a chirp bandwidth of 400 MHz. A total of 1754 micro-Doppler signature samples were generated. The datasets were randomly divided into 60% for training, 20% for validation, and 20% for testing.
4.3. Discussion of the Proposed Network Structure
The specific recognition performance and the efficiency are closely related to the network structure. In the proposed CLA hybrid network, depth (the number of layers in a neural network) and width (the number of channels in one layer) are often used to describe the network hierarchy of the 1D CNN, which reflects the ability to conceptualize Doppler features from the input Doppler sequences.
The number of LSTM CELLs in a single-layer LSTM network is determined by the number of timing states required for HAR. Herein, the CELL number is fixed at 112. This value is closely related to the signal acquisition duration and the Doppler window interval, and its optimization is not within the scope of this study. The hidden state dimension of the LSTM CELL decides the expressiveness of the network at the current time. With an increasing number of hidden units, more feature details are contained and the expression ability is richer; however, this causes the parameters to surge and increases time consumption.
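As a rough guide to how these structural choices scale, the standard parameter-count formulas for a 1D convolutional layer and an LSTM layer can be computed directly. This is a sketch: the single input channel and kernel size 3 reflect the configuration discussed here, but the counts ignore the attention and classifier parameters.

```python
def conv1d_params(in_ch, out_ch, k=3):
    """Weights plus biases of one 1D convolutional layer."""
    return out_ch * (in_ch * k + 1)

def lstm_params(input_dim, hidden):
    """Four gates, each with weights over [h_{t-1}, x_t] plus a bias."""
    return 4 * (hidden * (input_dim + hidden) + hidden)

# Example: a 64-128 1D CNN front end (assuming a single input channel),
# followed by an LSTM whose input dimension is the 128 CNN channels.
cnn = conv1d_params(1, 64) + conv1d_params(64, 128)
lstm_32 = lstm_params(128, 32)
lstm_64 = lstm_params(128, 64)
```

Doubling the hidden size from 32 to 64 more than doubles the LSTM's parameter count, consistent with the near-doubling trend reported in Table 2.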
The hybrid network CLA requires a trade-off between the time consumption and the expression ability. The feature representative capability can be amplified during training due to the positive feedback between layers. Deepening of the network can improve the nonlinear expression ability, which can implement more complex feature fitting. The width of each layer, that is, the number of convolutional kernels, determines the richness of the captured features in each layer, which is correlated to the difficulty of network optimization. However, blindly deepening the network will cause optimization difficulties, performance saturation, and degradation of the shallow learning capability. Similarly, exceeding the appropriate width will reduce the network efficiency due to repeated feature extraction.
Therefore, to determine the optimal structure of the proposed CLA network, the depth and width of the 1D CNN, as well as the number of LSTM hidden units, were varied to analyze the corresponding network performance. The parameter quantity, the inference time, and the accuracy are the indicators of network performance. Since the frequency domain representation is not sparse and the kernel size is not a key factor affecting the accuracy, CLA adopts a fixed length-3 convolution kernel for the 1D CNN. The time step of the LSTM is fixed at 112, which also fixes the number of time series features connected to the subsequent attention module. All structures were analyzed under the condition of Epochs = 50. The experimental results are listed in
Table 2.
Table 2 shows that the improvement in accuracy is basically consistent with the increase in network complexity. Moreover, the network performance gradually saturates as the depth of the 1D CNN increases. Specifically, the accuracy of the CLA network with a 64–128 1D CNN and 32 LSTM hidden units is close to that of the CLA network with a 64–128–256 1D CNN and 32 LSTM hidden units, but the number of network parameters of the latter is nearly twice that of the former.
In terms of the network efficiency, when the two basic dimensions (depth and width) of 1D CNN are fixed, the accuracy increases with the increase in LSTM hidden units, but the parameters and inference time also increase. Moreover, when the LSTM hidden units reach a certain level, the network also gradually saturates. From
Table 2, the accuracy of the CLA network with a 64–128 1D CNN and 128 LSTM hidden units is only slightly higher than that of the CLA network with a 64–128 1D CNN and 64 LSTM hidden units, but the network parameters and inference time of the former are nearly twice those of the latter.
Finally, the CLA network based on 64–128 1D CNN and 32 LSTM hidden units, marked in bold in
Table 2, is analyzed, compared, and discussed in the following sections. It has the smallest number of parameters and the shortest inference time among the networks, with nearly 97% accuracy.
4.4. Ablation Experiments
This section discusses ablation to evaluate each module’s contribution to the proposed CLA network. The change in accuracy is examined when removing different modules. In total, five network structures are discussed: 1D CNN, LSTM, 1D CNN-LSTM, LSTM-Attention, and CLA. The detailed training, validating, and testing results are shown in
Figure 12 and
Figure 13 and
Table 3.
The comparison shows that the attention-based hybrid network is superior to the single networks in terms of network performance. From the perspective of the training and validation process, single networks, such as the 1D CNN and LSTM, have a weak ability to represent the HAR features due to their limited network structures. Their accuracy and convergence speed are lower than those of the hybrid networks. In addition, for the 1D CNN, there are some differences between the training and validation performance.
The hybrid networks comprising two modules were then compared. The results show that the hybrid networks all outperform the single networks; even the 1D CNN-LSTM, the poorer-performing hybrid network, has 7.93% higher accuracy than the 1D CNN, the better-performing single network. The performance of LSTM-Attention is closest to that of CLA in terms of the indicators and convergence curves. Its accuracy is higher than that of the 1D CNN-LSTM and the single LSTM network by 10.02% and 19.92%, respectively. Unlike LSTM alone, the attention mechanism performs a weighted synthesis of all the hidden states output by the LSTM module at different time steps, which effectively improves the HAR network performance. Furthermore, this result implies that the temporal features embedded in the radar signals represent the different human activities more explicitly than the Doppler features.
The CLA network performs significantly better than the other four networks. It requires the fewest epochs and its network converges the fastest. Among the five networks, the CLA network affords the highest accuracy and the least inference time. On the basis of LSTM-Attention, the CLA applies 1D CNN to pre-extract the Doppler features of the input sequences, effectively solving the insufficient representation of Doppler features by LSTM-Attention. Consequently, the recognition accuracy is improved by 11.06%.
The results of the confusion matrices show that the actions of label 2 (squatting and standing) and label 3 (bending) are easier to confuse than those of other labels. Since HAR is performed depending on the time-dependent changes of the micro-Doppler components introduced by the limb movements, the frequency characteristics of the limb movements determine how similar the actions are. For label 0 (walking), its time-varying frequency is similar to that of label 4 (turning), so there is a misclassification between them. Label 1 (running) has a high recognition probability due to its high movement frequency. For labels 2, 3, and 4, the symmetry, variation law, and amplitude of their micro-Doppler distribution are similar in some cases; thus, they exhibit similar features along slow time in the Doppler domain, leading to recognition errors. However, since CLA integrates Doppler features and attention-based weighted time series features, the recognition accuracy can still be greatly improved.
4.5. Attention Mechanism Discussion
To obtain a detailed understanding of the role of the attention mechanism in the HAR task based on the radar signals, the part of the features that the attention mechanism focuses on is visualized. A heatmap can help obtain a visual representation from the integrated MDM by highlighting the regions considered to be important for HAR. The first row of
Figure 14 displays the grayscale MDM images of the five labeled activities. These images are formed by arranging the input Doppler sequences in the time series. The heatmaps of the grayscale MDM images above are correspondingly displayed in the second row. The red parts in the heatmaps denote the regions that the network focuses on. The attention heatmaps clearly show that most of the red regions are distributed in the endpoint and contour positions that reflect the change of the micro-Doppler distribution. These concerns are consistent with the Doppler distribution characteristics of different activities in the radar signals. Moreover, this demonstrates that CLA can focus on the more representative features in the input sequences along the slow time. Therefore, the attention mechanism improves the performance of CLA.
4.6. Comparison of Different Literature Networks
In this section, multiple networks from the literature are employed for performance comparison. The parameters, inference time, and floating point operations (FLOPs), which indicate the feasibility of embedded deployment of an algorithm, are considered alongside the accuracy. These three indicators reflect the complexity of a network. A larger number of parameters and higher FLOPs usually result in a longer inference time, but this is not absolute and also depends on factors such as the network structure and the performance of the computational device. To fully evaluate the speed and efficiency of the different network models, these three indicators need to be considered comprehensively.
Table 4 summarizes the results of these state-of-the-art studies on the self-established HAR dataset. The training and validation curves of the networks are shown in Figure 15.
The feature extraction method based on 2D images was adopted in [5]. The researchers treated the output MDM of the STFT as a 2D image and used a CNN with two layers of 5 × 5 convolutional kernels to extract local spatial features. Although this network has few parameters and a simple structure, it does not account for the differences between the MDM feature representation and natural visual images. Thus, the HAR accuracy is only 81.03%, with an inference time of 0.94 ms.
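For illustration, the two-layer 5 × 5 convolutional feature extraction of [5] can be sketched on a toy single-channel MDM "image". The input size, the single averaging kernel, and the ReLU placement are hypothetical simplifications here; the actual network in [5] uses learned multi-channel kernels.

```python
import numpy as np

def conv2d_valid(img, kernel):
    # direct valid-mode 2D convolution (no padding, stride 1)
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

mdm = np.random.default_rng(1).standard_normal((32, 32))  # toy MDM "image"
k = np.ones((5, 5)) / 25.0                                # one 5x5 kernel
x = np.maximum(conv2d_valid(mdm, k), 0)                   # layer 1: conv + ReLU
y = np.maximum(conv2d_valid(x, k), 0)                     # layer 2: conv + ReLU
```

Each valid 5 × 5 convolution shrinks the map by 4 in each dimension (32 → 28 → 24), so the receptive field of each output cell is strictly local, which is why such a network captures local texture but not the global slow-time structure of the MDM.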
Considering the temporal characteristics of HAR in radar signals, the networks in [12,13] achieved high recognition accuracy and fast convergence using an LSTM-BiLSTM hybrid network and a three-layer stacked LSTM network, respectively. The HAR accuracy of both networks exceeds 95%, but the results in Table 4 indicate that this high accuracy comes at the cost of a large number of network parameters and a high time overhead. In [12], the number of parameters reaches 1.5 M, the FLOPs reach 2.9 M, and the inference time for a single sample reaches 5.12 ms. This is attributed to the serial structure between the LSTM cells and the fully connected structure within each LSTM cell. Stacking multiple LSTM layers, or combining LSTM with BiLSTM, leads to a significant increase in network size. Moreover, an LSTM relies on the computation at the previous time step to obtain the result at the next one; therefore, the FLOPs and inference time increase as more LSTM cells are introduced. In addition, combining the curves in
Figure 15, we can see that the networks of [12,13] suffer from overfitting to different degrees. One reason is that the number of parameters is too large: the fitting ability of the network becomes very strong, the learned function becomes more complex, and overly complex functions easily cause overfitting. A second reason is that dropout is applied during training but not during validation, which can also manifest as overfitting; under the same conditions, however, the overfitting of CLA is clearly less severe. The comparison with the network of [17] demonstrates the benefit of the attention mechanism, owing to its focus on the key temporal features: the proposed CLA achieves an accuracy improvement of 2.92% and faster convergence compared to [17]. Moreover, the proposed CLA offers higher accuracy and lower network complexity than the previous literature networks. In addition, we also used the public dataset to compare the accuracy of the different literature networks (Accuracy* in Table 4); the results are consistent with the above discussion.
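The parameter growth of stacked LSTMs discussed above follows directly from the per-layer gate structure: each of the four gates carries input weights, recurrent weights, and a bias. A small sketch with hypothetical layer dimensions (not those of [12] or [13]):

```python
def lstm_layer_params(d_in, d_h):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (d_in * d_h + d_h * d_h + d_h)

def lstm_layer_flops_per_step(d_in, d_h):
    # ~2 FLOPs per multiply-accumulate in the gate matrix products
    return 2 * 4 * (d_in * d_h + d_h * d_h)

# e.g. a hypothetical 3-layer stacked LSTM, hidden size 128, 64-dim input
dims = [(64, 128), (128, 128), (128, 128)]
total_params = sum(lstm_layer_params(i, h) for i, h in dims)   # 361,984
```

Because the recurrent term grows with the square of the hidden size, stacking layers or adding a BiLSTM branch (which roughly doubles the recurrent parameters) quickly inflates both the parameter count and the per-step FLOPs, consistent with the overhead of [12,13] in Table 4.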
4.7. Clutter Suppression Performance Experiment
To verify that the average cancellation method used herein improves recognition performance in HAR, two additional datasets were established from the collected radar digital IF data: one processed without clutter suppression and one processed with MTI. The proposed CLA was applied to these three datasets, and the results are shown in Figure 16.
The comparison shows that, with the same recognition network, the dataset processed by the average cancellation method yields higher accuracy than the other two: 6.70% higher than the dataset without clutter suppression and 3.73% higher than the MTI-processed dataset. This is mainly because the average cancellation method effectively suppresses static clutter interference in micro-Doppler applications while more completely preserving and highlighting the MDFs introduced by human activities. The experimental results show that improving HAR performance requires the features of human activity reflected in the radar signals to be fully exploited and deeply mined, through both radar preprocessing and DL network optimization.
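The two preprocessing options compared above can be sketched on toy pulse-by-range data: average cancellation subtracts the slow-time mean of each range bin, while a two-pulse MTI canceller differences consecutive pulses. The array sizes and signal values below are arbitrary illustrations, not the parameters of the collected dataset.

```python
import numpy as np

def average_cancellation(data):
    """Subtract the slow-time mean of each range bin, removing static
    (zero-Doppler) clutter while keeping the micro-Doppler energy."""
    return data - data.mean(axis=0, keepdims=True)

def mti_two_pulse(data):
    """Two-pulse canceller: difference between consecutive pulses."""
    return data[1:] - data[:-1]

# toy data: pulses x range bins, static clutter plus a slow oscillation
n_pulse, n_range = 128, 64
t = np.arange(n_pulse)[:, None]
clutter = np.full((n_pulse, n_range), 5.0)                     # static return
target = 0.5 * np.sin(2 * np.pi * 0.1 * t) * np.ones((1, n_range))
data = clutter + target

ac = average_cancellation(data)   # static clutter removed, target preserved
mti = mti_two_pulse(data)         # clutter removed, but target also filtered
```

Both operators null the static return, but the MTI difference also acts as a high-pass filter along slow time and attenuates slow micro-Doppler components, which is consistent with the accuracy gap observed in Figure 16.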