Article

Narrowband Radar Micromotion Targets Recognition Strategy Based on Graph Fusion Network Constructed by Cross-Modal Attention Mechanism

1 School of Electronics and Communication Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China
2 Air Early Warning System Application Laboratory, Air Force Early Warning Academy, Wuhan 430019, China
3 School of Systems Science and Engineering, Sun Yat-sen University, Guangzhou 510275, China
4 Institute of Remote Sensing Satellite, China Academy of Space Technology, Beijing 100094, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(4), 641; https://doi.org/10.3390/rs17040641
Submission received: 12 December 2024 / Revised: 18 January 2025 / Accepted: 11 February 2025 / Published: 13 February 2025
(This article belongs to the Special Issue Ocean Remote Sensing Based on Radar, Sonar and Optical Techniques)

Abstract

In the domain of micromotion target recognition, target characteristics can be extracted through either broadband or narrowband radar echoes. However, due to technical limitations and cost constraints in acquiring broadband radar waveform data, recognition can often only be performed through narrowband radar waveforms. To fully utilize the information embedded within narrowband radar waveforms, it is necessary to conduct in-depth research on multi-dimensional features of micromotion targets, including radar cross-sections (RCSs), time frequency (TF) images, and cadence velocity diagrams (CVDs). To address the limitations of existing identification methodologies in achieving accurate recognition with narrowband echoes, this paper proposes a graph fusion network based on a cross-modal attention mechanism, termed GF-AM Net. The network first adopts convolutional neural networks (CNNs) to extract unimodal features from RCSs, TF images, and CVDs independently. Subsequently, a cross-modal attention mechanism integrates these extracted features into a graph structure, achieving multi-level interactions among unimodal, bimodal, and trimodal features. Finally, the fused features are input into a classification module to accomplish narrowband radar micromotion target identification. Experimental results demonstrate that the proposed methodology successfully captures potential correlations between modal features by incorporating cross-modal multi-level information interactions across different processing stages, exhibiting exceptional accuracy and robustness in narrowband radar micromotion target identification tasks.

1. Introduction

With the rapid development of radar technology, significant progress has been made in the application of radar to target detection and recognition, both in classifying stationary targets [1,2,3] and in detecting high-speed moving targets [4,5,6,7,8,9,10]. A radar system works by transmitting electromagnetic waves and receiving their reflected signals, enabling the effective detection and tracking of a target's motion state. Traditional radar target recognition methods typically rely on the geometric features and radar scattering characteristics of the targets. However, as the movement patterns of targets become more complex, especially with the emergence of micromotion targets, traditional methods face significant challenges in terms of recognition accuracy and robustness.
In complex adversarial environments, primary targets and interfering targets frequently co-exist, significantly challenging the identification capabilities of defense systems. Given their similar physical properties and morphological characteristics, conventional shape-based detection methodologies demonstrate limited efficacy in discriminating between these entities. Consequently, there exists an urgent imperative to explore and utilize alternative feature information of targets to enhance the accuracy and reliability of identification processes [9,10,11,12,13].
In addition to the translational motion, primary targets and interfering targets exhibit micromotion in the form of spin, precession, and nutation during spatial flight [14]. Since primary targets typically employ attitude control techniques to enhance hit accuracy, their micromotion characteristics diverge significantly from those of interfering targets. The diverse forms of micromotion yield disparate modulation effects on radar echo signals, thereby enabling the effective distinction between primary targets and interfering targets through the analysis of their respective micromotion characteristics in radar echo signals.
The identification of micromotion targets can be broadly divided into two main categories [15]. The first category involves extracting physical and statistical features from radar echo signals, which can be subsequently used for classification through template matching or statistical classifiers. For example, researchers extract physical features such as the fundamental frequency, peak-to-peak difference, and 3dB Doppler bandwidth from the radar cross-section (RCS), as well as statistical features such as the mean, variance, and sequential center distance [9,16]. Moreover, features can be extracted through image processing methods, such as converting echo signals into time frequency (TF) images using the short-time Fourier transform (STFT). From these images, physical features such as the Doppler period, frequency, and bandwidth, as well as statistical features such as the entropy and third-order moments, can be obtained [17,18]. In addition to the RCS and TF features, other features, such as the cadence velocity diagram (CVD) [19,20] and high-resolution range profile (HRRP) [14,21,22,23] are also used for the micromotion target identification.
However, traditional classification methods, such as the support vector machine (SVM) and k-nearest neighbor (KNN), which rely on physical and statistical features, demonstrate limited performance in complex scenarios. To address these limitations, the second category of methods focuses on designing deep learning networks to automatically learn features from the data for micromotion target recognition [24,25,26,27,28,29,30,31,32,33,34,35]. For instance, Chen et al. proposed the RCS Net to facilitate feature learning from RCS data [25,26], while Bai et al. developed a network specifically designed for recognizing TF data to classify micromotion targets [27]. Similarly, Feng et al. utilized HRRP data for target recognition, achieving micromotion target classification [28]. However, while TF data effectively capture the Doppler information of individual scattering centers, they lack the ability to provide detailed range information. In contrast, the HRRP excels in representing the target's range information but cannot intuitively express the Doppler characteristics. Consequently, relying on a single data modality often fails to comprehensively describe the micromotion target's shape, structure, and motion. Although single-modality micromotion target recognition methods can generally achieve accuracies above 90% under their respective experimental conditions, the inherent limitations of each modality restrict their potential for achieving higher recognition performance.
To address the limitations of single-modality approaches, researchers have increasingly focused on fusing multimodal data to enhance target recognition performance [36,37,38]. Multimodal learning involves extracting and integrating features from multiple data sources, and multimodal fusion methods are generally categorized into early fusion, intermediate fusion, and late fusion. The early fusion method concatenates features from different modalities at the pre-model stage and processes them through a unified model. In intermediate fusion, features from different modalities are extracted independently using modality-specific networks and fused at an intermediate layer before further processing through a shared network. For example, feature-level fusion combining the HRRP, range instantaneous Doppler (RID), and joint time frequency (JTF) data has been presented in [33], which achieves improved target recognition performance. Similarly, the multidomain feature-level fusion network based on an attention mechanism named MdFFNet [34] leverages the complementary characteristics of the HRRP, RCS, and TF data, incorporating an attention mechanism in the feature-level fusion module to selectively enhance important domain features. The late fusion method processes the data from different modalities through independent networks and performs the fusion at the decision-making stage. For instance, Tian et al. fused the narrowband and broadband radar data at the decision level [32], and other studies have also combined broadband and narrowband radar echoes for target recognition. These methods combining broadband and narrowband radar data have shown promising results in their respective experimental settings, with recognition accuracies typically reaching above 95%. Although the multimodal fusion of broadband and narrowband radar demonstrates significant performance advantages in target recognition, radar systems typically employ narrowband radar echoes due to practical engineering constraints. Narrowband radar echoes offer numerous advantages, including highly focused energy, strong anti-interference capabilities, and low signal processing complexity. Consequently, achieving efficient micromotion target recognition based on single narrowband echoes has become an important research direction in this field.
To fully exploit the information embedded in the multimodal data derived from narrowband radar echoes, this paper emphasizes the exploration of the role and interdependence of each modality. By interpreting multimodal fusion as a hierarchical interaction process, this paper proposes a graph fusion network constructed by the cross-modal attention mechanism, named GF-AM Net, which achieves high-precision micromotion target identification by fusing multimodal features. First, convolutional neural networks (CNNs) are used to extract features from the RCS, TF, and CVD modalities, effectively capturing their unique characteristics. Then, these features are progressively fused within a graph structure through a cross-modal attention mechanism, simulating multilevel interactions across unimodal, bimodal, and trimodal configurations. The hierarchically fused features are subsequently processed through a classification module, facilitating the accurate recognition of micromotion targets from narrowband radar data.
The main contributions of this paper are as follows: (1) A multi-domain feature fusion method is proposed for narrowband radar signals, which effectively integrates RCS, TF, and CVD characteristics extracted from narrowband radar echoes to achieve robust micromotion target identification. (2) A graph fusion network based on the cross-modal attention mechanism is designed. In the network architecture, a graph-like structure is utilized to hierarchically design single-modal, dual-modal, and tri-modal interactions, where modal interactions are treated as vertices. Different vertex information is fused through the cross-modal attention mechanism to better capture the inter-modal interactions. (3) Through simulation experiments, the effectiveness of the GF-AM Net in narrowband radar data is verified, which significantly improves the identification accuracy of micromotion targets.

2. Signal Modeling

2.1. Micromotion Signal Modeling

In challenging radar detection scenarios, primary targets and interfering targets show different micromotion forms during flight, through which the targets can be classified and identified. The motion of the micromotion model usually manifests as precession [33,34], as shown in Figure 1. If the target spins, it rotates around its symmetry axis with an angular velocity $\omega_s$; the precession is a simultaneous rotation around the axis $ON$ with an angular velocity $\omega_c$ and a precession angle $\theta$, i.e., the angle between the axes $ON$ and $Oz$. Let the scattering center $P$ on the spatial cone be located in the coordinate system $oxyz$, with initial coordinate $\mathbf{p}_0 = (x_p, y_p, z_p)^T$. At time $t$, in the precession model, the coordinates of the scattering center $P$ in the coordinate system $oxyz$ are given by
$$\mathbf{p}_c(t) = \mathbf{R}_c(t)\, \mathbf{R}_s(t)\, \mathbf{p}_0$$
where $\mathbf{R}_s(t)$ is the spin matrix, $\mathbf{R}_c(t)$ is the precession matrix, and the angular velocity of the spin is $\omega_s$. From the Rodrigues formula, the rotational transformation matrix $\mathbf{R}(t)$ can be expressed as follows:
$$\mathbf{R}(t) = \mathbf{I} + \hat{\mathbf{e}}\sin(\omega t) + \hat{\mathbf{e}}^2\left(1 - \cos(\omega t)\right)$$
where $\hat{\mathbf{e}}$ is the skew-symmetric matrix of the angular rotation axis $\mathbf{e} = [e_x, e_y, e_z]^T$. Thus, the spin matrix is given by
$$\mathbf{R}_s(t) = \begin{bmatrix} \cos(\omega_s t) & -\sin(\omega_s t) & 0 \\ \sin(\omega_s t) & \cos(\omega_s t) & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
Similarly, the precession matrix $\mathbf{R}_c(t)$ can be expressed as follows:
$$\mathbf{R}_c(t) = \begin{bmatrix} \cos(\omega_c t) & -\cos\theta\sin(\omega_c t) & \sin\theta\sin(\omega_c t) \\ \cos\theta\sin(\omega_c t) & 1 - \cos^2\theta\left[1 - \cos(\omega_c t)\right] & \sin\theta\cos\theta\left[1 - \cos(\omega_c t)\right] \\ -\sin\theta\sin(\omega_c t) & \sin\theta\cos\theta\left[1 - \cos(\omega_c t)\right] & 1 - \sin^2\theta\left[1 - \cos(\omega_c t)\right] \end{bmatrix}$$
where $\omega_c$ is the precession angular velocity, and $\theta$ denotes the precession cone angle. These matrices provide a comprehensive description of the motion process and positional changes of the scattering center $P$ in the precession model.
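To make the precession model concrete, the following NumPy sketch builds $\mathbf{R}_s(t)$ and $\mathbf{R}_c(t)$ via the Rodrigues formula and traces the scattering-center trajectory $\mathbf{p}_c(t) = \mathbf{R}_c(t)\mathbf{R}_s(t)\mathbf{p}_0$. The precession axis direction $(0, \sin\theta, \cos\theta)$ and all numerical parameter values (spin and precession rates, precession angle, initial position, sampling) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def spin_matrix(w_s, t):
    """Spin rotation R_s(t) about the target symmetry axis (z-axis)."""
    c, s = np.cos(w_s * t), np.sin(w_s * t)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def precession_matrix(w_c, theta, t):
    """Precession rotation R_c(t) via the Rodrigues formula, with the precession
    axis assumed to be e = (0, sin(theta), cos(theta))."""
    e = np.array([0.0, np.sin(theta), np.cos(theta)])
    e_hat = np.array([[0.0, -e[2], e[1]],
                      [e[2], 0.0, -e[0]],
                      [-e[1], e[0], 0.0]])          # skew-symmetric matrix of e
    return (np.eye(3)
            + e_hat * np.sin(w_c * t)
            + e_hat @ e_hat * (1.0 - np.cos(w_c * t)))

# Illustrative parameters (not taken from the paper)
w_s, w_c = 2 * np.pi * 2.0, 2 * np.pi * 1.0   # spin / precession angular velocities (rad/s)
theta = np.deg2rad(10.0)                      # precession angle
p0 = np.array([0.1, 0.0, 0.5])                # initial scattering-center position (m)

t_axis = np.linspace(0.0, 1.0, 600)           # 1 s observation
trajectory = np.stack([precession_matrix(w_c, theta, t) @ spin_matrix(w_s, t) @ p0
                       for t in t_axis])      # p_c(t) = R_c(t) R_s(t) p0
```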

2.2. Narrowband Radar Signal Modeling

Assume that $S_{th}(\theta_s, f_s)$ represents the static electromagnetic scattering data of the micromotion target, which vary with the attitude angle $\theta_s$ and frequency $f_s$. The attitude angle variation of the micromotion target during flight can be obtained by establishing the micromotion signal model, and the narrowband radar echo signals can be obtained by introducing this dynamically changing attitude angle into the electromagnetic scattering data of the narrowband radar target. Thus, the RCS data and narrowband radar echo signals during the micromotion target's motion can be obtained as $S_{th}(\theta_a(t), f_p)$, where $f_p$ represents a frequency point within the signal bandwidth.
The narrowband radar echo can be processed using STFT to obtain the TF image, which illustrates the signal’s distribution in both time and frequency. The critical step in constructing the TF image is selecting an appropriate window function. This function is applied to extract local signals at each time moment t of the narrowband radar signal. Then, a Fourier transform is performed on the local signals within each window to obtain the local spectrum. Finally, the local spectra are arranged sequentially to form the TF image of the signal. The TF data are constructed by organizing these spectra as follows:
$$T(t, f_d) = \mathrm{STFT}\left(S_{th}(\theta_a(t), f_p)\right)$$
where f d denotes the Doppler frequency. The TF image displays the variation in the frequency of the micromotion target over time during its flight, providing insight into the trend of the micromotion characteristics as the micromotion target moves.
To analyze the micromotion characteristics more comprehensively, the fast Fourier transform (FFT) of the obtained TF image along the time axis can be used to derive the CVD [30,31]. The CVD is a visual representation, which is used to analyze the frequency velocity characteristics of the micromotion target. It provides the valuable information about the repetition tempo and the maximum Doppler shift of each frequency component. Moreover, the CVD does not rely on the initial position of the moving target, making it more robust than the TF image. The relationship between narrowband radar echo, TF image, and CVD is illustrated in Figure 2. The formula for CVD is as follows:
$$C(f_c, f_d) = \mathrm{FFT}\left(T(t, f_d)\right)$$
where f c denotes the cadence frequency.
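As a minimal illustration of the processing chain from narrowband echo to TF image to CVD, the sketch below uses `scipy.signal.stft` followed by an FFT along the slow-time axis. The window length, overlap, and the toy micro-Doppler echo are assumptions for demonstration only, not the settings used in the paper.

```python
import numpy as np
from scipy.signal import stft

def tf_and_cvd(echo, prf=600, nperseg=64, noverlap=56):
    """Build the TF image via STFT and the CVD via an FFT of the TF image
    along the time axis. Window length and overlap are illustrative choices."""
    f_d, t_frames, S = stft(echo, fs=prf, nperseg=nperseg, noverlap=noverlap,
                            return_onesided=False)
    tf_image = np.abs(S)                         # |T(t, f_d)|, shape (n_freq, n_time)
    cvd = np.abs(np.fft.fft(tf_image, axis=1))   # C(f_c, f_d): FFT over slow time
    return tf_image, cvd

# Usage with a toy micro-Doppler-like echo standing in for S_th(theta_a(t), f_p)
prf = 600
t = np.arange(0, 1.0, 1.0 / prf)
echo = np.exp(1j * 2 * np.pi * 40 * np.cos(2 * np.pi * 2 * t))
tf_image, cvd = tf_and_cvd(echo, prf=prf)
```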

3. Proposed Method

In this section, a narrowband radar micromotion target identification method based on the GF-AM Net using multimodal feature fusion is proposed. The detailed network structure of the GF-AM Net is designed to achieve high-precision recognition of micromotion targets in narrowband radar. The GF-AM Net operates in several key stages to efficiently process the multimodal data and enhance the micromotion target recognition accuracy. First, the network extracts unimodal features from three different modalities (the RCS, TF, and CVD) using CNNs. Each modality provides distinct information about the target's micromotion characteristics, and CNNs are well-suited to effectively extract deep-level features within these modalities. The unimodal features derived from each modality are then fed into a graph fusion structure that incorporates a cross-modal attention mechanism. This structure is designed to simulate layer-by-layer interactions between different modalities, aiming to uncover potential inter-modal associations and facilitate progressive fusion from unimodal to multimodal data. Within this fusion structure, the network initially generates bimodal features by combining two different modalities to explore the interrelationships between them. Subsequently, these bimodal features interact further to generate trimodal features, completing the deep fusion of the multimodal data. To optimize the feature representation further, different weights are assigned to the unimodal, bimodal, and trimodal features based on their relative importance. This weighting mechanism enhances the recognition task by emphasizing the most informative features from each modality, allowing for a more comprehensive portrayal of the micromotion characteristics of micromotion targets. By incorporating these weighted features, the network improves both the accuracy and robustness of the recognition task. Finally, the weighted output is passed into a classification module, yielding the final recognition results and enabling high-precision recognition of micromotion targets in narrowband radar. The overall structure of the GF-AM Net is illustrated in Figure 3.

3.1. Unimodal Feature Extraction

In order to achieve the unimodal feature extraction from the different modal data, two distinct CNN structures have been employed to process one-dimensional (1D) time-series data (RCS) and two-dimensional (2D) image data (TF image and CVD). By leveraging the CNN’s efficient information retention capabilities, such as the local connectivity, weight sharing, and pooling operations, the network can effectively extract unimodal features that capture the most relevant information from these modalities. The local connectivity in CNNs allows the network to focus on local patterns in the data, which is particularly useful for both the time-series RCS data and the spatially structured image data (TF image and CVD). Weight sharing reduces the number of parameters, enabling the model to generalize better, while pooling operations help to retain the most important features while reducing the spatial dimensions. This approach ensures that each modality’s features are extracted efficiently, and then the obtained unimodal features can provide the valuable information for subsequent fusion and classification tasks. The structure of the unimodal feature extraction network is shown in Figure 4.
When handling the RCS data, we employ a 1D convolution (Conv1D) module for feature extraction, which is well-suited to time-series data. The Conv1D module operates through layer-by-layer convolution, effectively processing long time-series data by expanding the receptive field. This allows the model to capture both local and global dependencies in the data. For instance, using Conv1D-16@4, where “16” denotes the number of output channels and “4” represents the size of the convolution kernel, the network extracts complex temporal features. After the convolution operation, a pooling layer is applied to reduce the size of the feature map, thereby enhancing the feature compactness. A batch normalization (BN) layer is then used to accelerate the convergence process, followed by a dropout layer to improve the generalization and prevent overfitting.
In contrast to the RCS time-series data, both the TF image and CVD are 2D image data in the frequency domain. To effectively extract complex spatial features from these images, a 2D convolution (Conv2D) module is adopted for feature extraction. Conv2D is capable of recognizing spatial patterns within the images and constructing deep, high-level semantic feature representations layer by layer. Similar to the RCS processing based on the Conv1D module, the Conv2D module is followed by pooling layers, a BN layer, and a dropout layer to enhance the model performance and prevent overfitting.
Ultimately, with the design of two CNNs, this paper extracts independent features from each of the three modalities—RCS, TF image, and CVD. These extracted features serve as the foundational elements for the subsequent cross-modal graph-based fusion. By separately processing the time-series data (RCS) and image data (TF image and CVD) through specialized CNN architectures, the model is able to capture the unique characteristics of each modality. These unimodal features, in turn, provide the necessary input for the graph fusion process, enabling effective integration of the different modalities to enhance the overall performance of the micromotion target recognition.
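A minimal PyTorch sketch of the Conv1D branch for the RCS time series is given below, following the Conv1D-16@4, pooling, BN, and dropout pattern described above. The number of layers, second-stage channel width, activation choices, dropout rate, and 128-dimensional output are assumptions; the Conv2D branch for the TF image and CVD follows the same pattern with 2D operations.

```python
import torch
import torch.nn as nn

class RCSBranch(nn.Module):
    """Sketch of the 1D CNN branch for RCS time series (layer sizes are assumptions)."""
    def __init__(self, out_dim=128, dropout=0.3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=4), nn.ReLU(),     # Conv1D-16@4
            nn.MaxPool1d(2), nn.BatchNorm1d(16), nn.Dropout(dropout),
            nn.Conv1d(16, 32, kernel_size=4), nn.ReLU(),
            nn.MaxPool1d(2), nn.BatchNorm1d(32), nn.Dropout(dropout),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(32, out_dim)

    def forward(self, x):              # x: (batch, 1, N_samples) RCS sequence
        h = self.features(x).squeeze(-1)
        return self.proj(h)            # unimodal feature f_r: (batch, out_dim)

# Usage: 600-sample RCS sequence (1 s at PRF = 600 Hz)
f_r = RCSBranch()(torch.randn(8, 1, 600))
```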

3.2. Graph Fusion Module Constructed by the Cross-Modal Attention Mechanism

In order to better interact two modal features (i.e., vertices in the graph fusion module), the cross-modal attention mechanism is used to effectively implement the interaction of two vertices. This paper takes the generation of $f_{t,c}$ from $f_t$ and $f_c$ as an example, the structure of which is shown in Figure 5. The interaction between these two features is captured by the attention mechanism, facilitating the fusion of complementary information from both modalities. This interaction enhances the representation of the fused features, which is critical for improving the overall recognition accuracy of micromotion targets.
In Figure 5, $f_t$ and $f_c$ are first passed through Split and Conv1D, where Split transforms the input features into inputs suitable for the multimodal attention mechanism, and Conv1D enables each element of the input features to interact with its neighboring elements, as follows:
$$\hat{f}_{\{r,t,c\}} = \mathrm{Conv1D}\left(f_{\{r,t,c\}}, k_{\{r,t,c\}}\right) \in \mathbb{R}^{N_{\{r,t,c\}} \times d}$$
where $k_{\{r,t,c\}}$ is the size of the convolution kernel, $N_{\{r,t,c\}} \times d$ is the shape of the feature obtained after splitting, $N$ is the number of segments, and $d$ is the length of each segment. In order to improve the stability and training efficiency of the neural network, stabilize the input distribution, and enhance the generalization ability of the model, the LayerNorm technique is used [39], which is given by
$$\mathrm{LayerNorm}(f) = \frac{f - \mu}{\sigma} \odot \gamma + \beta$$
where $\mu$ and $\sigma$ are the mean and standard deviation of the input features of the layer, and $\gamma$ and $\beta$ are learnable scaling and shifting parameters that preserve the expressive power of the network.
After the normalization, the two features are subjected to a multi-head cross-modal attention mechanism [40]. The formula for the attention mechanism from the TF image modality to CVD modality is given by
$$\mathrm{Attention}_i^t = \mathrm{softmax}\left(\frac{(X_t W_{t_i}^Q)(X_c W_{c_i}^K)^T}{\sqrt{d_k}}\right)(X_c W_{c_i}^V)$$
$$\mathrm{MultiHead}(Q_t, K_c, V_c) = \mathrm{Concat}\left(\mathrm{Attention}_1^t, \mathrm{Attention}_2^t, \ldots, \mathrm{Attention}_h^t\right) W^O$$
The attention mechanism from the CVD to TF image is formulated as follows:
$$\mathrm{Attention}_i^c = \mathrm{softmax}\left(\frac{(X_c W_{c_i}^Q)(X_t W_{t_i}^K)^T}{\sqrt{d_k}}\right)(X_t W_{t_i}^V)$$
$$\mathrm{MultiHead}(Q_c, K_t, V_t) = \mathrm{Concat}\left(\mathrm{Attention}_1^c, \mathrm{Attention}_2^c, \ldots, \mathrm{Attention}_h^c\right) W^O$$
where $W_{t_i}^Q$, $W_{c_i}^K$, $W_{c_i}^V$, $W_{c_i}^Q$, $W_{t_i}^K$, and $W_{t_i}^V$ are the learnable projection matrices, with $i \in \{1, \ldots, h\}$ denoting the $i$-th attention head and $h$ being the total number of attention heads, and $X_t$ and $X_c$ are the feature sequences of the two modalities, i.e., the TF image and CVD. After computing the cross-modal attention, a residual connection of the attention outputs is applied, and $f_{t,c}$ is then obtained by passing the results through a LayerNorm layer and performing a Hadamard product. Thus, we have
$$\mathrm{Output}_t = \mathrm{LayerNorm}\left(\mathrm{MultiHead}(Q_t, K_c, V_c) + X_t\right)$$
$$\mathrm{Output}_c = \mathrm{LayerNorm}\left(\mathrm{MultiHead}(Q_c, K_t, V_t) + X_c\right)$$
$$f_{t,c} = \mathrm{Output}_t \odot \mathrm{Output}_c$$
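The bimodal vertex construction described above can be sketched with `torch.nn.MultiheadAttention` as a stand-in for the multi-head cross-modal attention; the Split/Conv1D front-end is omitted, and the segment count, embedding width, and number of heads are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Sketch of f_{t,c} = Output_t ⊙ Output_c from two modality feature sequences."""
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.attn_t = nn.MultiheadAttention(d, heads, batch_first=True)  # TF queries CVD
        self.attn_c = nn.MultiheadAttention(d, heads, batch_first=True)  # CVD queries TF
        self.norm_t = nn.LayerNorm(d)
        self.norm_c = nn.LayerNorm(d)

    def forward(self, x_t, x_c):             # x_t, x_c: (batch, N, d) segment sequences
        a_t, _ = self.attn_t(x_t, x_c, x_c)  # MultiHead(Q_t, K_c, V_c)
        a_c, _ = self.attn_c(x_c, x_t, x_t)  # MultiHead(Q_c, K_t, V_t)
        out_t = self.norm_t(a_t + x_t)       # residual connection + LayerNorm
        out_c = self.norm_c(a_c + x_c)
        return out_t * out_c                 # Hadamard product -> f_{t,c}

# Usage
f_tc = CrossModalAttention()(torch.randn(8, 16, 64), torch.randn(8, 16, 64))
```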
To fully exploit the information from each modality and explore the interactions of bimodal and trimodal features, we utilize a three-layer graph fusion network module for the progressive fusion of unimodal, bimodal, and trimodal features [37]. In this structure, each layer of the graph corresponds to a different level of modality fusion: the first layer handles unimodal features, the second layer handles bimodal features, and the third layer captures trimodal features. By modeling the interaction between the modalities as nodes and edges in the graph, the network learns to progressively refine the inter-modal relationships at each layer. This layer-by-layer interaction allows the model to capture fine-grained connections between the different features, facilitating a more comprehensive fusion and ultimately improving the overall performance of the target recognition task.
In the first layer (unimodal layer), the features of each modality mainly carry self-information, denoted as $f_r$, $f_t$, and $f_c$ (the features from the RCS, TF image, and CVD modalities, respectively). Since the importance of the features may vary from modality to modality, we apply an attention mechanism in this layer to weight the contribution of each modality. The attention mechanism assigns weights to the modal features according to their importance, resulting in a weighted-average unimodal feature output. After this processing, each modal feature not only contains its own information but also reflects its importance in the network, as shown below:
$$\alpha_m = \mathrm{MAN}(f_m)$$
$$U = \frac{1}{3}\sum_{m} \alpha_m \cdot f_m, \quad m \in \{r, t, c\}$$
where $\alpha_m$ is the weight of each modality, $\mathrm{MAN}$ is the network from which the weights are extracted, consisting of sigmoid-activated dense layers, and $U$ is the final unimodal vector.
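A minimal sketch of this first (unimodal) layer is shown below, with the MAN realized as a small sigmoid-activated dense network; the hidden width, feature dimension, and exact MAN depth are assumptions.

```python
import torch
import torch.nn as nn

class UnimodalWeighting(nn.Module):
    """Sketch of the unimodal layer: MAN scores each modality, U is the weighted average."""
    def __init__(self, d=128, hidden=64):
        super().__init__()
        self.man = nn.Sequential(nn.Linear(d, hidden), nn.Sigmoid(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, f_r, f_t, f_c):              # each: (batch, d)
        feats = [f_r, f_t, f_c]
        alphas = [self.man(f) for f in feats]      # alpha_m per modality, (batch, 1)
        U = sum(a * f for a, f in zip(alphas, feats)) / 3.0
        return U, alphas

# Usage
U, alphas = UnimodalWeighting()(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128))
```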
The second layer, referred to as the bimodal dynamic learning layer, focuses on the interaction between two modalities. In this layer, the vertices representing the bimodal features are obtained through a two-by-two interaction of the unimodal information. Specifically, the cross-modal attention mechanism, as discussed earlier, is employed to facilitate the interaction between pairs of unimodal features. This mechanism enables the dynamic exchange of information between different modalities, allowing the model to learn the interdependencies and complementary information between them. The detailed calculations for this interaction have been provided in the previous section, where the attention mechanism adaptively modulates the information flow between the modalities to refine the bimodal feature representation. Thus, we have
$$f_{m_1,m_2} = \mathrm{CAM}(f_{m_1}, f_{m_2}) \odot \mathrm{CAM}(f_{m_2}, f_{m_1}), \quad f_{m_1}, f_{m_2} \in \{f_r, f_t, f_c\},\ f_{m_1} \neq f_{m_2}$$
where CAM refers to the cross-modal attention mechanism, in which the first parameter serves as the query and the second parameter provides the key and value. The operation $\odot$ denotes the Hadamard product, i.e., element-wise multiplication. Through this cross-modal attention mechanism, the model is able to extract features of interest from each modality for the other, facilitating the capture of complementary information between modalities. This interaction enables the model to generate more comprehensive, multi-dimensional feature representations, which enrich the overall understanding of the target. By tapping into the complementary characteristics of each modality, the model significantly enhances its ability to generalize and improves both the accuracy and robustness of the target recognition task.
$f_{m_1,m_2}$ is the information vector of a vertex in the second layer. For the edge weights, note that the similarity of two vectors increases with their inner product: when the inner product of two vectors is large, their information overlaps and there is little or no complementary information between them. Given that the bimodal interactions exhibit varying degrees of significance, the edge weights connecting vertices between the first and second layers are determined as follows:
$$S_{m_1,m_2} = \tilde{f}_{m_1}^T \tilde{f}_{m_2}$$
$$\hat{\alpha}_{m_1,m_2} = \frac{\alpha_{m_1} + \alpha_{m_2}}{S_{m_1,m_2} + 0.5}$$
$$\alpha_{m_1,m_2} = \frac{e^{\hat{\alpha}_{m_1,m_2}}}{\sum_{m_i \neq m_j} e^{\hat{\alpha}_{m_i,m_j}}}$$
where $\tilde{f}_{m_1}$ denotes the softmax-normalized vector of $f_{m_1}$ (the normalization ensures that the computed similarity lies between 0 and 1), $S_{m_1,m_2}$ denotes the similarity of the two vectors, $\hat{\alpha}_{m_1,m_2}$ denotes the unnormalized edge weight, and $\alpha_{m_1,m_2}$ denotes the normalized weight. The weighted average of the bimodal vertex information of the second layer can then be obtained as follows:
$$B = \sum_{m_1 \neq m_2} \alpha_{m_1,m_2} \cdot f_{m_1,m_2}$$
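A sketch of the second-layer edge-weight computation is given below, following the similarity, weighting, and softmax normalization defined above (as reconstructed). Treating the similarity as the inner product of softmax-normalized feature vectors, and the per-sample shapes of `feats` and `alphas`, are assumptions of this illustration.

```python
import torch
import torch.nn.functional as F

def bimodal_edge_weights(feats, alphas):
    """Edge weights between first- and second-layer vertices.
    feats: dict of unimodal features (batch, d); alphas: dict of unimodal weights (batch,)."""
    pairs = [('r', 't'), ('r', 'c'), ('t', 'c')]
    raw = {}
    for m1, m2 in pairs:
        f1, f2 = F.softmax(feats[m1], dim=-1), F.softmax(feats[m2], dim=-1)
        s = (f1 * f2).sum(dim=-1)                           # S_{m1,m2} in (0, 1)
        raw[(m1, m2)] = (alphas[m1] + alphas[m2]) / (s + 0.5)
    stacked = torch.stack([raw[p] for p in pairs], dim=0)   # (3, batch)
    norm = F.softmax(stacked, dim=0)                        # normalized alpha_{m1,m2}
    return {p: norm[i] for i, p in enumerate(pairs)}

# Usage
feats = {m: torch.randn(8, 128) for m in 'rtc'}
alphas = {m: torch.rand(8) for m in 'rtc'}
weights = bimodal_edge_weights(feats, alphas)
```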
In the bimodal fusion stage of the second layer, the cross-modal attention mechanism is employed to interact the vertices of the bimodal features two-by-two, generating the trimodal vertices of the third layer. This fusion process results in three distinct trimodal vertices, each representing the integration of the information from different modalities. Furthermore, to achieve a more comprehensive integration of modal features, the cross-layer fusion is also performed by combining the unimodal feature vertices from the first layer, those not involved in the corresponding bimodal vertex, into the second layer’s bimodal vertices. This generates additional trimodal vertices, leading to a total of six trimodal vertices, each representing a different combination of modalities. This multi-layered fusion approach, incorporating both bimodal and trimodal interactions, allows the network to capture the more complex and nuanced micromotion information from diverse modal combinations. For the edges connecting the third layer, the same weight calculation method is applied as in the previous layers, ensuring the consistent and effective fusion. The weighted information from each trimodal vertex is then aggregated, resulting in the final trimodal information representation, denoted as T. This comprehensive fusion strategy not only strengthens the correlation between the modalities, but also equips the network with the detailed modal interaction information, which can enhance its ability to accurately capture the micromotion features of micromotion targets.

3.3. Adaptive Weight Classification Module

From the graph fusion module, the information corresponding to the unimodal, bimodal, and trimodal features, denoted as U, B, and T, respectively, can be obtained. To enable the model to flexibly assign weights to the different layers of features and adaptively adjust the contribution of each layer, the unimodal, bimodal, and trimodal features are concatenated. The concatenated features are then passed through a multilayer perceptron (MLP) composed of three fully connected layers, where the potential interactions between the different features are captured. Finally, the weight coefficients are computed through a Sigmoid activation function, as shown in the Adaptive Weight Classification Module in Figure 3. The process is formalized as follows:
$$\omega = \sigma\left(W_3 \cdot \mathrm{ReLU}\left(W_2 \cdot \mathrm{ReLU}\left(W_1 \cdot x + b_1\right) + b_2\right) + b_3\right)$$
where $x$ represents the input after concatenating U, B, and T; $W_1$, $W_2$, $W_3$, $b_1$, $b_2$, and $b_3$ are all learnable parameters of the fully connected layers; and $\sigma$ is the Sigmoid activation function.
$\omega$ is divided into three weight coefficients, $\omega = [\omega_1^T, \omega_2^T, \omega_3^T]^T$. These coefficients weight the input features from the different layers, and the weighted fusion is finally obtained with the following expression:
$$F_{\mathrm{fuse}} = \omega_1 U + \omega_2 B + \omega_3 T$$
where ω 1 is the weight of U, ω 2 is the weight of B, and ω 3 is the weight of T.
By using the weighted features, the contribution of each layer in the graph fusion network module can be dynamically adjusted, which can enhance the model’s robustness and improve the feature representation. The weighted features are then passed into a feed-forward neural network (FNN), which consists of multiple fully connected layers, activation functions, batch normalization, and dropout layers. This structure enables the network to map the features to the final classification output. The FNN performs the hierarchical feature compression and expansion, allowing the network to extract the global information from the high-dimensional input features. Additionally, regularization techniques, such as the dropout and batch normalization, are employed to improve the generalization ability of the model, thereby reducing the risk of the overfitting. This process ensures that the rich feature set is preserved while enhancing the model’s robustness to the noise, ultimately improving the final classification accuracy. Through this method, the network is able to make more accurate predictions while maintaining a high degree of the generalization to unseen data.
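A minimal sketch of this adaptive weight classification module is given below: a three-layer MLP with a Sigmoid output produces $\omega$, which is split into $\omega_1$, $\omega_2$, $\omega_3$ to weight U, B, and T before a small feed-forward classifier. The hidden sizes, dropout rate, and classifier depth are assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveWeightFusion(nn.Module):
    """Sketch of the adaptive weight module plus the final FNN classifier."""
    def __init__(self, d=128, hidden=256, num_classes=4):
        super().__init__()
        self.weight_mlp = nn.Sequential(          # three FC layers + Sigmoid -> omega
            nn.Linear(3 * d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 * d), nn.Sigmoid(),
        )
        self.classifier = nn.Sequential(           # FNN with BN and dropout
            nn.Linear(d, hidden), nn.ReLU(), nn.BatchNorm1d(hidden), nn.Dropout(0.3),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, U, B, T):                    # each: (batch, d)
        w = self.weight_mlp(torch.cat([U, B, T], dim=-1))
        w1, w2, w3 = w.chunk(3, dim=-1)            # omega_1, omega_2, omega_3
        f_fuse = w1 * U + w2 * B + w3 * T          # F_fuse = w1*U + w2*B + w3*T
        return self.classifier(f_fuse)

# Usage
logits = AdaptiveWeightFusion()(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128))
```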

4. Experimental Results and Analysis

4.1. Dataset Generation

Since the measured data are difficult to obtain, the dimensions and micromotion parameters of the four types of micromotion targets are designed according to the literature [33,34], and their three-dimensional (3D) model structure is shown in Figure 6. The physical optics (PO) method is used to obtain the simulated narrowband radar electromagnetic (EM) data and RCS data. Since the designed four types of micromotion targets have rotational symmetry, the azimuth angle is fixed at 90°, and the pitch angle is varied from 0 to 180° with an interval of 0.1°, so that the narrowband radar echo data can be obtained, as shown in Figure 7.
In order to better simulate the real situation of the micromotion target, the pulse repetition frequency (PRF) of the narrowband radar signal is set to 600 Hz and the observation time to 1 s. The initial pitch angle ranges from 31° to 40° with an interval of 1°, and the precession frequency and precession angle of each target are different, with a precession frequency interval of 0.2 Hz and a precession angle interval of 0.15°; the actual parameters are shown in Table 1. Using these micromotion parameters together with the static narrowband radar echo data obtained from the simulation, 1210 samples (10 × 11 × 11) are generated for each category, giving 4840 samples for the four categories of micromotion targets. To better reflect the recognition of micromotion targets in practice, this paper divides the dataset from 31° to 40° according to the angle, with the first fifty percent as the training set and the second fifty percent as the test set. To simulate the effect of noise on the echo, Gaussian noise is added to the signal model at four signal-to-noise ratio (SNR) levels: 0 dB, 5 dB, 10 dB, and 15 dB. The dynamic narrowband radar echo data are processed accordingly to obtain the RCS time series, TF image, and CVD of the micromotion target. Partial plots of the three modalities of the micromotion target at an SNR of 5 dB are shown in Figure 8, Figure 9 and Figure 10.
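The noise injection at a prescribed SNR can be sketched as follows; the use of complex white Gaussian noise scaled by the measured signal power is a standard choice and should be read as an assumption about the exact simulation setup.

```python
import numpy as np

def add_awgn(signal, snr_db):
    """Add complex Gaussian noise to a radar echo at the given SNR in dB."""
    sig_power = np.mean(np.abs(signal) ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = np.sqrt(noise_power / 2) * (np.random.randn(*signal.shape)
                                        + 1j * np.random.randn(*signal.shape))
    return signal + noise

# Usage: corrupt a toy narrowband echo at each of the four SNR levels
prf = 600
t = np.arange(0, 1.0, 1.0 / prf)
echo = np.exp(1j * 2 * np.pi * 40 * np.cos(2 * np.pi * 2 * t))   # toy micro-Doppler echo
noisy = {snr: add_awgn(echo, snr) for snr in (0, 5, 10, 15)}
```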

4.2. Hyperparameter Settings and Comparison Methods

In this paper, two different networks are used for feature extraction: Conv1D is used to extract features from the RCS time series, and Conv2D is used to extract unimodal features from the TF image and CVD; the extracted features are then fed into the graph fusion network for the final output. The learning rate of the whole network is set to 0.0005, the batch size to 64, and the number of epochs to 100, and Adam is used as the optimizer to update the weights. The comparison methods are given as follows:
(1) RCS Net [25]. This model mainly consists of a one-dimensional CNN with the RCS time series as its input, which is utilized for the micromotion target recognition.
(2) MobileViT [41]. This model combines a CNN with a Vision Transformer, enabling it to capture both local features and global information. It first uses convolutional layers to extract local information from the image, and then uses the attention mechanism to focus on long-distance dependencies and global contextual information, allowing the model to capture the semantic information of the image more comprehensively, which makes it well suited to CVD and TF images.
(3) Transfer learning-based parallel network (TLPS Net) [31]. This model takes the TF image and CVD as inputs and applies transfer learning on the ResNet-18 network to construct two parallel networks for micromotion target recognition.
(4) Attention-augmented cross-modal feature fusion recognition network (ACM-FR Net) [33]. This model employs a feature fusion approach to integrate data from three modalities, i.e., the HRRP, JTF image, and RID. Initially, the features of each modality are extracted, after which the modal information is interactively fused through an adaptive cross-modal feature fusion mechanism. This process enables the identification of micromotion targets. This method originally utilizes modalities derived from broadband radar echo data, so corresponding modifications are made in this paper to ensure that the modalities used are consistent with those in the current model. Specifically, the network designed for the HRRP and JTF data is applied to the CVD and TF data, while the RCS data are processed using the network proposed in this paper.

4.3. Performance Experiments

To demonstrate the effectiveness of the proposed method, it was compared with four other methods, and the accuracy results are shown in Table 2. As can be seen, the proposed method outperforms all the other four methods, achieving an accuracy of 99.55%. This significant improvement can be attributed to the combination of the cross-modal attention with the graph fusion network module, which effectively fuses the information across different modalities, thereby enhancing the recognition accuracy of the micromotion target.
To better understand the learning dynamics of different methods, we analyze their training behavior by examining the evolution of accuracy and loss metrics. The convergence curves of different methods during the training process are presented in Figure 11. The proposed GF-AM Net demonstrates stable training behavior throughout the training stage. The accuracy and loss curves indicate consistent and smooth convergence during the training process. Notably, while maintaining the stable training performance, the proposed method achieves the superior accuracy on the test set (as shown in Table 2), demonstrating the better generalization ability.
To further analyze the performance of the proposed method across different target classes, the confusion matrix for each method is plotted and shown in Figure 12. In the confusion matrix, the horizontal axis represents the predicted targets, the vertical axis represents the actual targets, and the values in the cells indicate the accuracy rate. From the matrix, it is evident that the accuracy difference between the smallest and largest class is only 1.5%, which highlights the balanced performance of the proposed method across different classes when compared to other methods.
To better highlight the superior performance of the proposed method compared to other methods under narrowband radar conditions, t-SNE is utilized to visualize the feature distributions of the proposed and all comparison methods, as shown in Figure 13. The results clearly highlight the ability of each method to differentiate the data. It can be seen that the proposed method exhibits a more distinct separation between the four categories, with a higher degree of clustering among similar points, in contrast to the other methods. The proposed method yields a more compact and well-defined clustering structure, suggesting that the GF-AM Net has effectively learned a richer set of features. As a result, the proposed method achieves a better distribution of features in the feature space, further confirming its superior ability to distinguish between micromotion targets.

4.4. Ablation Experiments

To more effectively validate the performance of the tri-modal fusion approach compared to unimodal and bimodal methods, this paper first performed the unimodal feature extraction, then fed the information from the first and second layers of the graph fusion module into the output classification layer. Specifically, the unimodal mode uses only the vertices’ information from the first layer, while the bimodal mode leverages the fused features from the first and second layers. The results presented in Table 3 show that, under a 5 dB SNR condition, the tri-modal fusion method improves the recognition performance by at least 6.4% compared to the unimodal mode and at least 0.7% compared to the bimodal mode. This indicates that the information fusion across three modalities provides a more comprehensive feature representation than unimodal or bimodal features, resulting in improved robustness and better noise immunity, which is crucial for the accurate recognition of micromotion targets. These results demonstrate that the proposed network can efficiently extract representative features from all three modalities.
To comprehensively evaluate the proposed multidomain adaptive feature fusion algorithm, we conduct comparative experiments with different fusion approaches, including concatenation-based (Concat Net) and weighted summation-based (Weight Net) fusion techniques. The experimental results are shown in Table 4. The results demonstrate that the proposed feature fusion method achieves the superior performance compared to these alternative approaches. Through the integration of adaptive cross-modal attention mechanisms and graph fusion techniques, the proposed method enables the more precise and nuanced integration of the multimodal data, leading to enhanced target recognition performance. While each fusion approach offers its own merits in the feature integration, the proposed method excels by effectively modeling and leveraging the complex interdependencies between different modalities, which achieves the more robust and discriminative representations of the micromotion target.
To systematically evaluate the effectiveness of different methods in the ablation study, we conduct a detailed analysis of their training dynamics. Figure 14 illustrates the training curves obtained from the ablation experiments, demonstrating the evolution of accuracy and loss metrics throughout the training process. While several comparative methods exhibit comparable convergence patterns with the proposed approach, empirical results on the test set demonstrate that the proposed method achieves superior generalization capability, as shown in Table 3 and Table 4.

4.5. Different Networks at Different SNRs

In the above experiments, Gaussian noise is introduced into the RCS, TF image, and CVD data to simulate real-world conditions. In this section, this paper further evaluates the robustness of the proposed method by testing its performance at different SNRs, ranging from 0 dB to 15 dB. The experimental results are summarized in Table 5. As expected, the recognition accuracy of all methods tends to decrease as the SNR decreases. However, the proposed method demonstrates remarkable robustness, achieving a recognition accuracy of 96.61% even at 0 dB SNR. This accuracy is significantly higher than that of the other comparison methods, highlighting the strong anti-interference capabilities and overall robustness of the proposed approach in noisy environments. The results confirm that the proposed method can effectively extract meaningful features and maintain high recognition accuracy, even under extreme conditions where the SNR is as low as 0 dB. This underscores the proposed method's ability to handle noisy data and its suitability for real-world applications in challenging radar scenarios.

4.6. Discussion

To evaluate the computational efficiency of different models, we compare their theoretical computational cost in terms of GFLOPs (giga floating-point operations). As shown in Table 6, the proposed GF-AM Net has a moderate computational cost of 0.52 GFLOPs, which is significantly lower than that of TLPS Net (4.74 GFLOPs). While RCS Net shows the lowest computational complexity at 0.01 GFLOPs, the proposed method maintains a better balance between model complexity and performance (as discussed in previous sections). In future work, we plan to further optimize the network architecture to reduce its computational complexity while maintaining its high performance.

5. Conclusions

The aim of this paper is to address the challenge of micromotion target identification under narrowband radar conditions. To fully use the information contained in narrowband waveforms, the GF-AM Net for micromotion target identification using narrowband radar was proposed, which makes comprehensive use of the information from different modalities. In this network, two types of CNNs are first employed to extract features from three modalities (RCS, TF image, and CVD), followed by a graph fusion module constructed by the cross-modal attention mechanism, which simulates multi-level interactions between unimodal, bimodal, and trimodal features. The unimodal, bimodal, and trimodal features are then fused according to their relative importance, ultimately achieving high-precision micromotion target recognition under narrowband radar conditions. The experimental results from electromagnetic simulations involving four types of micromotion targets demonstrate that the proposed GF-AM Net achieves a recognition accuracy of 99.55%, showcasing its high accuracy and robustness. This validates the effectiveness of the GF-AM Net in extracting and integrating multimodal features to improve micromotion target identification, even in challenging radar environments. The proposed method offers significant potential for real-world applications, where the fusion of multimodal information plays a crucial role in enhancing the performance of radar-based target recognition systems.

Author Contributions

Conceptualization, Y.Z., H.X. and T.G.; methodology, Y.Z. and H.X.; software, Y.Z. and H.X.; validation, Y.Z. and H.X.; formal analysis, H.X., N.Z. and Z.L.; investigation, Y.Z. and H.X.; resources, Y.Z., H.X. and H.L.; data curation, Y.Z., H.X. and M.G.; writing—original draft preparation, Y.Z. and H.X.; writing—review and editing, Y.Z., H.X., H.L., M.G. and B.X.; visualization, Y.Z. and H.X.; supervision, Y.Z., H.X. and T.G.; project administration, Y.Z., H.X. and Z.L.; funding acquisition, H.X., T.G. and N.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Shenzhen Science and Technology Program (Grant No. JCYJ20240813151238049 and Grant No. 202206193000001, 20220815171723002), by the Guangdong Basic and Applied Basic Research Foundation (Grant No. 2023A1515011588), by the Fundamental Research Funds for the Central Universities, Sun Yat-sen University (Grant No. 24qnpy202 and Grant No. 24qnpy164), by the China Postdoctoral Science Foundation (Grant No. 2024M753740), by the National Natural Science Foundation of China (Grant No. 62203465 and Grant No. 62201614), and the Science and Technology Planning Project of Key Laboratory of Advanced IntelliSense Technology, Guangdong Science and Technology Department (Grant No. 2023B1212060024). Hongtu Xie is the corresponding author.

Data Availability Statement

Data are contained within the article and can be made available by the authors.

Acknowledgments

The authors would like to thank the editors and reviewers for their very competent comments and helpful suggestions to improve this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xie, H.; Zhang, Y.; He, J.; Yi, S.; Zhang, L.; Zhu, N. Foliage-concealed target change detection scheme based on convolutional neural network in low-frequency ultra-wideband SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 19302–19316. [Google Scholar]
  2. Xie, H.; Yi, S.; He, J.; Zhang, Y.; Lu, Z.; Zhu, N. Low-frequency ultra-wideband synthetic aperture radar foliage-concealed target change detection strategy based on image stacks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 19817–19830. [Google Scholar]
  3. Xie, H.; He, J.; Lu, Z.; Hu, J. Two-level feature-fusion ship recognition strategy combining HOG features with dual-polarized data in SAR images. Remote Sens. 2023, 15, 4393. [Google Scholar] [CrossRef]
  4. Tang, W.; Yu, L.; Wei, Y.; Tong, P. Radar target recognition of ballistic missile in complex scene. In Proceedings of the 2019 IEEE International Conference on Signal, Information and Data Processing (ICSIDP), Chongqing, China, 11–13 December 2019; pp. 1–6. [Google Scholar]
  5. Zhu, J.; Yin, T.; Guo, W.; Zhang, B.; Zhou, Z. An underwater target azimuth trajectory enhancement approach in BTR. Appl. Acoust. 2025, 230, 110373. [Google Scholar]
  6. Zhu, J.; Xie, Z.; Jiang, N.; Song, Y.; Han, S.; Liu, W.; Huang, X. Delay-Doppler map shaping through oversampled complementary sets for high-speed target detection. Remote Sens. 2024, 16, 2898. [Google Scholar] [CrossRef]
  7. Zhu, J.; Song, Y.; Jiang, N.; Xie, Z.; Fan, C.; Huang, X. Enhanced Doppler resolution and sidelobe suppression performance for Golay complementary waveforms. Remote Sens. 2023, 15, 2452. [Google Scholar] [CrossRef]
  8. Xie, Z.; Wu, L.; Zhu, J.; Lops, M.; Huang, X.; Shankar, B. RIS-Aided radar for target detection: Clutter region analysis and joint active-passive design. IEEE Trans. Signal Process. 2024, 72, 1706–1723. [Google Scholar]
  9. Choi, I.O.; Park, S.H.; Kim, M.; Kang, K.B.; Kim, K.T. Efficient discrimination of ballistic targets with micromotions. IEEE Trans. Aerosp. Electron. Syst. 2019, 56, 1243–1261. [Google Scholar]
  10. Duan, J.; Xie, P.; Zhang, L.; Ma, Y. Space target dynamic identification by exploiting geometrical feature flow from ISAR image sequences. IEEE Sens. J. 2022, 22, 21877–21884. [Google Scholar]
  11. Zhang, X.; Yang, P.; Cao, D. Synthetic aperture image enhancement with near-coinciding Nonuniform sampling case. Comput. Electr. Eng. 2024, 120, 109818. [Google Scholar]
  12. Xie, Z.; Xu, Z.; Han, S.; Zhu, J.; Huang, X. Modulus constrained minimax radar code design against target interpulse fluctuation. IEEE Trans. Veh. Technol. 2023, 72, 13671–13676. [Google Scholar] [CrossRef]
  13. Silberman, G.L. Parametric classification techniques for theater ballistic missile defense. Johns Hopkins APL Tech. Dig. 1998, 19, 323. [Google Scholar]
  14. Chen, X.; Ye, C.; Wang, Y.; Zhang, Y.; Hu, Q. Unambiguous estimation of multidimensional parameters for space precession targets with wideband radar measurements. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5112716. [Google Scholar] [CrossRef]
  15. Hanif, A.; Muaz, M.; Hasan, A.; Adeel, M. Micro-Doppler based target recognition with radars: A review. IEEE Sens. J. 2022, 22, 2948–2961. [Google Scholar] [CrossRef]
  16. Lei, X.; Fu, X.; Wang, C.; Gao, M. Statistical feature selection of narrowband RCS sequence based on greedy algorithm. In Proceedings of the 2011 IEEE CIE International Conference on Radar, Chengdu, China, 24–27 October 2011; Volume 2, pp. 1664–1667. [Google Scholar]
  17. Gao, H.; Xie, L.; Wen, S.; Kuang, Y. Micro-Doppler signature extraction from ballistic target with micromotions. IEEE Trans. Aerosp. Electron. Syst. 2010, 46, 1969–1982. [Google Scholar] [CrossRef]
  18. Wang, Z.; Luo, Y.; Li, K.; Yuan, H.; Zhang, Q. Micro-Doppler parameters extraction of precession cone-shaped targets based on rotating antenna. Remote Sens. 2022, 14, 2549. [Google Scholar] [CrossRef]
Figure 1. Diagram of the micromotion model.
Figure 2. Process of converting the narrowband radar echoes into the TF image and CVD.
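Figure 2 summarizes the conversion of the slow-time narrowband echo into the two image-domain representations used by the network. As a minimal sketch of one common way to perform this conversion (not necessarily the authors' exact processing chain), the code below computes a TF image with a short-time Fourier transform and then a CVD by applying an FFT along the slow-time axis of the TF magnitude; the function name, window length, and the use of scipy.signal.stft are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft

def tf_and_cvd(echo, fs, nperseg=128, noverlap=120):
    """Convert a complex slow-time narrowband echo into a TF image (STFT)
    and a CVD (FFT of each Doppler bin across slow time). Illustrative only."""
    # TF image: short-time Fourier transform of the complex echo
    _, _, Z = stft(echo, fs=fs, nperseg=nperseg, noverlap=noverlap,
                   return_onesided=False)
    tf_image = np.abs(np.fft.fftshift(Z, axes=0))  # Doppler (rows) x time (cols)

    # CVD: FFT along the time axis reveals the cadence (repetition) frequencies
    cvd = np.abs(np.fft.fftshift(np.fft.fft(tf_image, axis=1), axes=1))
    return tf_image, cvd
```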
Figure 3. Overall structure of the GF-AM Net.
Figure 4. Structure diagram of the unimodal feature extraction network.
Figure 5. Cross-modal attention mechanism in the graph fusion module.
Figure 6. 3D structures of the four types of micromotion targets. From left to right: Target 1, Target 2, Target 3, and Target 4.
Figure 7. Static narrowband radar echo data for the four types of micromotion targets. (a) Target 1; (b) Target 2; (c) Target 3; (d) Target 4.
Figure 8. RCS sequences for the four types of micromotion targets. (a) Target 1; (b) Target 2; (c) Target 3; (d) Target 4.
Figure 9. TF images for the four types of micromotion targets. (a) Target 1; (b) Target 2; (c) Target 3; (d) Target 4.
Figure 10. CVDs for the four types of micromotion targets. (a) Target 1; (b) Target 2; (c) Target 3; (d) Target 4.
Figure 11. Comparison of training curves for different methods.
Figure 12. Confusion matrices of the different methods. (a) RCS Net; (b) Mobilenetvit_1; (c) Mobilenetvit_2; (d) TLPS Net; (e) ACM-FR Net; (f) GF-AM Net.
Figure 13. Visualization of the features extracted by the different networks. (a) RCS Net; (b) Mobilenetvit_1; (c) Mobilenetvit_2; (d) TLPS Net; (e) ACM-FR Net; (f) GF-AM Net.
Figure 14. Comparison of training curves in the ablation experiments.
Table 1. Micromotion parameters of the micromotion targets.
Targets     Initial Elevation Angle     Precession Frequency     Precession Angle
Target 1    31:1:40°                    2:0.2:4 Hz               4:0.15:5.5°
Target 2    31:1:40°                    1.5:0.2:3.5 Hz           2:0.15:3.5°
Target 3    31:1:40°                    1:0.2:3 Hz               3:0.15:4.5°
Target 4    31:1:40°                    0.5:0.2:2.5 Hz           1:0.15:2.5°
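The range notation in Table 1 follows the MATLAB-style start:step:stop convention with both endpoints included. Purely as an illustration of how such a parameter grid could be enumerated for echo simulation (the variable names and the use of numpy are assumptions, not the authors' code), one might write:

```python
import numpy as np
from itertools import product

# Hypothetical parameter grid for Target 1, read from Table 1 as
# start:step:stop ranges (endpoints included).
elevation_deg  = np.arange(31, 40 + 1e-9, 1)       # 31:1:40 degrees
precession_hz  = np.arange(2.0, 4.0 + 1e-9, 0.2)   # 2:0.2:4 Hz
precession_deg = np.arange(4.0, 5.5 + 1e-9, 0.15)  # 4:0.15:5.5 degrees

# Every combination defines one simulated observation geometry.
grid = list(product(elevation_deg, precession_hz, precession_deg))
print(len(grid))  # 10 x 11 x 11 = 1210 combinations for these ranges
```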
Table 2. Recognition accuracy by different methods.
Methods          Domain             Accuracy
RCS Net          RCS                0.7711
Mobilenetvit_1   TF                 0.9236
Mobilenetvit_2   CVD                0.8789
TLPS Net         TF and CVD         0.9165
ACM-FR Net       RCS, TF and CVD    0.9905
GF-AM Net        RCS, TF and CVD    0.9955
Table 3. Comparison of the recognition performance of unimodal and multimodal features.
Modality    Domain             Accuracy
Unimodal    RCS                0.7777
Unimodal    TF                 0.9310
Unimodal    CVD                0.9198
Bimodal     RCS and TF         0.9884
Bimodal     RCS and CVD        0.9860
Bimodal     TF and CVD         0.9351
Trimodal    RCS, TF and CVD    0.9955
Table 4. Comparison of the recognition accuracy of different fusion methods.
Method        Domain             Accuracy
Concat Net    RCS, TF and CVD    0.9636
Weight Net    RCS, TF and CVD    0.9888
GF-AM Net     RCS, TF and CVD    0.9955
Table 5. Comparison of the recognition accuracy by different methods under different SNRs.
Methods          Domain             0 dB      5 dB      10 dB     15 dB
RCS Net          RCS                0.5847    0.7711    0.8731    0.9050
Mobilenetvit_1   TF                 0.7959    0.9236    0.9314    0.9409
Mobilenetvit_2   CVD                0.7814    0.8789    0.9372    0.9421
TLPS Net         TF and CVD         0.8145    0.9165    0.9302    0.9335
ACM-FR Net       RCS, TF and CVD    0.9070    0.9905    0.9917    0.9926
GF-AM Net        RCS, TF and CVD    0.9661    0.9955    0.9975    0.9983
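Table 5 assesses robustness by degrading the echoes to different SNRs before recognition. For reference, a generic sketch of adding complex white Gaussian noise at a prescribed SNR is given below; the paper's exact noise model is not reproduced here, so treat the helper as a hypothetical illustration.

```python
import numpy as np

def add_awgn(echo, snr_db, rng=None):
    """Add complex white Gaussian noise to a complex radar echo so that the
    resulting SNR is approximately snr_db (generic sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(np.abs(echo) ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.sqrt(noise_power / 2.0) * (
        rng.standard_normal(echo.shape) + 1j * rng.standard_normal(echo.shape)
    )
    return echo + noise

# Example: degrade a simulated echo to 5 dB SNR before feature extraction.
# noisy = add_awgn(clean_echo, snr_db=5)
```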
Table 6. Comparison of GFLOPS computational cost across different methods.
Method           GFLOPS
RCS Net          0.01
Mobilenetvit_1   0.6
Mobilenetvit_2   0.6
TLPS Net         4.74
ACM-FR Net       0.29
GF-AM Net        0.52