Further author information: (Send correspondence to Sachini Wickramasinghe)
Sachini Wickramasinghe: shwickra@usc.edu
Dhruv Parikh: dhruvash@usc.edu
Bingyi Zhang: bingyizh@usc.edu
Rajgopal Kannan: rajgopal.kannan.civ@army.mil
Viktor Prasanna: prasanna@usc.edu
Carl Busart: carl.e.busart.civ@army.mil
VTR: An Optimized Vision Transformer for SAR ATR Acceleration on FPGA
Abstract
Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) is a key technique used in military applications like remote-sensing image recognition. Vision Transformers (ViTs) are the state of the art in various computer vision applications, outperforming Convolutional Neural Networks (CNNs). However, using ViTs for SAR ATR applications is challenging for two reasons: (1) standard ViTs require extensive training data to generalize well due to their low locality, whereas standard SAR datasets have a limited number of labeled training samples, which reduces the learning capability of ViTs; and (2) ViTs have a high parameter count and are computation intensive, which makes their deployment on resource-constrained SAR platforms difficult. In this work, we develop a lightweight ViT model that can be trained directly on small datasets without pre-training. To this end, we incorporate the Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) modules into the ViT model. We directly train this model on SAR datasets to evaluate its effectiveness for SAR ATR applications. The proposed model, VTR (ViT for SAR ATR), is evaluated on three widely used SAR datasets: MSTAR, SynthWakeSAR, and GBSAR. Experimental results show that the proposed VTR model achieves a classification accuracy of 95.96%, 93.47%, and 99.46% on the MSTAR, SynthWakeSAR, and GBSAR datasets, respectively. VTR achieves accuracy comparable to the state-of-the-art models on the MSTAR and GBSAR datasets with smaller model sizes. On the SynthWakeSAR dataset, VTR achieves a higher accuracy with a smaller model size. Further, a novel FPGA accelerator is proposed for VTR to enable real-time SAR ATR applications. Compared with the implementation of VTR on state-of-the-art CPU and GPU platforms, our FPGA implementation reduces latency on both. For inference on small batch sizes, our FPGA implementation achieves a higher throughput than the GPU.
keywords:
Synthetic Aperture Radar, Automatic Target Recognition, Vision Transformer
1 INTRODUCTION
Automatic Target Recognition (ATR) for Synthetic Aperture Radar (SAR) images is a broadly researched topic due to its diverse applications spanning from remote sensing to military surveillance [1]. Unlike optical sensors, SAR can capture high-resolution images irrespective of weather conditions or time (day/night). Recent advances in SAR imaging systems have led to images with resolutions as high as a few decimeters [2]. These advances, coupled with their any-circumstance imaging capability, have led to SAR-based sensor systems outperforming their optical counterparts while being more reliable. ATR for SAR comprises three distinct tasks: detection, discrimination and classification [3]. Detection involves identifying regions of interest within a SAR image to localize targets. Discrimination is the capability of an ATR algorithm to discard false alarms generated by the detection algorithms due to natural or artificial clutter. Classification involves accurately and precisely classifying detected targets within a SAR image [3]. Since the images generated by radar sensors differ significantly from the images generated by optical sensors [4], SAR ATR is a challenging problem to solve.
Recent advances in deep learning have revolutionized the field of SAR ATR [3]. Several works applying deep learning to SAR ATR utilize traditional Convolutional Neural Networks (CNNs) [5, 6] to classify SAR images, outperforming prior works. Graph Neural Network (GNN) based approaches [7, 8] have led to state-of-the-art performance for SAR ATR applications. GNNs drastically reduce the overall model parameters and inference latency, making them suitable for real-time applications. The Vision Transformer (ViT), introduced by Dosovitskiy et al. [9], has outperformed traditional CNN-based architectures across several tasks in the computer vision domain [10]. Several recent works [11, 12, 13] have employed ViTs across various SAR ATR applications. Chen et al. [11] utilize a multi-scale geospatial contextual attention network (MGCAN) on several SAR image chips, inspired by the Multi-Head Self-Attention mechanism in ViTs, and use it to perform object detection for aircraft. Liu et al. [12] utilize ViTs and CNNs to extract global and local features, respectively. The combined model led to improved classification accuracy when employed for image classification on high resolution (HR) SAR images. Wang et al. [13] also combine ViTs and CNNs into a module termed ConvT. ConvT was utilized for few-shot image classification on small-sized SAR datasets.
Despite ViTs being the state-of-the-art models for vision applications, significant challenges prevent their effective deployment for SAR ATR applications. (i) SAR ATR datasets are typically quite small. SAR image collection is an expensive endeavour. Thus, most available SAR datasets have a limited number of training instances. ViTs generally require a large amount of training data in order to generalize due to their limited locality inductive bias [9]. Thus, training a raw ViT on small-sized SAR datasets without pre-training on larger datasets becomes challenging. (ii) ViTs are computationally expensive, with a large memory footprint and associated model size. Since SAR ATR applications are driven by real-time constraints, it becomes imperative to optimize trained models for efficient real-time inference. The computational cost of ViTs is proportional to the square of the total input tokens [9]. For images with higher resolutions (such as SAR), this leads to an intractably high computation cost.
Prior works utilize ViTs pre-trained on large datasets such as ImageNet [14]. The pre-trained ViTs are then finetuned on small SAR ATR datasets [15, 12, 16]. Further, while several works accelerate SAR ATR applications [17, 18, 19] on FPGA, these works do not focus on optimizing the ViT architecture for SAR ATR.
To address the above challenges, we propose VTR, a novel ViT-based model for SAR ATR applications. VTR can be trained directly on small-sized SAR datasets, without pre-training on larger datasets. Furthermore, we propose a novel FPGA accelerator for low-latency and high-throughput SAR ATR. Our contributions are as follows:
• We propose a novel ViT model (VTR) equipped with the Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) modules [20] for SAR ATR. VTR improves the locality inductive bias of ViTs, allowing it to be trained directly on small SAR datasets and removing the overhead associated with pre-training.
• We propose a novel accelerator for the proposed VTR model on FPGA. We propose a Highly Parallel Processing Unit (HPPU) to fully exploit the compute parallelism within each layer (encoder) of a VTR.
• VTR achieves a higher classification accuracy on the SynthWakeSAR dataset with a smaller model size. On the MSTAR and GBSAR datasets, VTR achieves an accuracy comparable to the current state of the art with smaller model sizes.
• The proposed VTR FPGA accelerator reduces latency compared to state-of-the-art GPU and CPU platforms. For inference on small batch sizes, it improves throughput compared with the GPU.
2 BACKGROUND AND RELATED WORK
2.1 Deep Learning models for SAR ATR
Deep neural networks have attracted considerable interest, showcasing impressive results across various problem domains. The state-of-the-art works for SAR ATR applications involve either Convolutional Neural Networks (CNNs) or Graph Neural Networks (GNNs) [24]. Morgan [25] is the first work to propose a deep CNN for SAR ATR. Recently, Zhang et al. [7] proposed a novel architecture based on GNNs that exceeded a classification accuracy of 99% on the MSTAR dataset. In contrast to CNNs, the proposed GNN exploits the data sparsity in SAR images to reduce computation complexity. Although ViTs have emerged as state-of-the-art models for computer vision tasks, they perform poorly on small datasets due to severe overfitting. Hence, training a ViT model for SAR ATR applications without pre-training is challenging. Li et al. [26] proposed a pre-processing technique to improve the accuracy of the ViT model on SAR image classification. Recently, the Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) techniques have been proposed to improve the accuracy of ViTs on small datasets [20]. In this work, we propose the VTR model for SAR ATR, which incorporates the SPT and LSA modules into a vanilla ViT model. This enables training VTR on small SAR datasets without pre-training.
2.2 Vision Transformer
ViT[9] has achieved significant advancements in computer vision tasks like image classification and object detection. Its core components include Multi-Headed Self-Attention (MSA) and Multi-Layer Perceptron (MLP) blocks. First, the input image undergoes partitioning into non-overlapping patches. Then, the flattened patches are embedded along with the class token and positional information. Subsequently, this processed data is fed into the transformer encoder.
Multi-Headed Self-Attention: In the self-attention block, the input embeddings are linearly projected to query, key, and value vectors. The query and key vectors are then utilized to obtain a scaled dot product. A single self-attention operation is denoted as one "head". The self-attention function is defined as,
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (1)$$
where $Q$, $K$ and $V$ represent the query, key, and value matrix, respectively. $d_k$ is the dimension of embeddings in $K$. Several such heads are concatenated and projected via a linear layer to compute the final MSA output.
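As an illustration of eq. 1, the following is a minimal single-head sketch in plain Python. The function names (`softmax`, `matmul`, `attention`) and the list-of-rows matrix representation are assumptions made for this example only; they are not taken from the VTR implementation, which is written in PyTorch.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(q, k, v):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])                       # embedding dimension of the keys
    k_t = [list(col) for col in zip(*k)]  # K^T
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # one distribution per query token
    return matmul(weights, v), weights
```

When all keys are identical, every query attends uniformly, so each output row is the mean of the value rows. A multi-headed layer would run several such heads and concatenate their outputs before the final linear projection.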
Multi-Layer Perceptron: The output of the MSA layer is fed into the MLP block. It consists of two linear layers with an activation function. The MLP layer is formulated as,
$$\mathrm{MLP}(x) = \mathrm{GeLU}(x W_1 + b_1)\, W_2 + b_2 \qquad (2)$$
where $W_1$ and $b_1$ represent the hidden layer weights and bias, respectively. $W_2$ and $b_2$ are the output layer weights and bias, respectively. GeLU is the activation function.
In the standard ViT model, the receptive field of the input embeddings (input tokens) remains fixed and cannot be adjusted. That is, the tokenization of the standard ViT is similar to the operation of a non-overlapping convolutional layer [20]. As a result, these tokens have a small receptive field, leading to a lower locality inductive bias. To address this, SPT leverages spatial information by shifting the input image in four or eight directions. Additionally, since both the query and key are linearly projected from the same input tokens, the self-token relations tend to exhibit larger magnitudes than the inter-token relations. Consequently, the softmax function assigns relatively higher scores to self-token relations and smaller scores to inter-token relations. Hence, the attention scores of the standard ViT tend to be similar to each other regardless of inter-token relations. Moreover, the scaling factor can smooth the attention score distribution [27]. These issues can be effectively mitigated by incorporating LSA into the attention layer. LSA excludes self-token relations and applies a learnable temperature scale to the softmax function.
3 OVERVIEW
The overview of the proposed approach is detailed in Figure 1 and comprises two stages: (i) VTR Training and (ii) VTR Inference.
VTR Training
The proposed VTR model (Section 4) is trained on small SAR datasets, without pre-training. As such training is inexpensive, we train several VTR model variants by varying the model hyper-parameters, for each SAR dataset, to evaluate the performance of VTR model across each setting.
VTR Inference
Next, the trained VTR model is accelerated on an FPGA using a novel accelerator proposed in Section 5. The input to the accelerator is the shifted and tokenized input image, obtained after the image shifting, concatenation and tokenization steps of the SPT module. Specifically, the host processor shifts the input image along four directions (Section 4.2). Each shifted image, along with the original raw image, is concatenated along the channel axis to generate a shifted concatenated image. Finally, the shifted concatenated image is tokenized by partitioning it into several patches and flattening each patch into an embedding vector to generate the final shifted tokenized input for the FPGA accelerator, $X \in \mathbb{R}^{N \times D}$, where $X$ represents the tokenized input with $N$ tokens (total patches) and embedding dimension $D$ (size of a flattened patch).
4 MODEL DESIGN
The primary focus of our work is to implement a ViT that can effectively learn from small SAR datasets without any pre-training. In this Section, we present the architectural overview of the model. Then, we introduce the two modules: SPT and LSA[20]. The overall architecture is illustrated in Figure 2.
4.1 Overview of the Model
Our model first transforms the input image using SPT. Subsequently, both the original input image and the transformed images (shown in Figure 2) are concatenated and partitioned into non-overlapping patches. The flattened patches undergo layer normalization and linear projection to obtain patch embeddings which are concatenated with a learnable class embedding and then added to learnable positional embeddings to yield the final input tokens. This is then fed into the transformer encoder where the input is passed through several layers of multi-headed self-attention and multi-layer perceptron networks. In the multi-headed self-attention layers, our model incorporates the LSA module. The LSA module applies a learnable temperature scaling to the softmax function to sharpen the distribution of attention scores. Layer normalization is applied before every attention and MLP block while residual connections are applied after each block. Finally, the output of the transformer encoder is passed through an MLP head to generate the classification result.
4.2 SPT
In the SPT module, each input image, $x \in \mathbb{R}^{H \times W \times C}$, is shifted by a fixed number of pixels in the four diagonal directions: left-up, right-up, left-down, and right-down. These shifted images are cropped to the same size as the original input image and concatenated with it (Figure 2). The resulting set of images is then divided into non-overlapping patches and flattened into a sequence of vectors. Here, $H$, $W$ and $C$ represent the height, width and number of channels of the input image, respectively. The length of each flattened vector is determined by the patch size, the number of channels and the total number of shifted images. The patch embeddings (flattened vectors) are linearly transformed into a hidden dimension. A class token is concatenated to the linearly transformed patch embeddings and positional embeddings are added. The output of this SPT module is fed to the transformer encoder.
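The shifting, concatenation and patch flattening described above can be sketched as follows. This is a simplified illustration for a single-channel image stored as a list of rows; the helper names (`shift_image`, `shifted_patch_tokens`), the zero-filling of pixels uncovered by a shift, and the `shift` argument are assumptions made for this example, not details confirmed by the paper.

```python
def shift_image(img, dy, dx):
    """Shift a 2D image by (dy, dx) pixels, keeping the original size.

    Pixels that have no source after the shift are zero-filled (an assumption).
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            sr, sc = r - dy, c - dx
            if 0 <= sr < h and 0 <= sc < w:
                out[r][c] = img[sr][sc]
    return out


def shifted_patch_tokens(img, patch, shift):
    """Return one flattened vector per non-overlapping patch.

    Each vector interleaves the original image with its four diagonal shifts,
    which plays the role of concatenation along the channel axis.
    """
    h, w = len(img), len(img[0])
    # left-up, right-up, left-down, right-down
    offsets = [(-shift, -shift), (-shift, shift), (shift, -shift), (shift, shift)]
    channels = [img] + [shift_image(img, dy, dx) for dy, dx in offsets]
    tokens = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            vec = [ch[r][c]
                   for r in range(top, top + patch)
                   for c in range(left, left + patch)
                   for ch in channels]
            tokens.append(vec)
    return tokens
```

For a 4x4 image with 2x2 patches this yields 4 tokens of length 2 * 2 * 5 = 20. In the model, each such vector would then pass through layer normalization and a linear projection.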
4.3 LSA
The main components of the LSA module are diagonal masking and learnable temperature scaling. In the multi-headed self-attention layer, two linear projections of the input are generated: the query, $Q$, and the key, $K$. As formulated in eq. 1, the softmax function is applied to a scaled dot product of these linear projections. The diagonal of the resultant matrix of the dot product, $QK^{T}$, represents the self-token relations, while the off-diagonal elements represent the inter-token relations. Since the linear projections are from the same input tokens, the self-token relations tend to be larger, resulting in higher scores for self-token relations. To prevent this, the LSA module forces $-\infty$ on the diagonal elements. This effectively prevents the attention of a token from being focused on itself. The learnable temperature scaling technique incorporated in the LSA module allows the ViT to determine the softmax temperature by itself during training.
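The two LSA components can be illustrated with a short plain-Python sketch. The function name `lsa_weights` is hypothetical, and the temperature is passed in as an ordinary float for illustration; in the actual model it would be a parameter learned during training. The sketch assumes at least two tokens, since masking the only entry of a row would leave nothing to normalize.

```python
import math


def lsa_weights(q, k, temperature):
    """Attention weights with diagonal masking and temperature scaling.

    q and k are lists of token vectors. The dot products are divided by
    `temperature` (used in place of sqrt(d_k)), and self-token relations on
    the diagonal are masked with -inf before the softmax.
    """
    n = len(q)
    if n < 2:
        raise ValueError("diagonal masking needs at least two tokens")
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            if i == j:
                row.append(float("-inf"))  # exclude the self-token relation
            else:
                row.append(sum(a * b for a, b in zip(q[i], k[j])) / temperature)
        m = max(row)
        exps = [math.exp(s - m) for s in row]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

After the softmax, each diagonal weight is exactly zero, so a token's output is a weighted aggregate of the other tokens' value vectors only. A smaller temperature sharpens the remaining distribution.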
5 ACCELERATOR DESIGN
In this Section, we describe the designed hardware accelerator, which comprises two core compute modules: (i) the Highly Parallel Processing Unit (HPPU) and (ii) the Element-wise Compute Unit (ECU). Prior to their description, we briefly discuss the data layout (format) for the accelerator.
5.1 Data Layout
Figure 3 describes the data layout. In order to compute the matrix product $AB$, $A$ being the left matrix and $B$ the right matrix, the left matrix is stored in a row-major format and the right matrix is stored in a column-major format.
The matrices are divided into fixed-size blocks. The matrices are stored in a block-contiguous fashion, either in row-major format (left matrix) or column-major format (right matrix). The elements within each block are stored in a contiguous fashion.
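One way to picture the block-contiguous layout is as an address mapping from a matrix coordinate to a flat memory offset. The sketch below is an illustration under stated assumptions: square blocks of side `b`, matrix dimensions that are multiples of `b`, and row-major ordering of elements inside each block. The function name `block_offset` is hypothetical and none of these details are confirmed by the paper.

```python
def block_offset(r, c, rows, cols, b, block_order):
    """Flat offset of element (r, c) when b x b blocks are stored contiguously.

    block_order is "row" (blocks laid out row by row, as for the left matrix)
    or "col" (blocks laid out column by column, as for the right matrix).
    Elements inside a block are assumed to be stored row-major.
    """
    if rows % b or cols % b:
        raise ValueError("matrix dimensions must be multiples of the block size")
    br, bc = r // b, c // b                      # block coordinates
    blocks_per_row, blocks_per_col = cols // b, rows // b
    if block_order == "row":
        block_index = br * blocks_per_row + bc
    elif block_order == "col":
        block_index = bc * blocks_per_col + br
    else:
        raise ValueError("block_order must be 'row' or 'col'")
    return block_index * b * b + (r % b) * b + (c % b)
```

With this mapping, the blocks that a multiplication walks through along the shared inner dimension (a block row of the left matrix, a block column of the right matrix) sit next to each other in memory, which suits streaming them into the compute units.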
5.2 Compute Units
HPPU
The core computing unit is shown in Figure 4. In order to leverage the inherent compute parallelism along the attention heads within a transformer, the entire compute unit is divided into several Head Compute Units (HCUs). Each HCU computes an individual (attention) head. Further, each HCU contains a 2D mesh of processing elements (PEs). Thus, the total number of PEs in the HPPU is the number of HCUs multiplied by the number of PEs in the mesh of each HCU. This arrangement of PEs exploits compute parallelism across three levels of compute dimensions. PEs along the two axes of the mesh exploit compute parallelism within each head along the token and embedding dimensions, respectively. Instantiating several HCUs allows several heads to be computed simultaneously.
Each PE is organized as a grid of compute elements (systolic array) (shown in Figure 4). The Global Input Buffer (GIB) is utilized to store the input feature matrix. Each HCU has its own, individual Local Input Buffer (LIB). Data from the GIB is streamed (broadcast) to each HCU's LIB, allowing for simultaneous compute. The output results computed by each HCU are stored in Local Output Buffers (LOB) and are streamed out into a Global Output Buffer (GOB). The input weight matrix is stored in a Weight Buffer (WB). This WB is partitioned into several local buffers (banks), one for each column of computing PEs within an HCU.
The HPPU performs Dense Block-wise Matrix Multiplication (DBMM) on two dense matrices. The dense matrices are partitioned block-wise (Section 5.1) into fixed-size blocks. Each PE computes an output block of the output matrix.
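The DBMM operation can be sketched in software as a blocked matrix multiplication in which each (block row, block column) pair of the output is an independent work item, analogous to the output block that a PE computes. The function name `dbmm`, the square block shape and the sequential loop order are assumptions made for illustration; the hardware executes these work items in parallel on systolic arrays.

```python
def dbmm(a, b, block):
    """Dense block-wise multiplication of a (n x k) by b (k x m), as lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    if n % block or k % block or m % block:
        raise ValueError("all dimensions must be multiples of the block size")
    out = [[0.0] * m for _ in range(n)]
    for bi in range(0, n, block):              # output block row
        for bj in range(0, m, block):          # output block column: one work item
            for bk in range(0, k, block):      # accumulate over inner-dimension blocks
                for i in range(bi, bi + block):
                    for j in range(bj, bj + block):
                        acc = 0.0
                        for t in range(bk, bk + block):
                            acc += a[i][t] * b[t][j]
                        out[i][j] += acc
    return out
```

Because each output block only reads one block row of the left matrix and one block column of the right matrix, different output blocks can be assigned to different PEs without any communication between them.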
ECU
The ECU is structured identically to the HPPU (as in Figure 4). However, its main function is to perform element-wise computations (element-wise multiply and/or add) (Figure 5). The ECU can also perform element-wise non-linear activation (GELU) or exponentiation (exp). The ECU has dedicated buffers to support a general operation that combines a non-linear activation, an element-wise multiply and an element-wise add over equal-sized operand matrices stored per the layout in Section 5.1.
The input buffers of the ECU store the operand matrices, and a final buffer stores the computed result. Each PE computes an output block of the result through element-wise operations. This is in contrast to computing an output block of the result through block-wise matrix multiplication, as in the HPPU.
5.3 Compute Flow
In this Section, we describe the overall compute flow of the accelerator from Section 5.2 for computing the output for an input $X$. The input $X$ is shifted and tokenized (Section 3). The compute flow within each VTR encoder layer can be partitioned into two flows: (i) Multi-Headed Self-Attention (MSA) compute and (ii) Multi-Layer Perceptron (MLP) compute. Prior to describing the MSA and MLP compute, we first describe the generation of learned embeddings from the shifted tokenized input, $X$.
Embedding Generation
The shifted and tokenized input, $X \in \mathbb{R}^{N \times D}$, is transformed into learned embeddings as per eq. 3. Note that $N$ refers to the total tokens and $D$ refers to the raw embedding size.
$$Z = \mathrm{Concat}\left(x_{cls},\ \mathrm{Linear}(\mathrm{LN}(X))\right) \qquad (3)$$
In eq. 3, $x_{cls}$ refers to the learned class token embedding. $\mathrm{LN}$ stands for the layer normalization layer and $\mathrm{Linear}$ stands for a linear layer (a single layer MLP). The learned embeddings are obtained by performing layer normalization on the input matrix (along the embedding axis). The layer-normed embeddings, $\mathrm{LN}(X)$, are passed through a linear layer. Finally, the class embedding vector, $x_{cls}$, is concatenated to the output of the linear layer to generate the final embeddings, $Z$.
The layer normalization operation is performed using both the HPPU and the ECU. The mean and standard deviation of each embedding vector are computed via the ECU. The required aggregations are performed by the HPPU via a multiply by $\mathbf{1}$ operation (where $\mathbf{1}$ refers to a vector of 1's). The learned parameters of the layer normalization layer, $\gamma$ and $\beta$, are finally applied to $X$ as $\gamma \odot \frac{X - \mu}{\sigma} + \beta$ via the ECU. Here, $\mu$ and $\sigma$ are the mean and standard deviation vectors for the embeddings (token-wise), respectively.
The linear layer is mapped to the DBMM operation performed via the HPPU. Note that there is no direct notion of a 'head' associated with a simple linear layer. However, the output matrix can be partitioned into several 'fictitious heads' along the column axis. Then, each such 'fictitious head' is mapped to an HCU within the HPPU.
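The split of layer normalization between a matrix-multiply unit and an element-wise unit, described above, can be illustrated with the sketch below. Row sums are obtained by multiplying with a vector of ones (the aggregation step), while subtraction, squaring, scaling and shifting are element-wise. The function names and the small `eps` stabilizer are assumptions for this example; `eps` is common in software implementations but is not mentioned in the paper.

```python
import math


def matvec(mat, vec):
    """Matrix-vector product; with a ones vector this aggregates each row."""
    return [sum(x * y for x, y in zip(row, vec)) for row in mat]


def layer_norm_rows(x, gamma, beta, eps=1e-5):
    """Layer normalization of each row (token) of x along the embedding axis."""
    d = len(x[0])
    ones = [1.0] * d
    means = [s / d for s in matvec(x, ones)]                 # aggregation via multiply by 1
    sq_dev = [[(v - mu) ** 2 for v in row] for row, mu in zip(x, means)]  # element-wise
    variances = [s / d for s in matvec(sq_dev, ones)]        # second aggregation
    out = []
    for row, mu, var in zip(x, means, variances):
        std = math.sqrt(var + eps)
        # element-wise scale and shift with the learned parameters
        out.append([g * (v - mu) / std + b for v, g, b in zip(row, gamma, beta)])
    return out
```

Expressing the reductions as products with a ones vector means the same matrix-multiply datapath used for the linear layers can be reused for the aggregations, leaving only element-wise work for the other unit.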
MSA Compute
The MSA compute contains several stages, described below. The generated embedding matrix $Z$ is used to generate the $Q$, $K$ and $V$ matrices via the weights $W_Q$, $W_K$ and $W_V$, respectively. The $W_Q$, $W_K$ and $W_V$ weights are naturally partitioned per the total heads, $h$, associated with the multi-headed self-attention mechanism within the encoder. Thus, the matrices can be represented as follows,
$$Y = Z W_Y = Z\,[W_Y^{(1)}, \dots, W_Y^{(h)}] = [Y_1, \dots, Y_h], \quad Y \in \{Q, K, V\} \qquad (4)$$
In eq. 4, the output matrix $Y$ is split into several heads, with the output matrix of some head $i$, $Y_i = Z W_Y^{(i)}$, computed by an HCU via DBMM. Here, $Y$ refers to one of the $Q$, $K$ and $V$ matrices.
Next, the raw attention matrix for each head is computed as below,
$$S_i = Q_i K_i^{T} \qquad (5)$$
As suggested by eq. 5, the attention compute is performed by multiplying the query and key matrices of corresponding heads. Each such output matrix (for a head $i$) is computed by an HCU using DBMM. The softmax scores are computed using both the ECU and the HPPU. The ECU performs the scaling operations and exponentiation. The HPPU performs aggregations to generate the scales required for the softmax compute. To incorporate the LSA mechanism, a learned scaling factor $\tau$ (instead of the fixed $\sqrt{d_k}$ in eq. 1) is used to scale the raw attention matrix. Further, post-scaling (pre-softmax), the values in the attention matrix along the diagonal are set to arbitrarily large negative values. Post-softmax, the attention scores along the diagonal (in each head) thus become close to $0$. This effectively removes a token's own value vector when computing the weighted aggregate of value vectors with its attention scores. As a result, the tokens focus more on inter-token relations.
Finally, the processed attention matrix comprising the attention scores is used along with the value matrix as below,
$$O_i = \mathrm{softmax}\left(\mathcal{M}_i\left(\frac{S_i}{\tau}\right)\right) V_i \qquad (6)$$
In eq. 6, $\mathcal{M}_i$ refers to the operation of setting the values along the diagonal of a matrix (for the $i$-th head) to an arbitrarily large negative value close to $-\infty$.
The compute in eq. 6 is similar to that in eq. 5 and is performed by the HPPU through the DBMM operation. Finally, the projection matrix $W_O$ is utilized to compute the final MSA output as $\mathrm{MSA}(Z) = [O_1, \dots, O_h]\,W_O$. This compute is similar to computing the $Q$, $K$ and $V$ matrices and is also performed via the HPPU through the DBMM operation.
MLP Compute
The compute in the MLP stage of the encoder comprises two single-layer feed-forward neural networks. It is performed as DBMM via the HPPU, similar to the description in the MSA compute stage.
6 EVALUATION
6.1 Experimental Setting
We implement the VTR model using PyTorch 2.0.1[28] and utilize an NVIDIA RTX A6000 GPU with CUDA 11.8 for training the model. Several model variants are trained to comprehensively evaluate the VTR model’s robustness and generalization across different small-sized SAR datasets. Specifically, we vary the following hyper-parameters:
• Patch Size: We explore different patch sizes for partitioning the input image, to assess its impact on model performance.
• Hidden Dimension: The dimensionality of the hidden layers within the transformer encoder is varied to observe its effects on model learning.
• Depth: In our experiments, we explore varying depths of layers in the transformer encoder to assess the model's sensitivity to depth.
• Number of Heads: We use two settings of attention heads in the multi-headed self-attention mechanism to analyze its influence on model behavior.
We directly train the VTR model on three distinct SAR datasets: MSTAR [21], SynthWakeSAR [22], and GBSAR [23]. The three datasets represent a diverse range of SAR imaging scenarios and characteristics. The Adam optimization algorithm [29] is utilized for training, together with a step learning rate scheduler that decays the initial learning rate by a gamma factor at a fixed step size.
6.2 Hardware Implementation Details
The HPPU and ECU described in Section 5 are implemented on a state-of-the-art FPGA platform, the Xilinx Alveo U250. The main hyper-parameters of the proposed accelerator are the number of HCUs, the number of PE rows and PE columns within each HCU, and the size of the systolic array within each PE. The number of HCUs is chosen so that the accelerator fully utilizes the SLRs (Super Logic Regions) on the FPGA. The number of PE rows was selected based on the nominal input token blocks (associated with the input image). Using two PE columns within each HCU (in each SLR) allows both columns to concurrently access the FPGA BRAM/URAM through its dual-port memory. The systolic array size is selected to support any block size that is a multiple of it. The accelerator is designed using Xilinx High Level Synthesis (HLS) and synthesized using Xilinx Vitis v2022.2 with an achieved frequency of 300 MHz (Table 1).
6.3 Datasets
MSTAR: The original Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset consists of 5172 SAR images. It contains 10 categories of ground vehicles, with 2747 images in the training set and 2425 images in the testing set. For our experiments, we use an augmented MSTAR dataset which consists of 27,000 SAR images in the training set and 2425 images in the testing set. The images in this dataset are generated by cropping the images in the original MSTAR dataset in various directions.
SynthWakeSAR: SynthwakeSAR is a synthetic SAR imagery dataset comprising 92,160 images of 10 different real vessel models. There are 73,728 images in the training set and 18,432 images in the test set. Each image is of pixels. The images of each vessel contain the ship’s wakes.
GBSAR: Ground Based SAR (GBSAR) dataset consists of 6434 raw radar images of which 5147 images are in the training set and 1287 images are in the test set. This dataset captures 7 different ceramic cups with rubber objects. Each raw SAR image has a size of .
6.4 Platform Specification
We evaluate the performance of our accelerator for the SAR ATR applications against state-of-the-art CPU and GPU platforms, with the specifications of the platforms detailed in Table 1. Table 1 also contains specifications associated with the FPGA platform utilized for the hardware accelerator.
| | CPU | GPU | FPGA |
|---|---|---|---|
| Platform | | | Xilinx Alveo U250 |
| Frequency | 2.4 GHz | 915 MHz | 300 MHz |
| | 3.69 | 91.06 | 1.8 |
| | 384 MB | 96 MB | 36 MB |
| | 461 GB/s | 960 GB/s | 77 GB/s |
6.5 Performance Metrics
We evaluate our model using the following metrics: classification accuracy, total MACs (multiply and accumulate operations), model parameters, inference latency and throughput. Classification accuracy is measured as the ratio of images correctly classified by the model. MACs and model parameters characterize the computational complexity associated with a given model; MACs measures the total multiply-accumulate operations that a model performs to compute the output, model parameters are the total learnable parameters in a model. The inference latency is defined as the end-to-end latency (time) required for a model to compute the output given an input image, for a given platform. Finally, throughput is defined as the total images that can be processed on a given platform in a second, where processed refers to computing the output for an input image.
6.6 Experimental Results
The experimental results obtained across all three datasets are summarized in Tables 2-4. Considering the image size of the MSTAR and GBSAR datasets, our experiments involve patch sizes of 8 and 11, along with hidden dimension sizes of 44 and 88, respectively. However, as the image size in the SynthWakeSAR dataset is different, we opt for patch sizes of 8 and 16, accompanied by hidden dimension sizes of 96 and 48, respectively. The depth of the transformer encoder is set to 4, 6, 8, or 12, and the number of heads is configured to be 2 or 4. Each of these hyperparameter settings is employed during the training of the model to assess its performance across different configurations. Latency for the FPGA, GPU and CPU platforms is the end-to-end latency defined in Section 6.5, measured from the time that the input is sent to a processor to the time that the processor finishes computing the output. Throughput for the three platforms is defined as per Section 6.5, measuring the total images that can be processed in a given second. For latency comparison across the platforms, we utilize a batch size of 1. A batch size of 1 represents a real-time scenario for SAR ATR applications wherein the input images are continuously streaming. In order to compare the throughput of the platforms, results are computed over several batch sizes. Note that the FPGA throughput is nominally measured as the inverse of the per-image inference latency, and thus is invariant with respect to batch size.
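The relationship between the latency and throughput measurements described above can be written down in a few lines. The function names and the numbers in the usage note are hypothetical and are not measurements from the paper; the sketch only restates the definitions of Section 6.5.

```python
def streaming_throughput(per_image_latency_s):
    """Images per second when images are processed one at a time.

    This is the inverse of the per-image latency, so it does not depend on
    the batch size.
    """
    if per_image_latency_s <= 0:
        raise ValueError("latency must be positive")
    return 1.0 / per_image_latency_s


def batched_throughput(batch_size, batch_latency_s):
    """Images per second when a whole batch is processed in batch_latency_s."""
    if batch_size <= 0 or batch_latency_s <= 0:
        raise ValueError("batch size and latency must be positive")
    return batch_size / batch_latency_s


def speedup(baseline_latency_s, accelerated_latency_s):
    """Latency reduction factor of the accelerated platform over the baseline."""
    return baseline_latency_s / accelerated_latency_s
```

As a hypothetical example, a per-image latency of 0.25 s gives a streaming throughput of 4 images per second regardless of batch size, whereas a platform that needs 0.5 s for a batch of 8 reaches 16 images per second. This is why a batch-oriented platform can overtake a streaming accelerator as the batch grows.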
6.7 Discussion
Through the experiments, we observe that the performance of the VTR model varies across the three datasets for different hyperparameter settings. Notably, the best classification accuracies (highlighted in bold in Tables 2-4) for the MSTAR and SynthWakeSAR datasets (95.96% and 93.47% respectively) are achieved with similar hyperparameter configurations. However, for the GBSAR dataset, a lower depth size results in better performance. It can be concluded that a lower patch size results in a higher classification accuracy as smaller patch sizes are capable of capturing fine-grained details in the input SAR images. It is also worth noting that the SPT and LSA modules are crucial for the performance improvements achieved by the VTR model, as the standard ViT performs poorly on SAR datasets unless pre-trained on significantly larger datasets.
In comparison to the benchmarking conducted by Ye et al. [24] on the three datasets using CNNs, GNNs, and ViTs, our VTR model equipped with the SPT and LSA modules demonstrates improved or similar performance. In particular, our model outperforms the state of the art on the SynthWakeSAR dataset, achieving an accuracy of 93.47%. For the GBSAR dataset, we achieve a comparable classification accuracy of 99.46%. For the MSTAR dataset, however, GNNs outperform all other models. This is attributed to the fact that images in this dataset contain the actual target ground vehicle to be classified only within a few pixels in the center of the image. This limits VTR's capability to extract localized features, since most tokens are non-informative, despite the addition of the SPT and LSA modules. In contrast, GNNs excel at capturing this local information, leading to better performance. These comparison results are summarized in Table 5. We define the best-performing model as the model that achieves the highest classification accuracy. Table 6 compares the number of parameters in the best-performing models. In contrast to previous work, our model has fewer parameters, leading to a smaller model size. Compared against standard pre-trained ViTs, VTR is significantly smaller, with the largest VTR model having 1.37M parameters (Table 6).
The proposed FPGA accelerator, for the best-performing model across the three datasets, has an average speedup (based on single image inference latency) of and when compared to GPU and CPU platforms, respectively. For smaller models, the accelerator achieves a much higher speedup than this nominal speedup (which is associated with the larger models). Thus, it is highly suitable for deployment in real-time SAR ATR workloads. Figure 6 compares the throughput of the best-performing model (for each dataset) versus the baseline FPGA throughput (green line). For smaller batch sizes of and , the FPGA throughput is much higher than that of its CPU and GPU counterparts. With larger batch sizes, as expected, the GPU throughput overtakes the FPGA throughput. This suggests that, despite being optimized for streaming single image inference, our proposed FPGA accelerator can also be used for smaller batch sizes such as or with a throughput that is still, on average, that of the GPU throughput.
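The relationship between single-image latency, speedup, and batch throughput described above can be illustrated with a toy model. The sketch assumes that a streaming accelerator processes one image at a time at a fixed latency, while a GPU pays a fixed per-batch overhead plus a small per-image cost. All latency values and function names below are hypothetical placeholders chosen for illustration; they are not measurements from this work.

```python
def speedup(baseline_latency_ms: float, accelerator_latency_ms: float) -> float:
    """Latency speedup of the accelerator over a baseline platform."""
    return baseline_latency_ms / accelerator_latency_ms


def streaming_throughput(single_image_latency_ms: float) -> float:
    """Images per second when images are processed one after another."""
    return 1000.0 / single_image_latency_ms


def batched_throughput(batch_size: int, overhead_ms: float, per_image_ms: float) -> float:
    """Images per second when a batch costs overhead_ms + batch_size * per_image_ms."""
    batch_latency_ms = overhead_ms + per_image_ms * batch_size
    return 1000.0 * batch_size / batch_latency_ms


# Hypothetical latencies (NOT results from the paper).
FPGA_LATENCY_MS = 0.5
GPU_OVERHEAD_MS = 2.0
GPU_PER_IMAGE_MS = 0.05

gpu_single_ms = GPU_OVERHEAD_MS + GPU_PER_IMAGE_MS
print(f"single-image speedup over GPU: {speedup(gpu_single_ms, FPGA_LATENCY_MS):.2f}x")

fpga_tput = streaming_throughput(FPGA_LATENCY_MS)
for batch_size in (1, 2, 4, 8, 16, 32):
    gpu_tput = batched_throughput(batch_size, GPU_OVERHEAD_MS, GPU_PER_IMAGE_MS)
    faster = "FPGA" if fpga_tput > gpu_tput else "GPU"
    print(f"batch={batch_size:2d}  FPGA={fpga_tput:7.1f}  GPU={gpu_tput:7.1f}  faster={faster}")
```

In this toy model the streaming throughput is constant in the batch size, whereas the batched throughput grows as the fixed overhead is amortized, which reproduces the qualitative crossover at larger batch sizes.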
Table 5: Classification accuracy of VTR compared with prior models.

Dataset | VTR | ResNet18[30] | ResNet34[30] | ResNet50[30] | SS-ViT[20] | Multi-layer GNN[17] |
---|---|---|---|---|---|---|
MSTAR | 95.96% | 98.47% | 98.64% | 90.34% | 95.61% | 99.09% |
SynthWakeSAR | 93.47% | 90.30% | 92.14% | 92.42% | 87.98% | 91.15% |
GBSAR | 99.46% | 99.30% | 99.99% | 99.53% | 99.04% | 98.67% |

Table 6: Number of parameters in the best-performing models.

Dataset | Model | # Parameters |
---|---|---|
MSTAR | VTR | 1.16M |
MSTAR | Multi-layer GNN | 1.27M |
SynthWakeSAR | VTR | 1.37M |
SynthWakeSAR | ResNet50 | 23.5M |
GBSAR | VTR | 0.59M |
GBSAR | ResNet34 | 21.3M |
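The accuracy and model-size trade-off in Tables 5 and 6 can be read off with a few lines of arithmetic. The sketch below copies the table values verbatim and, for each dataset, reports the accuracy gap between VTR and the most accurate prior model together with the ratio of their parameter counts. The dictionary and function names are illustrative, and the derived gaps and ratios are computed here from the tables rather than quoted from the paper.

```python
# Classification accuracy in percent, copied from Table 5.
ACCURACY = {
    "MSTAR": {"VTR": 95.96, "ResNet18": 98.47, "ResNet34": 98.64,
              "ResNet50": 90.34, "SS-ViT": 95.61, "Multi-layer GNN": 99.09},
    "SynthWakeSAR": {"VTR": 93.47, "ResNet18": 90.30, "ResNet34": 92.14,
                     "ResNet50": 92.42, "SS-ViT": 87.98, "Multi-layer GNN": 91.15},
    "GBSAR": {"VTR": 99.46, "ResNet18": 99.30, "ResNet34": 99.99,
              "ResNet50": 99.53, "SS-ViT": 99.04, "Multi-layer GNN": 98.67},
}

# Parameter counts in millions, copied from Table 6.
PARAMS_M = {
    "MSTAR": {"VTR": 1.16, "Multi-layer GNN": 1.27},
    "SynthWakeSAR": {"VTR": 1.37, "ResNet50": 23.5},
    "GBSAR": {"VTR": 0.59, "ResNet34": 21.3},
}


def summarize(dataset: str) -> tuple[str, float, float]:
    """Return (best prior model, VTR accuracy gap in points, size ratio prior/VTR)."""
    prior = {m: a for m, a in ACCURACY[dataset].items() if m != "VTR"}
    best = max(prior, key=prior.get)
    gap = ACCURACY[dataset]["VTR"] - prior[best]
    ratio = PARAMS_M[dataset][best] / PARAMS_M[dataset]["VTR"]
    return best, round(gap, 2), round(ratio, 2)


for name in ACCURACY:
    best, gap, ratio = summarize(name)
    print(f"{name}: best prior model = {best}, "
          f"VTR gap = {gap:+.2f} points, prior/VTR size = {ratio:.2f}x")
```

On these table values, VTR is ahead by about one accuracy point on SynthWakeSAR with a model roughly 17 times smaller than ResNet50, and trails ResNet34 by about half a point on GBSAR with a model roughly 36 times smaller.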
7 CONCLUSION AND FUTURE WORK
In this paper, we developed VTR, a lightweight ViT model tailored for SAR ATR applications that addresses the challenges of limited training data and high computational complexity. Our experimental results demonstrated the effectiveness of the proposed model across diverse SAR datasets, achieving results better than or comparable to prior work. The proposed FPGA accelerator is suitable for real-time SAR ATR workloads with tight latency constraints, performing significantly better than state-of-the-art GPU and CPU platforms. In future work, we will explore multi-modal datasets, such as the EO-SAR dataset, for the SAR ATR application and how such multi-modality can be exploited to improve model performance. Furthermore, we will explore novel hybrid ViT and GNN architectures to overcome the performance limitations of purely ViT-based approaches on MSTAR-like data. To this end, we will study approaches that exploit the global inductive bias of ViTs and the local inductive bias of GNNs.
8 ACKNOWLEDGEMENT
This work is supported by the DEVCOM Army Research Lab (ARL) under grant W911NF2220159 and the National Science Foundation (NSF) under grants SPX-2333009 and SaTC-2104264. Equipment and support by AMD AECG are greatly appreciated.
Distribution Statement A: Approved for public release. Distribution is unlimited.
References
- [1] Tsokas, A., Rysz, M., Pardalos, P. M., and Dipple, K., “SAR data applications in earth observation: An overview,” Expert Systems with Applications 205, 117342 (2022).
- [2] Reigber, A., Scheiber, R., Jager, M., Prats-Iraola, P., Hajnsek, I., Jagdhuber, T., Papathanassiou, K. P., Nannini, M., Aguilera, E., Baumgartner, S., Horn, R., Nottensteiner, A., and Moreira, A., “Very-high-resolution airborne synthetic aperture radar imaging: Signal processing and applications,” Proceedings of the IEEE 101(3), 759–783 (2013).
- [3] Li, J., Yu, Z., Yu, L., Cheng, P., Chen, J., and Chi, C., “A comprehensive survey on SAR ATR in deep-learning era,” Remote Sensing 15(5) (2023).
- [4] Moreira, A., Prats-Iraola, P., Younis, M., Krieger, G., Hajnsek, I., and Papathanassiou, K. P., “A tutorial on synthetic aperture radar,” IEEE Geoscience and Remote Sensing Magazine 1(1), 6–43 (2013).
- [5] Ding, J., Chen, B., Liu, H., and Huang, M., “Convolutional neural network with data augmentation for SAR target recognition,” IEEE Geoscience and Remote Sensing Letters 13(3), 364–368 (2016).
- [6] Chen, S., Wang, H., Xu, F., and Jin, Y.-Q., “Target classification using the deep convolutional networks for SAR images,” IEEE Transactions on Geoscience and Remote Sensing 54(8), 4806–4817 (2016).
- [7] Zhang, B., Wijeratne, S., Kannan, R., Prasanna, V., and Busart, C., “Graph neural network for accurate and low-complexity SAR ATR,” arXiv preprint arXiv:2305.07119 (2023).
- [8] Wang, R., Wang, L., Wei, X., Chen, J.-W., and Jiao, L., “Dynamic graph-level neural network for SAR image change detection,” IEEE Geoscience and Remote Sensing Letters 19, 1–5 (2022).
- [9] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N., “An image is worth 16x16 words: Transformers for image recognition at scale,” (2021).
- [10] Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S., and Shah, M., “Transformers in vision: A survey,” ACM Computing Surveys 54, 1–41 (Jan. 2022).
- [11] Chen, L., Luo, R., Xing, J., Li, Z., Yuan, Z., and Cai, X., “Geospatial transformer is what you need for aircraft detection in SAR imagery,” IEEE Transactions on Geoscience and Remote Sensing 60, 1–15 (2022).
- [12] Liu, X., Wu, Y., Liang, W., Cao, Y., and Li, M., “High resolution SAR image classification using global-local network structure based on vision transformer and CNN,” IEEE Geoscience and Remote Sensing Letters 19, 1–5 (2022).
- [13] Wang, C., Huang, Y., Liu, X., Pei, J., Zhang, Y., and Yang, J., “Global in local: A convolutional transformer for SAR ATR FSL,” IEEE Geoscience and Remote Sensing Letters 19, 1–5 (2022).
- [14] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L., “ImageNet: A large-scale hierarchical image database,” in [2009 IEEE Conference on Computer Vision and Pattern Recognition ], 248–255 (2009).
- [15] Dong, H., Zhang, L., and Zou, B., “Exploring vision transformers for polarimetric SAR image classification,” IEEE Transactions on Geoscience and Remote Sensing 60, 1–15 (2022).
- [16] Zhou, Y., Jiang, X., Xu, G., Yang, X., Liu, X., and Li, Z., “PVT-SAR: An arbitrarily oriented SAR ship detector with pyramid vision transformer,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 16, 291–305 (2023).
- [17] Zhang, B., Kannan, R., Prasanna, V., and Busart, C., “Accurate, low-latency, efficient SAR automatic target recognition on FPGA,” in [2022 32nd International Conference on Field-Programmable Logic and Applications (FPL) ], 1–8 (2022).
- [18] Zhang, B., Kannan, R., Prasanna, V., and Busart, C., “Accelerating GNN-based SAR automatic target recognition on HBM-enabled FPGA,” in [2023 IEEE High Performance Extreme Computing Conference (HPEC) ], 1–7 (2023).
- [19] Fein-Ashley, J., Ye, T., Wickramasinghe, S., Zhang, B., Kannan, R., and Prasanna, V., “A single graph convolution is all you need: Efficient grayscale image classification,” arXiv preprint arXiv:2402.00564 (2024).
- [20] Lee, S. H., Lee, S., and Song, B. C., “Vision transformer for small-size datasets,” arXiv preprint arXiv:2112.13492 (2021).
- [21] “MSTAR dataset.” https://www.sdms.afrl.af.mil/index.php?collection=mstar. Accessed: 2024-03-27.
- [22] Rizaev, I. G. and Achim, A., “SynthWakeSAR: A synthetic SAR dataset for deep learning classification of ships at sea,” Remote Sensing 14(16), 3999 (2022).
- [23] Turčinović, F., Kačan, M., Bojanjac, D., and Bosiljevac, M., “Deep learning approach based on GBSAR data for detection of defects in packed objects,” in [2023 17th European Conference on Antennas and Propagation (EuCAP) ], 1–4 (2023).
- [24] Fein-Ashley, J., Ye, T., Kannan, R., Prasanna, V., and Busart, C., “Benchmarking deep learning classifiers for SAR automatic target recognition,” in [2023 IEEE High Performance Extreme Computing Conference (HPEC) ], 1–6, IEEE (2023).
- [25] Morgan, D. A., “Deep convolutional neural networks for ATR from SAR imagery,” in [Algorithms for Synthetic Aperture Radar Imagery XXII ], 9475, 116–128, SPIE (2015).
- [26] Li, S., Lang, P., Fu, X., Jiang, J., Dong, J., and Nie, Z., “Automatic target recognition of SAR images based on transformer,” in [2021 CIE International Conference on Radar (Radar) ], 938–941, IEEE (2021).
- [27] He, Y.-L., Zhang, X.-L., Ao, W., and Huang, J. Z., “Determining the optimal temperature parameter for softmax function in reinforcement learning,” Applied Soft Computing 70, 80–85 (2018).
- [28] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S., “PyTorch: An imperative style, high-performance deep learning library,” in [Advances in Neural Information Processing Systems ], Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., eds., 32, Curran Associates, Inc. (2019).
- [29] Kingma, D. P. and Ba, J., “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 (2014).
- [30] He, K., Zhang, X., Ren, S., and Sun, J., “Deep residual learning for image recognition,” in [Proceedings of the IEEE conference on computer vision and pattern recognition ], 770–778 (2016).