License: CC BY 4.0
arXiv:2404.04527v1 [cs.CV] 06 Apr 2024

Further author information: (Send correspondence to Sachini Wickramasinghe)
Sachini Wickramasinghe: shwickra@usc.edu
Dhruv Parikh: dhruvash@usc.edu
Bingyi Zhang: bingyizh@usc.edu
Rajgopal Kannan: rajgopal.kannan.civ@army.mil
Viktor Prasanna: prasanna@usc.edu
Carl Busart: carl.e.busart.civ@army.mil

VTR: An Optimized Vision Transformer for SAR ATR Acceleration on FPGA

Sachini Wickramasinghe, University of Southern California, Los Angeles, CA
Dhruv Parikh, University of Southern California, Los Angeles, CA
Bingyi Zhang, University of Southern California, Los Angeles, CA
Rajgopal Kannan, DEVCOM Army Research Office, Playa Vista, CA
Viktor Prasanna, University of Southern California, Los Angeles, CA
Carl Busart, DEVCOM Army Research Office, Playa Vista, CA
Abstract

Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) is a key technique used in military applications like remote-sensing image recognition. Vision Transformers (ViTs) are the state of the art in various computer vision applications, outperforming Convolutional Neural Networks (CNNs). However, using ViTs for SAR ATR applications is challenging for two reasons: (1) standard ViTs require extensive training data to generalize well due to their low locality, while standard SAR datasets contain a limited number of labeled training samples, which reduces the learning capability of ViTs; and (2) ViTs have a high parameter count and are computation intensive, which makes their deployment on resource-constrained SAR platforms difficult. In this work, we develop a lightweight ViT model that can be trained directly on small datasets without pre-training. To this end, we incorporate the Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) modules into the ViT model. We directly train this model on SAR datasets to evaluate its effectiveness for SAR ATR applications. The proposed model, VTR (ViT for SAR ATR), is evaluated on three widely used SAR datasets: MSTAR, SynthWakeSAR, and GBSAR. Experimental results show that the proposed VTR model achieves a classification accuracy of 95.96%, 93.47%, and 99.46% on the MSTAR, SynthWakeSAR, and GBSAR datasets, respectively. VTR achieves accuracy comparable to the state-of-the-art models on the MSTAR and GBSAR datasets with $1.1\times$ and $36\times$ smaller model sizes, respectively. On the SynthWakeSAR dataset, VTR achieves a higher accuracy with a model size that is $17\times$ smaller. Further, a novel FPGA accelerator is proposed for VTR to enable real-time SAR ATR applications. Compared with the implementation of VTR on state-of-the-art CPU and GPU platforms, our FPGA implementation reduces latency by a factor of $70\times$ and $30\times$, respectively. For inference on small batch sizes, our FPGA implementation achieves a $2\times$ higher throughput compared with the GPU.

keywords:
Synthetic Aperture Radar, Automatic Target Recognition, Vision Transformer
*: Equal contribution

1 INTRODUCTION

Automatic Target Recognition (ATR) for Synthetic Aperture Radar (SAR) images is a broadly researched topic due to its diverse applications spanning from remote sensing to military surveillance [1]. Unlike optical sensors, SAR can capture high-resolution images irrespective of weather conditions or time of day. Recent advances in SAR imaging systems have led to images with resolutions as fine as a few decimeters [2]. These advances, coupled with this all-weather, day-and-night imaging capability, have led to SAR-based sensor systems outperforming their optical counterparts while being more reliable. ATR for SAR comprises three distinct tasks: detection, discrimination and classification [3]. Detection involves identifying regions of interest within a SAR image to localize targets. Discrimination involves discarding false alarms generated by the detection algorithms due to natural or artificial clutter. Classification involves accurately and precisely classifying detected targets within a SAR image [3]. Since the images generated by radar sensors differ significantly from the images generated by optical sensors [4], SAR ATR is a challenging problem.

Recent advances in deep learning have revolutionized the field of SAR ATR [3]. Several works applying deep learning to SAR ATR use traditional Convolutional Neural Networks (CNNs) [5, 6] to classify SAR images, outperforming prior works. Graph Neural Network (GNN) based approaches [7, 8] have led to state-of-the-art performance for SAR ATR applications. GNNs drastically reduce the overall model parameters and inference latency, making them suitable for real-time applications. The Vision Transformer (ViT), introduced by Dosovitskiy et al. [9], has outperformed traditional CNN-based architectures across several tasks in the computer vision domain [10]. Several recent works [11, 12, 13] have employed ViTs across various SAR ATR applications. Chen et al. [11] apply a multi-scale geospatial contextual attention network (MGCAN), inspired by the Multi-Head Self-Attention mechanism in ViTs, to several SAR image chips in order to perform object detection for aircraft. Liu et al. [12] utilize ViTs and CNNs to extract global and local features, respectively. The combined model improved classification accuracy on high-resolution (HR) SAR images. Wang et al. [13] also combine ViTs and CNNs into a module termed ConvT, which was used for few-shot image classification on small SAR datasets.

Despite ViTs being the state-of-the-art models for vision applications, significant challenges prevent their effective deployment for SAR ATR applications. (i) SAR ATR datasets are typically quite small. SAR image collection is expensive, so most available SAR datasets have a limited number of training instances. ViTs generally require a large amount of training data in order to generalize, due to their limited locality inductive bias [9]. Thus, training a raw ViT on small SAR datasets without pre-training on larger datasets is challenging. (ii) ViTs are computationally expensive, with a large memory footprint and model size. Since SAR ATR applications are driven by real-time constraints, it becomes imperative to optimize trained models for efficient real-time inference. The computational cost of ViTs is proportional to the square of the total number of input tokens [9]. For images with higher resolutions (such as SAR), this leads to an intractably high computation cost.

Prior works utilize ViTs pre-trained on large datasets such as ImageNet [14]. The pre-trained ViTs are then finetuned on small SAR ATR datasets [15, 12, 16]. Further, while several works accelerate SAR ATR applications [17, 18, 19] on FPGA, these works do not focus on optimizing the ViT architecture for SAR ATR.

To address the above challenges, we propose VTR, a novel ViT-based model for SAR ATR applications. VTR can be trained directly on small SAR datasets, without pre-training on larger datasets. Furthermore, we propose a novel FPGA accelerator for low-latency and high-throughput SAR ATR. Our contributions are as follows:

  • We propose a novel ViT model (VTR) equipped with the Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) modules [20] for SAR ATR. These modules improve the locality inductive bias of the ViT, allowing VTR to be trained directly on small SAR datasets and removing the overhead associated with pre-training.

  • We propose a novel accelerator for the proposed VTR model on FPGA. We propose a Highly Parallel Processing Unit (HPPU) to fully exploit the compute parallelism within each layer (encoder) of a VTR.

  • We comprehensively evaluate the performance of several configurations of VTR across the MSTAR [21], SynthWakeSAR [22] and GBSAR [23] datasets. The performance of a model is characterized via classification accuracy, model size, computation complexity, and inference latency and throughput.

  • VTR achieves a higher classification accuracy on the SynthWakeSAR dataset with a $17\times$ smaller model size. On the MSTAR and GBSAR datasets, VTR achieves an accuracy comparable to the current state of the art, with $1.1\times$ and $36\times$ smaller model sizes, respectively.

  • The proposed VTR FPGA accelerator reduces latency by $30\times$ and $70\times$ compared to state-of-the-art GPU and CPU platforms, respectively. For inference on small batch sizes, it improves throughput by $2\times$ compared with the GPU.

2 BACKGROUND AND RELATED WORK

2.1 Deep Learning models for SAR ATR

Deep neural networks have attracted considerable interest, showing impressive results across various problem domains. The state-of-the-art works for SAR ATR applications involve either Convolutional Neural Networks (CNNs) or Graph Neural Networks (GNNs) [24]. Morgan [25] is the first work to propose a deep CNN for SAR ATR. Recently, Zhang et al. [7] proposed a novel GNN-based architecture that exceeded a classification accuracy of 99% on the MSTAR dataset. In contrast to CNNs, the proposed GNN exploits the data sparsity in SAR images to reduce computation complexity. Although ViTs have emerged as state-of-the-art models for computer vision tasks, they perform poorly on small datasets due to severe overfitting. Hence, training a ViT model for SAR ATR applications without pre-training is challenging. Li et al. [26] proposed a pre-processing technique to improve the accuracy of the ViT model on SAR image classification. Recently, the Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) techniques have been proposed to improve the accuracy of ViTs on small datasets [20]. In this work, we propose the VTR model for SAR ATR, which incorporates the SPT and LSA modules into a vanilla ViT model. This enables training VTR on small SAR datasets without pre-training.

2.2 Vision Transformer

ViT[9] has achieved significant advancements in computer vision tasks like image classification and object detection. Its core components include Multi-Headed Self-Attention (MSA) and Multi-Layer Perceptron (MLP) blocks. First, the input image undergoes partitioning into non-overlapping patches. Then, the flattened patches are embedded along with the class token and positional information. Subsequently, this processed data is fed into the transformer encoder.

Multi-Headed Self-Attention: In the self-attention block, the input embeddings are linearly projected to query, key, and value vectors. The query and key vectors are then used to compute a scaled dot product, which is passed through a softmax to weight the value vectors. A single self-attention operation is denoted as one "head". The self-attention function is defined as,

$$\text{Attention}(\bm{Q},\bm{K},\bm{V})=\text{softmax}\left(\frac{\bm{Q}\bm{K}^{T}}{\sqrt{d_{k}}}\right)\bm{V} \tag{1}$$

where $\bm{Q}$, $\bm{K}$ and $\bm{V}$ represent the query, key, and value matrices, respectively, and $d_{k}$ is the dimension of the embeddings in $\bm{K}$. Several such heads are concatenated and projected via a linear layer to compute the final MSA output.
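
As a rough illustration of eq. 1, the following sketch computes single-head scaled dot-product attention in plain Python. The helper names (`softmax`, `matmul`, `transpose`, `attention`) and the toy matrices are assumptions made for this example only and are not taken from the VTR implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Dense product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def attention(q, k, v):
    """Single-head scaled dot-product attention, following eq. 1."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]
    return matmul(weights, v), weights

# Toy input: 3 tokens with d_k = 2 (illustrative values only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = attention(Q, K, V)
```

Each row of `weights` sums to one, so every output row is a convex combination of the rows of `V`.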

Multi-Layer Perceptron: The output of the MSA layer is fed into the MLP block. It consists of two linear layers with an activation function. The MLP layer is formulated as,

$$\text{MLP}(\bm{x})=\text{GeLU}(\bm{x}\bm{W}_{1}+\bm{b}_{1})\bm{W}_{2}+\bm{b}_{2} \tag{2}$$

where $\bm{W}_{1}$ and $\bm{b}_{1}$ represent the hidden layer weights and bias, respectively, and $\bm{W}_{2}$ and $\bm{b}_{2}$ are the output layer weights and bias, respectively. GeLU is the activation function.
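
Eq. 2 can be sketched for a single token vector as follows. This is a minimal illustration that assumes the exact (erf-based) form of GeLU; the function names, dimensions and weight values are hypothetical.

```python
import math

def gelu(x):
    """Exact GeLU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, w, b):
    """Row vector x times weight matrix w (in_dim x out_dim), plus bias b."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP of eq. 2: GeLU(x W1 + b1) W2 + b2."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)

# Illustrative sizes: embedding dimension 2, hidden dimension 3.
W1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.5, -0.5]
y = mlp([1.0, 2.0], W1, b1, W2, b2)
```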

In the standard ViT model, the receptive field of the input embeddings (input tokens) remains fixed and cannot be adjusted. That is, the tokenization of the standard ViT is similar to the operation of a non-overlapping convolutional layer [20]. As a result, these tokens have a small receptive field, leading to a weaker locality inductive bias. To address this, SPT leverages spatial information by shifting the input image in four or eight directions. Additionally, since both the query and key are linearly projected from the same input tokens, the self-token relations tend to exhibit larger magnitudes than inter-token relations. Consequently, the softmax function assigns relatively higher scores to self-token relations and smaller scores to inter-token relations. Hence, the attention distributions of the tokens in a standard ViT tend to be similar to each other regardless of inter-token relations. Moreover, the scaling factor $\sqrt{d_{k}}$ can smooth the attention score distribution [27]. These issues can be mitigated by incorporating LSA into the attention layer. LSA excludes self-token relations and applies a learnable temperature scale to the softmax function.

3 OVERVIEW

The overview of the proposed approach is shown in Figure 1 and comprises two stages: (i) VTR Training and (ii) VTR Inference.

VTR Training

The proposed VTR model (Section 4) is trained on small SAR datasets, without pre-training. As such training is inexpensive, we train several VTR model variants by varying the model hyper-parameters, for each SAR dataset, to evaluate the performance of VTR model across each setting.

Figure 1: Overview

VTR Inference

Next, the trained VTR model is accelerated on an FPGA using the novel accelerator proposed in Section 5. The input to the accelerator is the shifted and tokenized input image, i.e., the image after the shifting, concatenation and tokenization steps of the SPT module have been applied. Specifically, for an image $\bm{x}\in\mathbb{R}^{H\times W\times C}$, the host processor shifts the input image along the four diagonal directions (Section 4.2). The shifted images, along with the original raw image, are concatenated along the channel axis to generate a shifted concatenated image, $S(\bm{x})\in\mathbb{R}^{H\times W\times(N_{s}+1)C}$. Here, $S(\cdot)$ represents the shifting module and $N_{s}$ represents the number of shifted images. Finally, $S(\bm{x})$ is tokenized by partitioning it into several patches and flattening each patch into an embedding vector to generate the final shifted tokenized input for the FPGA accelerator, $S(\bm{x})\xrightarrow{\text{tokenize}}\bm{X}$, where $\bm{X}\in\mathbb{R}^{N\times D}$ represents the tokenized input with $N$ tokens (total patches) and embedding dimension $D$ (size of a flattened patch).

4 MODEL DESIGN

The primary focus of our work is to implement a ViT that can effectively learn from small SAR datasets without any pre-training. In this Section, we present the architectural overview of the model. Then, we introduce the two modules: SPT and LSA[20]. The overall architecture is illustrated in Figure 2.

Figure 2: Model Architecture

4.1 Overview of the Model

Our model first transforms the input image using SPT. Subsequently, both the original input image and the transformed images (shown in Figure 2) are concatenated and partitioned into non-overlapping patches. The flattened patches undergo layer normalization and linear projection to obtain patch embeddings which are concatenated with a learnable class embedding and then added to learnable positional embeddings to yield the final input tokens. This is then fed into the transformer encoder where the input is passed through several layers of multi-headed self-attention and multi-layer perceptron networks. In the multi-headed self-attention layers, our model incorporates the LSA module. The LSA module applies a learnable temperature scaling to the softmax function to sharpen the distribution of attention scores. Layer normalization is applied before every attention and MLP block while residual connections are applied after each block. Finally, the output of the transformer encoder is passed through an MLP head to generate the classification result.

4.2 SPT

In the SPT module, each input image, $\bm{x}\in\mathbb{R}^{H\times W\times C}$, is shifted by $2$ pixels in the four diagonal directions: left-up, right-up, left-down, and right-down. These shifted images are cropped to the same size as the original input image and concatenated with it (Figure 2). The resulting set of images is then divided into $N$ non-overlapping patches and flattened into a sequence of vectors, $\bm{x}^{i}\in\mathbb{R}^{P^{2}\cdot C\cdot(N_{s}+1)}$. $H$, $W$, and $C$ represent the height, width and number of channels of the input image, respectively. In our case, $C=1$. $\bm{x}^{i}$ is the $i^{th}$ flattened vector, $P$ is the patch size and $N_{s}$ is the total number of shifted images. The patch embeddings (flattened vectors) are linearly transformed into a hidden dimension, $d_{s}$. A class token is concatenated to the linearly transformed patch embeddings and positional embeddings are added. The output of this SPT module is fed to the transformer encoder.
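
The shift, concatenate and tokenize steps described above can be sketched for a single-channel image as follows. This is only an illustration under stated assumptions: vacated pixels are zero-filled, the image height and width are divisible by the patch size, and the names `shift` and `spt_tokens` are hypothetical. The subsequent layer normalization, linear projection, class token and positional embeddings are omitted.

```python
def shift(img, dy, dx):
    """Shift a 2D image by (dy, dx) pixels, zero-filling and keeping the original size."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            sr, sc = r - dy, c - dx
            if 0 <= sr < h and 0 <= sc < w:
                out[r][c] = img[sr][sc]
    return out

def spt_tokens(img, patch, s=2):
    """Return N flattened patches, each of length patch * patch * (Ns + 1), for C = 1."""
    # Diagonal shifts: left-up, right-up, left-down, right-down.
    offsets = [(-s, -s), (-s, s), (s, -s), (s, s)]
    channels = [img] + [shift(img, dy, dx) for dy, dx in offsets]
    h, w = len(img), len(img[0])
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            vec = [ch[top + i][left + j]
                   for ch in channels
                   for i in range(patch)
                   for j in range(patch)]
            tokens.append(vec)
    return tokens

# Toy 8x8 image with 4x4 patches (sizes are illustrative only).
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
tokens = spt_tokens(image, patch=4)
```

With $P=4$, $C=1$ and $N_{s}=4$, each of the $N=4$ tokens has $P^{2}\cdot C\cdot(N_{s}+1)=80$ entries, matching the vector length given in the text.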

4.3 LSA

The main components of the LSA module are diagonal masking and learnable temperature scaling. In the multi-headed self-attention layer, two linear projections of the input are generated: the query, $\bm{Q}=\bm{X}\bm{W}_{q}$, and the key, $\bm{K}=\bm{X}\bm{W}_{k}$. As formulated in eq. 1, the softmax function is applied to a scaled dot product of these linear projections. The diagonal of the resulting matrix, $\bm{Q}\bm{K}^{T}$, represents the self-token relations, while the off-diagonal elements represent the inter-token relations. Since both projections come from the same input tokens, the self-token relations tend to be larger, resulting in higher scores for self-token relations. To prevent this, the LSA module sets the diagonal elements to $-\infty$ before the softmax. This prevents each token from attending to itself. The learnable temperature scaling in the LSA module allows the ViT to determine the softmax temperature by itself during training.
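
A minimal sketch of the two LSA components follows, assuming a single head and at least two tokens. The temperature is passed as a plain float here, whereas in training it would be a learned parameter; the function name `lsa_weights` and the toy inputs are hypothetical.

```python
import math

def lsa_weights(q, k, temperature):
    """Attention weights with diagonal masking and a temperature in place of sqrt(d_k)."""
    n = len(q)
    rows = []
    for i in range(n):
        scores = []
        for j in range(n):
            if i == j:
                scores.append(float("-inf"))  # mask the self-token relation
            else:
                dot = sum(a * b for a, b in zip(q[i], k[j]))
                scores.append(dot / temperature)
        m = max(scores)  # finite as long as there are at least two tokens
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

# Toy query/key matrices; the temperature value is an arbitrary starting point.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = lsa_weights(Q, K, temperature=math.sqrt(2.0))
```

After the softmax, the diagonal entries are exactly zero in this sketch, so each token's output depends only on the other tokens.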

5 ACCELERATOR DESIGN

In this Section, we describe the designed hardware accelerator, which comprises two core compute modules: (i) Highly Parallel Processing Unit (HPPU) and (ii) Element-wise Compute Unit (ECU). Prior to their description, we briefly discuss the data layout (format) for the accelerator.

5.1 Data Layout

Figure 3 describes the data layout. To compute the matrix product $\bm{A}\bm{B}$, with $\bm{A}$ the left matrix and $\bm{B}$ the right matrix, $\bm{A}$ is stored in a row-major format and $\bm{B}$ is stored in a column-major format.

Figure 3: Data Layout

The matrices are divided into blocks of size $b\times b$. The matrices are stored in a block-contiguous fashion, either in row-major format (left matrix) or column-major format (right matrix). The elements within each block are stored in a contiguous fashion.
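
One possible reading of this layout is sketched below: blocks are emitted in row-major order for the left matrix and in column-major order for the right matrix. The text does not specify the element order inside a block, so row-by-row order is assumed here, as is divisibility of the matrix dimensions by $b$. The function name `block_layout` is hypothetical.

```python
def block_layout(mat, b, block_order):
    """Flatten `mat` into b x b blocks; blocks ordered 'row' or 'col' major."""
    rows, cols = len(mat), len(mat[0])
    br, bc = rows // b, cols // b
    if block_order == "row":
        order = [(i, j) for i in range(br) for j in range(bc)]
    else:
        order = [(i, j) for j in range(bc) for i in range(br)]
    flat = []
    for bi, bj in order:
        for r in range(b):  # elements inside a block: row by row (assumed)
            flat.extend(mat[bi * b + r][bj * b:(bj + 1) * b])
    return flat

# 4x4 toy matrix with 2x2 blocks.
M = [[r * 4 + c for c in range(4)] for r in range(4)]
left = block_layout(M, 2, "row")   # layout for a left matrix
right = block_layout(M, 2, "col")  # layout for a right matrix
```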

5.2 Compute Units

HPPU

The core computing unit is shown in Figure 4. In order to leverage the inherent compute parallelism along the attention heads within a transformer, the entire compute unit is divided into several Head Compute Units (HCUs). Each HCU computes an individual (attention) head. Further, each HCU contains a 2D mesh of processing elements (PEs). Thus, for a total of $p_{h}$ HCUs, each comprising $p_{t}\times p_{c}$ PEs, the total number of PEs in the HPPU is $p_{h}\times p_{t}\times p_{c}$. This arrangement of PEs exploits compute parallelism across three compute dimensions. PEs along the $p_{t}$ and $p_{c}$ axes exploit compute parallelism within each head along the token and embedding dimensions, respectively. Having $p_{h}$ HCUs, each with $p_{t}\times p_{c}$ PEs, allows $p_{h}$ heads to be computed simultaneously.

Figure 4: Block diagram of the proposed HPPU

Each PE is organized as a $p_{pe}\times p_{pe}$ grid of compute elements (a systolic array), as shown in Figure 4. The Global Input Buffer (GIB) is utilized to store the input feature matrix. Each HCU has its own Local Input Buffer (LIB). Data from the GIB is streamed (broadcast) to each HCU's LIB, allowing for simultaneous compute. The output results computed by each HCU are stored in Local Output Buffers (LOB) and are streamed out into a Global Output Buffer (GOB). The input weight matrix is stored in a Weight Buffer (WB). This WB is partitioned into several local buffers (banks), one for each column of computing PEs within an HCU.

The HPPU performs Dense Block-wise Matrix Multiplication (DBMM) on two dense matrices. The dense matrices are partitioned block-wise (Section 5.1) into blocks of size $b\times b$. Each PE computes one output block of the output matrix.
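
A software analogue of DBMM is sketched below. Each `(bi, bj)` pair corresponds to one output block, which is the unit of work the text assigns to a PE. The loops here run sequentially, whereas the hardware would process blocks in parallel. Matrix dimensions are assumed to be divisible by the block size, and the name `dbmm` is hypothetical.

```python
def dbmm(a, b, blk):
    """Block-wise dense product: each (bi, bj) output block is one unit of work."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for bi in range(0, n, blk):          # output block row
        for bj in range(0, m, blk):      # output block column
            for bk in range(0, k, blk):  # accumulate over the shared dimension
                for i in range(bi, bi + blk):
                    for j in range(bj, bj + blk):
                        out[i][j] += sum(a[i][t] * b[t][j]
                                         for t in range(bk, bk + blk))
    return out

A = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0],
     [13.0, 14.0, 15.0, 16.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
C = dbmm(A, B, blk=2)
```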

ECU

The ECU is structured identically to the HPPU (as in Figure 4). However, its main function is to perform element-wise computations (element-wise multiply and/or add) (Figure 5). The ECU can also perform element-wise non-linear activation (GeLU) or exponentiation (exp). The ECU has a total of $4$ buffers to support a general operation of the form $f(\bm{A}\bigodot\bm{B}\bigoplus\bm{C})$. Here, $f(\cdot)$ represents the non-linear activation, $\bigodot$ represents element-wise multiply and $\bigoplus$ represents element-wise add. $\bm{A}$, $\bm{B}$ and $\bm{C}$ are equal-sized matrices stored per the layout in Section 5.1.

Figure 5: Element-wise Compute Unit (ECU) Operation

Three of the ECU buffers store the matrices $\bm{A}$, $\bm{B}$ and $\bm{C}$. The fourth buffer stores the computed result. Each PE computes an output block of the result through element-wise operations, in contrast to the HPPU, where each PE computes an output block through block-wise matrix multiplication.
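
The general ECU operation $f(\bm{A}\bigodot\bm{B}\bigoplus\bm{C})$ can be written compactly in software. The sketch below is illustrative only: the name `ecu` and the convention that `f=None` means identity are assumptions.

```python
import math

def ecu(a, b, c, f=None):
    """Element-wise f(a * b + c) over equally sized matrices; f=None means identity."""
    fn = f if f is not None else (lambda v: v)
    return [[fn(x * y + z) for x, y, z in zip(ra, rb, rc)]
            for ra, rb, rc in zip(a, b, c)]

# Plain multiply-add, then the same operation followed by exponentiation.
fused = ecu([[1.0, 2.0], [3.0, 4.0]],
            [[2.0, 2.0], [2.0, 2.0]],
            [[1.0, 1.0], [1.0, 1.0]])
exped = ecu([[0.0, 1.0]], [[2.0, 3.0]], [[0.0, -3.0]], f=math.exp)
```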

5.3 Compute Flow

In this Section, we describe the overall compute flow of the accelerator (Section 5.2) to compute the output for an input $\bm{X}$. The input $\bm{X}$ is shifted and tokenized (Section 3). The compute flow within each VTR encoder layer can be partitioned into two flows: (i) Multi-Headed Self-Attention (MSA) compute and (ii) Multi-Layer Perceptron (MLP) compute. Prior to describing the MSA and MLP compute, we first describe the generation of learned embeddings from the shifted tokenized input, $\bm{X}$.

Embedding Generation

The shifted and tokenized input, $\bm{X}\in\mathbb{R}^{N\times D^{\prime}}$, is transformed into learned embeddings as per eq. 3. Note that $N$ refers to the total number of tokens and $D^{\prime}$ refers to the raw embedding size.

$$\bm{Z}=\text{concat}(\bm{x}_{\text{CLS}},\text{Linear}(\text{LN}(\bm{X}))) \tag{3}$$

In eq. 3, $\bm{x}_{\text{CLS}}$ refers to the learned class token embedding, $\text{LN}(\cdot)$ denotes a layer normalization layer and $\text{Linear}(\cdot)$ denotes a linear layer (a single-layer MLP). To obtain the learned embeddings $\bm{Z}\in\mathbb{R}^{(N+1)\times D}$, layer normalization is first performed on the input matrix $\bm{X}$ (along the embedding axis). The layer-normed embeddings, $\text{LN}(\bm{X})$, are then passed through a linear layer. Finally, the class embedding vector, $\bm{x}_{\text{CLS}}$, is concatenated to the output of the linear layer to generate the final embeddings, $\bm{Z}$.

Layer normalization is performed using both the HPPU and the ECU. The mean and standard deviation of each embedding vector are computed via the ECU. The aggregations required for this are performed by the HPPU via a multiply-by-$\bm{1}$ operation (where $\bm{1}$ refers to a vector of $1$'s). The learned parameters $\bm{\gamma}$ and $\bm{\beta}\in\mathbb{R}^{D^{\prime}}$ of the layer normalization layer are finally applied to $\bm{X}$ as $\frac{\bm{X}-\bm{\mu}}{\bm{\sigma}}\bigodot\bm{\gamma}\bigoplus\bm{\beta}$ via the ECU. Here, $\bm{\mu}$ and $\bm{\sigma}\in\mathbb{R}^{D^{\prime}}$ are the mean and standard deviation vectors for the embeddings $\bm{X}$ (token-wise), respectively.
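
The split between aggregation and element-wise work can be illustrated as follows. Row sums are expressed as a product with an all-ones vector (the part the text maps to the HPPU), while squaring, scaling and the final $\bm{\gamma}$, $\bm{\beta}$ application are element-wise (the part mapped to the ECU). The small `eps` stabilizer and all function names are assumptions of this sketch, not details given in the text.

```python
import math

def row_sums_via_ones(x):
    """Aggregate each row as a matrix-vector product with an all-ones vector."""
    ones = [1.0] * len(x[0])
    return [sum(v * o for v, o in zip(row, ones)) for row in x]

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token (row) along the embedding axis, then scale and shift."""
    d = len(x[0])
    mean = [s / d for s in row_sums_via_ones(x)]
    centered = [[v - mu for v in row] for row, mu in zip(x, mean)]
    squared = [[v * v for v in row] for row in centered]  # element-wise multiply
    var = [s / d for s in row_sums_via_ones(squared)]
    std = [math.sqrt(v + eps) for v in var]               # eps is an assumed stabilizer
    return [[(v / s) * g + b for v, g, b in zip(row, gamma, beta)]
            for row, s in zip(centered, std)]

X = [[1.0, 2.0, 3.0], [2.0, 2.0, 8.0]]
normed = layer_norm(X, gamma=[1.0, 1.0, 1.0], beta=[0.0, 0.0, 0.0])
```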

The linear layer is mapped to the DBMM operation performed via the HPPU. Note that there is no direct notion of a 'head' associated with a simple linear layer. However, the output matrix can be partitioned into several 'fictitious heads' along the column axis. Then, each such 'fictitious head' is mapped to an HCU within the HPPU.

MSA Compute

The MSA compute consists of several stages, described below. The generated embedding matrix $\bm{Z}$ is used to generate the $\bm{Q}$, $\bm{K}$ and $\bm{V}$ matrices via the weights $\bm{W}_{Q}$, $\bm{W}_{K}$ and $\bm{W}_{V}$, respectively. These weights are naturally partitioned across the $H$ heads of the multi-headed self-attention mechanism within the encoder. Thus, the $\bm{Q}$, $\bm{K}$ and $\bm{V}$ matrices can be represented as follows,

$$\bm{Y}=[\bm{Y}_{1}\quad\bm{Y}_{2}\quad\dots\quad\bm{Y}_{H}]\quad\text{where}\quad\bm{Y}\in\{\bm{Q},\bm{K},\bm{V}\} \tag{4}$$

In eq. 4, the output matrix $\bm{Y}$ is split into several heads, with the output matrix of head $i$, $\bm{Y}_{i}$, computed by an HCU via DBMM. Here, $\bm{Y}$ refers to one of the $\bm{Q}$, $\bm{K}$ and $\bm{V}$ matrices.
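
The column-wise partition in eq. 4 can be checked with a small sketch: splitting the weight matrix into per-head column blocks and multiplying each block independently gives the same result as slicing the full product. The same partitioning applies to a plain linear layer split into column groups. The function names and toy values below are hypothetical, and the number of columns is assumed to be divisible by the number of heads.

```python
def matmul(a, b):
    """Dense product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(w, heads):
    """Split matrix w into `heads` equal column blocks."""
    width = len(w[0]) // heads
    return [[row[h * width:(h + 1) * width] for row in w] for h in range(heads)]

def per_head_projection(z, w, heads):
    """Compute Y_i = Z W_i for each head i independently (one HCU per head)."""
    return [matmul(z, w_i) for w_i in split_columns(w, heads)]

Z = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 2.0]]
heads_out = per_head_projection(Z, W, heads=2)
full = matmul(Z, W)
```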

Next, the raw attention matrix for each head is computed as below,

𝑨=[𝑨1𝑨2𝑨H]=𝑸𝑲T=[𝑸1𝑲1T𝑸2𝑲2T𝑸H𝑲HT]𝑨subscript𝑨1subscript𝑨2subscript𝑨𝐻𝑸superscript𝑲𝑇subscript𝑸1superscriptsubscript𝑲1𝑇subscript𝑸2superscriptsubscript𝑲2𝑇subscript𝑸𝐻superscriptsubscript𝑲𝐻𝑇\bm{A}=[\bm{A}_{1}\quad\bm{A}_{2}\quad...\quad\bm{A}_{H}]=\bm{Q}\bm{K}^{T}=[% \bm{Q}_{1}\bm{K}_{1}^{T}\quad\bm{Q}_{2}\bm{K}_{2}^{T}\quad...\quad\bm{Q}_{H}% \bm{K}_{H}^{T}]bold_italic_A = [ bold_italic_A start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT bold_italic_A start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT … bold_italic_A start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT ] = bold_italic_Q bold_italic_K start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT = [ bold_italic_Q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT bold_italic_K start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_italic_Q start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT bold_italic_K start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT … bold_italic_Q start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT bold_italic_K start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ] (5)

As eq. 5 suggests, attention is computed by multiplying the query and key matrices of corresponding heads. Each such output matrix $\bm{A}_i$ (for a head $i$) is computed by an HCU using DBMM. The softmax scores are computed using both the ECU and the HPPU. The ECU performs the scaling and exponentiation operations, and the HPPU performs the aggregations that generate the scale (normalization) factors required for softmax. To incorporate the LSA mechanism, a learned scaling factor $\lambda$ (instead of $\sqrt{d_k}$) is used to scale the raw attention matrix. Further, after scaling and before softmax, the values along the diagonal of the attention matrix are set to arbitrarily large negative values. After softmax, the attention scores along the diagonal (in each head) thus become close to $0$. This effectively removes a token's own value vector from the weighted aggregate of value vectors computed with its attention scores. As a result, the tokens focus more on inter-token relations.

Finally, the processed attention matrix comprising the attention scores is used along with the value matrix as below,

$$\bm{O} = \bm{S}\bm{V} = [\bm{S}_1\bm{V}_1 \quad \bm{S}_2\bm{V}_2 \quad \dots \quad \bm{S}_H\bm{V}_H] \tag{6}$$

In eq. 6, $\bm{S}$ (the attention score matrix) is computed from $\bm{A}$ as below,

$$\begin{aligned}
\bm{A}_{\lambda} &= \bm{A}/\lambda \\
\bm{A}_{\lambda}^{\prime} &= [\text{M}(\bm{A}_{\lambda 1}) \quad \text{M}(\bm{A}_{\lambda 2}) \quad \dots \quad \text{M}(\bm{A}_{\lambda H})] \\
\bm{S} &= \text{softmax}(\bm{A}_{\lambda}^{\prime})
\end{aligned} \tag{5.3}$$

In eq. 5.3, $\text{M}(\cdot)$ refers to the operation of setting the values along the diagonal of a matrix $\bm{A}_{\lambda i}$ (for the $i^{\text{th}}$ head) to an arbitrarily large negative value, approximating $-\infty$.
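The three steps of eq. 5.3 can be sketched for a single head as follows. This is a minimal software illustration, not the accelerator's ECU/HPPU dataflow: the function name `lsa_scores`, the constant `NEG_LARGE`, and the toy matrix are assumptions introduced here, and the row-maximum subtraction is a standard numerical-stability step that the text does not mention.

```python
import math

NEG_LARGE = -1e9  # assumed stand-in for the "arbitrarily large negative value"


def lsa_scores(A, lam):
    """Return softmax(M(A / lam)), row by row, for one head's square matrix A."""
    n = len(A)
    scores = []
    for i in range(n):
        # Scale by the learned factor lam, then mask the diagonal entry (M).
        row = [NEG_LARGE if i == j else A[i][j] / lam for j in range(n)]
        row_max = max(row)  # subtracted for numerical stability
        exps = [math.exp(x - row_max) for x in row]
        total = sum(exps)
        scores.append([e / total for e in exps])
    return scores


A = [[4.0, 1.0, 2.0],
     [0.5, 3.0, 0.5],
     [1.0, 2.0, 5.0]]
S = lsa_scores(A, lam=1.5)
for i, row in enumerate(S):
    # Each row still sums to 1, but the diagonal score is effectively 0.
    print(i, round(sum(row), 6), row[i])
```

Even though the diagonal entries of `A` are the largest in each row, the masked scores place all attention weight on the other tokens, which is the intended effect of LSA.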

The compute in eq. 6 is similar to that in eq. 5 and is performed by the HPPU through the DBMM operation. Finally, the projection matrix $\bm{W}_p$ is utilized to compute the final MSA output as $\text{MSA}(\bm{X}) = \bm{O}\bm{W}_p$. This compute is similar to computing the $\bm{Q}, \bm{K}, \bm{V}$ matrices and is also performed via the HPPU through the DBMM operation.
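To make the data flow of eq. 6 and the final projection concrete, the following sketch computes each head's output, concatenates the heads column-wise, and applies the projection matrix. The helper names `matmul` and `msa_output` and the toy values are assumptions for illustration; plain nested lists stand in for the blocked matrices that the hardware processes.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def msa_output(S_heads, V_heads, W_p):
    """Compute O = [S_1 V_1 ... S_H V_H] and return O W_p."""
    # Per-head weighted aggregation of value vectors.
    O_heads = [matmul(S, V) for S, V in zip(S_heads, V_heads)]
    num_tokens = len(O_heads[0])
    # Concatenate the per-head outputs column-wise, token by token.
    O = [[x for O_h in O_heads for x in O_h[t]] for t in range(num_tokens)]
    return matmul(O, W_p)


# Toy example: 2 tokens, 2 heads, head dimension 1, identity projection.
# Each score matrix has a zero diagonal, so a token only sees the other token.
S_heads = [[[0, 1], [1, 0]], [[0, 1], [1, 0]]]
V_heads = [[[1], [2]], [[3], [4]]]
W_p = [[1, 0], [0, 1]]
out = msa_output(S_heads, V_heads, W_p)
print(out)  # [[2, 4], [1, 3]]
```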

MLP Compute

The compute in the MLP stage of the encoder comprises two single-layer feed-forward neural networks. It is performed as DBMM via the HPPU, as described for the MSA compute stage.

6 EVALUATION

6.1 Experimental Setting

We implement the VTR model using PyTorch 2.0.1[28] and utilize an NVIDIA RTX A6000 GPU with CUDA 11.8 for training the model. Several model variants are trained to comprehensively evaluate the VTR model’s robustness and generalization across different small-sized SAR datasets. Specifically, we vary the following hyper-parameters:

  • Patch Size: We explore different patch sizes for partitioning the input image, to assess its impact on model performance.

  • Hidden Dimension: The dimensionality of the hidden layers within the transformer encoder is varied to observe its effects on model learning.

  • Depth: In our experiments, we explore varying depths of layers in the transformer encoder to assess the model’s sensitivity to depth.

  • Number of Heads: We use two settings of attention heads in the multi-headed self-attention mechanism to analyze its influence on model behavior.

We directly train the VTR model on three distinct SAR datasets: MSTAR [21], SynthWakeSAR [22], and GBSAR [23]. The three datasets represent a diverse range of SAR imaging scenarios and characteristics. The Adam optimization algorithm [29] is used for training, with an initial learning rate of $0.001$. A step learning rate scheduler is applied, with a step size of $20$ and a gamma value of $0.5$. The model is trained for $200$ epochs.
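The resulting learning-rate schedule can be written in closed form. The sketch below assumes the step size is measured in epochs and that the schedule follows the usual step-decay rule (as implemented, for example, by PyTorch's `torch.optim.lr_scheduler.StepLR`); the function name `step_lr` is hypothetical.

```python
def step_lr(epoch, base_lr=0.001, step_size=20, gamma=0.5):
    """Learning rate at a given epoch under step decay (assumed epoch-based)."""
    # The rate is multiplied by gamma once every step_size epochs.
    return base_lr * gamma ** (epoch // step_size)


for epoch in (0, 19, 20, 40, 199):
    print(epoch, step_lr(epoch))
```

Under these assumptions the rate is halved nine times over the 200 epochs, ending at roughly $2 \times 10^{-6}$.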

6.2 Hardware Implementation Details

The HPPU and ECU described in Section 5 are implemented on a state-of-the-art FPGA platform, the Xilinx Alveo U250. The main hyper-parameters associated with the proposed accelerator, $p_h$, $p_t$, $p_c$ and $p_{pe}$, are selected as $4$, $12$, $2$ and $8$, respectively. A $p_h$ of $4$ allows the accelerator to fully utilize the $4$ SLRs (Super Logic Regions) on the FPGA. $p_t$ was selected based on the nominal input token blocks (associated with the input image). A $p_c$ of $2$ allows both PE columns within each HCU (in each SLR) to concurrently access the FPGA BRAM/URAM using its dual-port memory. A $p_{pe}$ of $8$ is selected to support any block size $b$ that is a multiple of $8$ (such as $8, 16, 32$, etc.). The accelerator is designed using Xilinx High Level Synthesis (HLS) and synthesized using Xilinx Vitis v2022.2 with an achieved frequency of $300$ MHz.

6.3 Datasets

MSTAR: The original Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset consists of 5172 SAR images. It contains 10 categories of ground vehicles, with 2747 images in the training set and 2425 images in the testing set. Each image is of $128\times 128$ pixels. For our experiments, we use an augmented MSTAR dataset which consists of 27,000 SAR images in the training set and 2425 images in the testing set. Each image in this dataset is of size $88\times 88$, as the images are generated by cropping the images in the original MSTAR dataset in various directions.
SynthWakeSAR: SynthWakeSAR is a synthetic SAR imagery dataset comprising 92,160 images of 10 different real vessel models. There are 73,728 images in the training set and 18,432 images in the test set. Each image is of $128\times 128$ pixels. The images of each vessel contain the ship's wakes.
GBSAR: The Ground Based SAR (GBSAR) dataset consists of 6434 raw radar images, of which 5147 images are in the training set and 1287 images are in the test set. This dataset captures 7 different ceramic cups with rubber objects. Each raw SAR image has a size of $88\times 88$.

6.4 Platform Specification

We evaluate the performance of our accelerator for the SAR ATR applications against state-of-the-art CPU and GPU platforms, with the specifications of the platforms detailed in Table 1. Table 1 also contains specifications associated with the FPGA platform utilized for the hardware accelerator.

Table 1: Specifications of platforms

| | CPU | GPU | FPGA |
|---|---|---|---|
| Platform | AMD EPYC 9654 | NVIDIA RTX 6000 Ada | Xilinx Alveo U250 |
| Frequency | 2.4 GHz | 915 MHz | 300 MHz |
| Peak Performance (TFLOPS) | 3.69 | 91.06 | 1.8 |
| On-chip Memory | 384 MB | 96 MB | 36 MB |
| Memory Bandwidth | 461 GB/s | 960 GB/s | 77 GB/s |

6.5 Performance Metrics

We evaluate our model using the following metrics: classification accuracy, total MACs (multiply-accumulate operations), model parameters, inference latency and throughput. Classification accuracy is measured as the fraction of images correctly classified by the model. MACs and model parameters characterize the computational complexity of a given model: MACs measure the total multiply-accumulate operations that a model performs to compute the output, and model parameters are the total learnable parameters in a model. The inference latency is defined as the end-to-end latency (time) required for a model to compute the output given an input image, on a given platform. Finally, throughput is defined as the total number of images that can be processed on a given platform in a second, where processed refers to computing the output for an input image.
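The relation between the latency and throughput metrics can be sketched as below. The function name `throughput_images_per_s` is hypothetical, and the formula assumes that batches are processed back-to-back with no overlap between consecutive batches; the 0.19 ms figure is the FPGA latency reported for the best MSTAR model in Table 2.

```python
def throughput_images_per_s(latency_ms, batch_size=1):
    """Images per second, given the end-to-end latency of one batch in ms."""
    return batch_size * 1000.0 / latency_ms


# With a batch size of 1, throughput is simply the inverse of the latency.
fpga_throughput = throughput_images_per_s(0.19)
print(round(fpga_throughput))  # about 5263 images per second
```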

6.6 Experimental Results

The experimental results obtained across all three datasets are summarized in Tables 2-4. Since the images in the MSTAR and GBSAR datasets are sized $88\times 88$, our experiments use patch sizes of 8 and 11, along with hidden dimension sizes of 44 and 88. As the image size in the SynthWakeSAR dataset is $128\times 128$, we opt for patch sizes of 8 and 16, with hidden dimension sizes of 48 and 96. The depth of the transformer encoder is set to 4, 6, 8, or 12, and the number of heads is set to 2 or 4. The model is trained with each of these hyperparameter settings to assess its performance across different configurations. Latency for the FPGA, GPU and CPU platforms is the end-to-end latency defined in Section 6.5, measured from the time the input is sent to a processor to the time the processor finishes computing the output. Throughput for the three platforms is defined as per Section 6.5, measuring the total images that can be processed per second. For latency comparison across the $3$ platforms, we use a batch size of $1$, which represents a real-time scenario for SAR ATR applications wherein the input images are continuously streaming. To compare the throughput of the $3$ platforms, results are computed over batch sizes $1, 8, 16, 32, 64, 128, 256$ and $512$. Note that the FPGA throughput is nominally measured as the inverse of the per-image inference latency, and thus is invariant with respect to batch size.
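The patch sizes above all divide the corresponding image sizes evenly, which presumably motivates their choice. The sketch below counts the patch tokens per image for each setting. It assumes non-overlapping square patches, ignores any class token, and assumes that SPT leaves the patch grid unchanged; the function name `num_patch_tokens` is hypothetical.

```python
def num_patch_tokens(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("patch size must divide the image size")
    return (image_size // patch_size) ** 2


settings = {
    "MSTAR / GBSAR (88 x 88)": (88, [8, 11]),
    "SynthWakeSAR (128 x 128)": (128, [8, 16]),
}
for name, (image_size, patch_sizes) in settings.items():
    for p in patch_sizes:
        print(name, "patch", p, "->", num_patch_tokens(image_size, p), "tokens")
```

Smaller patches produce more tokens per image (121 versus 64 for $88\times 88$ inputs, 256 versus 64 for $128\times 128$ inputs), which is consistent with the higher MAC counts reported for patch size 8.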

Table 2: Performance on MSTAR dataset

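| Patch size | Hidden dimension | Depth | # Heads | Accuracy | # MACs | # Parameters | CPU latency (ms) | GPU latency (ms) | FPGA latency (ms) |
|---|---|---|---|---|---|---|---|---|---|
| 8 | 44 | 4 | 2 | 89.61% | 0.878G | 109.99K | 17.64 | 2.03 | 0.037 |
| 8 | 44 | 4 | 4 | 87.84% | 0.901G | 109.99K | 10.05 | 2.16 | 0.034 |
| 8 | 44 | 6 | 2 | 90.93% | 1.26G | 157.33K | 10.99 | 2.95 | 0.056 |
| 8 | 44 | 6 | 4 | 89.57% | 1.29G | 157.33K | 11.10 | 2.97 | 0.052 |
| 8 | 44 | 8 | 2 | 91.05% | 1.64G | 204.68K | 11.35 | 3.84 | 0.075 |
| 8 | 44 | 8 | 4 | 90.68% | 1.68G | 204.68K | 13.24 | 3.85 | 0.069 |
| 8 | 44 | 12 | 2 | 94.06% | 2.40G | 299.37K | 13.17 | 5.61 | 0.11 |
| 8 | 44 | 12 | 4 | 91.51% | 2.47G | 299.37K | 13.62 | 5.62 | 0.10 |
| 8 | 88 | 4 | 2 | 90.47% | 3.18G | 405.19K | 5.04 | 2.18 | 0.067 |
| 8 | 88 | 4 | 4 | 91.30% | 3.20G | 405.19K | 5.21 | 2.18 | 0.064 |
| 8 | 88 | 6 | 2 | 93.98% | 4.65G | 592.80K | 7.15 | 2.98 | 0.10 |
| 8 | 88 | 6 | 4 | 93.94% | 4.68G | 592.80K | 7.39 | 2.99 | 0.097 |
| 8 | 88 | 8 | 2 | 94.35% | 6.12G | 780.42K | 9.30 | 3.88 | 0.13 |
| 8 | 88 | 8 | 4 | 93.86% | 6.17G | 780.42K | 9.75 | 3.89 | 0.13 |
| 8 | 88 | 12 | 2 | 95.18% | 9.07G | 1156K | 13.80 | 5.71 | 0.20 |
| 8 | 88 | 12 | 4 | 95.96% | 9.14G | 1156K | 14.41 | 5.71 | 0.19 |
| 11 | 44 | 4 | 2 | 86.27% | 0.517G | 123.10K | 3.50 | 2.17 | 0.029 |
| 11 | 44 | 4 | 4 | 84.33% | 0.524G | 123.10K | 3.61 | 2.17 | 0.028 |
| 11 | 44 | 6 | 2 | 88.25% | 0.717G | 170.44K | 5.05 | 2.94 | 0.043 |
| 11 | 44 | 6 | 4 | 89.49% | 0.727G | 170.44K | 5.18 | 2.95 | 0.042 |
| 11 | 44 | 8 | 2 | 90.56% | 0.917G | 217.79K | 6.55 | 3.84 | 0.058 |
| 11 | 44 | 8 | 4 | 89.81% | 0.930G | 217.79K | 6.60 | 3.83 | 0.056 |
| 11 | 44 | 12 | 2 | 91.96% | 1.32G | 312.48K | 9.53 | 5.61 | 0.087 |
| 11 | 44 | 12 | 4 | 91.05% | 1.34G | 312.48K | 9.67 | 5.62 | 0.084 |
| 11 | 88 | 4 | 2 | 90.23% | 1.79G | 430.83K | 4.02 | 2.16 | 0.052 |
| 11 | 88 | 4 | 4 | 89.49% | 1.80G | 430.83K | 4.10 | 2.17 | 0.051 |
| 11 | 88 | 6 | 2 | 91.92% | 2.58G | 618.45K | 5.79 | 2.95 | 0.078 |
| 11 | 88 | 6 | 4 | 91.75% | 2.58G | 618.45K | 5.87 | 2.95 | 0.076 |
| 11 | 88 | 8 | 2 | 93.94% | 3.36G | 806.07K | 7.59 | 3.85 | 0.10 |
| 11 | 88 | 8 | 4 | 92.62% | 3.37G | 806.07K | 7.62 | 3.85 | 0.10 |
| 11 | 88 | 12 | 2 | 94.72% | 4.92G | 1181K | 10.97 | 5.63 | 0.15 |
| 11 | 88 | 12 | 4 | 93.24% | 4.94G | 1181K | 11.29 | 5.66 | 0.15 |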

Table 3: Performance on SynthWakeSAR dataset

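| Patch size | Hidden dimension | Depth | # Heads | Accuracy | # MACs | # Parameters | CPU latency (ms) | GPU latency (ms) | FPGA latency (ms) |
|---|---|---|---|---|---|---|---|---|---|
| 8 | 48 | 4 | 2 | 90.04% | 2.22G | 129.15K | 5.43 | 2.20 | 0.085 |
| 8 | 48 | 4 | 4 | 90.57% | 2.32G | 129.15K | 5.39 | 2.21 | 0.062 |
| 8 | 48 | 6 | 2 | 91.02% | 3.19G | 185.40K | 7.67 | 3.01 | 0.128 |
| 8 | 48 | 6 | 4 | 91.16% | 3.34G | 185.40K | 8.17 | 3.01 | 0.093 |
| 8 | 48 | 8 | 2 | 91.33% | 4.16G | 241.66K | 10.15 | 3.90 | 0.17 |
| 8 | 48 | 8 | 4 | 91.57% | 4.37G | 241.66K | 10.90 | 3.92 | 0.12 |
| 8 | 48 | 12 | 2 | 92.16% | 6.11G | 354.17K | 14.83 | 5.72 | 0.25 |
| 8 | 48 | 12 | 4 | 92.18% | 6.41G | 354.17K | 15.78 | 5.72 | 0.18 |
| 8 | 96 | 4 | 2 | 92.55% | 7.95G | 478.83K | 6.82 | 2.23 | 0.17 |
| 8 | 96 | 4 | 4 | 92.56% | 8.06G | 478.83K | 6.64 | 2.23 | 0.16 |
| 8 | 96 | 6 | 2 | 93.34% | 11.67G | 701.93K | 9.70 | 3.05 | 0.25 |
| 8 | 96 | 6 | 4 | 93.35% | 11.82G | 701.93K | 9.43 | 3.05 | 0.24 |
| 8 | 96 | 8 | 2 | 93.05% | 15.38G | 925.03K | 12.76 | 3.98 | 0.34 |
| 8 | 96 | 8 | 4 | 93.44% | 15.58G | 925.03K | 12.17 | 3.97 | 0.32 |
| 8 | 96 | 12 | 2 | 93.21% | 22.80G | 1371K | 18.39 | 5.82 | 0.51 |
| 8 | 96 | 12 | 4 | 93.47% | 23.11G | 1371K | 17.83 | 5.81 | 0.48 |
| 16 | 48 | 4 | 2 | 84.83% | 0.745G | 177.15K | 3.61 | 2.19 | 0.029 |
| 16 | 48 | 4 | 4 | 85.35% | 0.752G | 177.15K | 3.72 | 2.07 | 0.029 |
| 16 | 48 | 6 | 2 | 86.52% | 0.982G | 233.40K | 5.15 | 2.98 | 0.044 |
| 16 | 48 | 6 | 4 | 87.23% | 0.992G | 233.40K | 5.31 | 2.98 | 0.043 |
| 16 | 48 | 8 | 2 | 87.66% | 1.22G | 289.66K | 6.69 | 3.88 | 0.059 |
| 16 | 48 | 8 | 4 | 88.01% | 1.23G | 289.66K | 6.85 | 3.88 | 0.058 |
| 16 | 48 | 12 | 2 | 89.45% | 1.69G | 402.17K | 9.57 | 5.67 | 0.089 |
| 16 | 48 | 12 | 4 | 88.53% | 1.71G | 402.17K | 9.81 | 5.67 | 0.087 |
| 16 | 96 | 4 | 2 | 90.29% | 2.38G | 572.91K | 4.30 | 2.19 | 0.053 |
| 16 | 96 | 4 | 4 | 89.99% | 2.39G | 572.91K | 4.29 | 2.20 | 0.052 |
| 16 | 96 | 6 | 2 | 90.75% | 3.31G | 796.01K | 6.11 | 3.01 | 0.08 |
| 16 | 96 | 6 | 4 | 91.22% | 3.32G | 796.01K | 6.09 | 3.02 | 0.079 |
| 16 | 96 | 8 | 2 | 91.95% | 4.24G | 1019K | 7.80 | 3.94 | 0.10 |
| 16 | 96 | 8 | 4 | 91.25% | 4.26G | 1019K | 7.83 | 3.93 | 0.10 |
| 16 | 96 | 12 | 2 | 91.91% | 6.10G | 1465K | 11.26 | 5.74 | 0.16 |
| 16 | 96 | 12 | 4 | 91.85% | 6.12G | 1465K | 11.49 | 5.76 | 0.15 |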

Table 4: Performance on GBSAR dataset

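| Patch size | Hidden dimension | Depth | # Heads | Accuracy | # MACs | # Parameters | CPU latency (ms) | GPU latency (ms) | FPGA latency (ms) |
|---|---|---|---|---|---|---|---|---|---|
| 8 | 44 | 4 | 2 | 98.99% | 0.878G | 109.86K | 4.44 | 2.06 | 0.037 |
| 8 | 44 | 4 | 4 | 98.45% | 0.901G | 109.86K | 4.60 | 2.18 | 0.034 |
| 8 | 44 | 6 | 2 | 98.76% | 1.26G | 157.20K | 6.24 | 2.97 | 0.056 |
| 8 | 44 | 6 | 4 | 98.91% | 1.29G | 157.20K | 6.61 | 2.98 | 0.052 |
| 8 | 44 | 8 | 2 | 99.07% | 1.64G | 204.54K | 8.06 | 3.86 | 0.075 |
| 8 | 44 | 8 | 4 | 98.91% | 1.68G | 204.54K | 8.54 | 3.88 | 0.069 |
| 8 | 44 | 12 | 2 | 98.91% | 2.40G | 299.23K | 11.65 | 5.65 | 0.11 |
| 8 | 44 | 12 | 4 | 99.07% | 2.47G | 299.23K | 12.72 | 5.66 | 0.10 |
| 8 | 88 | 4 | 2 | 99.30% | 3.18G | 404.92K | 5.19 | 2.19 | 0.067 |
| 8 | 88 | 4 | 4 | 99.30% | 3.20G | 404.92K | 5.39 | 2.20 | 0.064 |
| 8 | 88 | 6 | 2 | 99.46% | 4.65G | 592.54K | 7.26 | 3.01 | 0.10 |
| 8 | 88 | 6 | 4 | 98.99% | 4.68G | 592.54K | 7.78 | 3.02 | 0.097 |
| 8 | 88 | 8 | 2 | 99.07% | 6.12G | 780.15K | 9.52 | 3.92 | 0.13 |
| 8 | 88 | 8 | 4 | 98.76% | 6.17G | 780.15K | 10.03 | 3.92 | 0.13 |
| 8 | 88 | 12 | 2 | 99.30% | 9.07G | 1155K | 13.84 | 5.74 | 0.20 |
| 8 | 88 | 12 | 4 | 98.91% | 9.14G | 1155K | 14.38 | 5.75 | 0.19 |
| 11 | 44 | 4 | 2 | 97.75% | 0.518G | 122.97K | 3.54 | 2.18 | 0.029 |
| 11 | 44 | 4 | 4 | 98.37% | 0.524G | 122.97K | 3.64 | 2.18 | 0.028 |
| 11 | 44 | 6 | 2 | 98.37% | 0.717G | 170.31K | 5.10 | 2.97 | 0.043 |
| 11 | 44 | 6 | 4 | 98.21% | 0.727G | 170.31K | 5.37 | 2.97 | 0.042 |
| 11 | 44 | 8 | 2 | 98.60% | 0.917G | 217.65K | 6.61 | 3.87 | 0.058 |
| 11 | 44 | 8 | 4 | 98.76% | 0.930G | 217.65K | 6.81 | 3.88 | 0.056 |
| 11 | 44 | 12 | 2 | 98.37% | 1.32G | 312.34K | 9.52 | 5.66 | 0.087 |
| 11 | 44 | 12 | 4 | 97.90% | 1.34G | 312.34K | 9.85 | 5.66 | 0.084 |
| 11 | 88 | 4 | 2 | 98.29% | 1.79G | 430.57K | 4.16 | 2.17 | 0.052 |
| 11 | 88 | 4 | 4 | 98.60% | 1.80G | 430.57K | 4.21 | 2.18 | 0.051 |
| 11 | 88 | 6 | 2 | 99.15% | 2.58G | 618.19K | 5.92 | 2.98 | 0.078 |
| 11 | 88 | 6 | 4 | 99.07% | 2.58G | 618.19K | 5.99 | 2.99 | 0.076 |
| 11 | 88 | 8 | 2 | 99.22% | 3.36G | 805.80K | 7.62 | 3.86 | 0.10 |
| 11 | 88 | 8 | 4 | 98.68% | 3.37G | 805.80K | 7.76 | 3.88 | 0.10 |
| 11 | 88 | 12 | 2 | 99.07% | 4.92G | 1181K | 11.11 | 5.67 | 0.15 |
| 11 | 88 | 12 | 4 | 98.45% | 4.94G | 1181K | 11.20 | 5.68 | 0.15 |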

6.7 Discussion

Through the experiments, we observe that the performance of the VTR model varies across the three datasets for different hyperparameter settings. Notably, the best classification accuracies (see Tables 2-4) for the MSTAR and SynthWakeSAR datasets (95.96% and 93.47%, respectively) are achieved with similar hyperparameter configurations: a patch size of 8, a depth of 12, 4 heads, and the larger of the two hidden dimensions. For the GBSAR dataset, however, a shallower encoder (depth 6) gives the best accuracy. The results indicate that a smaller patch size generally yields higher classification accuracy, as smaller patches can capture fine-grained details in the input SAR images. It is also worth noting that the SPT and LSA modules are crucial for the performance improvements achieved by the VTR model, as the standard ViT performs poorly on SAR datasets unless pre-trained on significantly larger datasets.

In comparison to the benchmarking conducted by Fein-Ashley et al. [24] on the three datasets using CNNs, GNNs, and ViTs, our VTR model equipped with the SPT and LSA modules demonstrates improved or similar performance. In particular, our model outperforms the state-of-the-art on the SynthWakeSAR dataset, achieving an accuracy of 93.47%. For the GBSAR dataset, we achieve a comparable classification accuracy of 99.46%. For the MSTAR dataset, however, GNNs outperform all other models. We attribute this to the fact that images in this dataset contain the actual target ground vehicle only within a few pixels in the center of the image. This limits VTR's capability to extract localized features, since most tokens are non-informative, despite the addition of the SPT and LSA modules. In contrast, GNNs excel at capturing this local information, leading to better performance. These comparison results are summarized in Table 5. We define the best-performing model as the model that achieves the highest classification accuracy. Table 6 compares the number of parameters in the best-performing models. Compared with previous work, our model has fewer parameters, leading to a smaller model size. Compared against standard pre-trained ViTs with total parameters of the order $10^8$, VTR is significantly smaller (about $100\times$), with the largest model having on the order of $10^6$ parameters.

The proposed FPGA accelerator, for the best-performing model across the three datasets, has an average speedup (based on single image inference latency) of $30\times$ and $70\times$ when compared to GPU and CPU platforms, respectively. For smaller models, the accelerator has a much higher speedup than the nominal speedup (associated with larger models). Thus, it is highly suitable for deployment in real-time SAR ATR workloads. Figure 6 compares the CPU and GPU throughput of the best-performing model (for each dataset) against the baseline FPGA throughput (green line). For smaller batch sizes of $1, 8, 16$ and $32$, the FPGA throughput is much better than that of its CPU and GPU counterparts. With larger batch sizes, as expected, the GPU throughput overtakes the FPGA throughput. This suggests that despite being optimized for streaming single image input inferences, our proposed FPGA accelerator can also be used for smaller batch sizes such as $8$ or $16$ with a throughput that is still, on average, $2\times$ that of the GPU throughput.
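Per-dataset speedups can be derived directly from the single-image latencies in Tables 2-4, as sketched below for the best-performing model on each dataset. The latency values are copied from those tables; the dictionary layout and the function name `speedup` are assumptions for illustration. Because the tabulated latencies are rounded, these ratios need not reproduce the reported averages exactly.

```python
# (CPU, GPU, FPGA) latency in ms for the best-performing model per dataset,
# taken from Tables 2-4.
best_model_latency_ms = {
    "MSTAR": (14.41, 5.71, 0.19),
    "SynthWakeSAR": (17.83, 5.81, 0.48),
    "GBSAR": (7.26, 3.01, 0.10),
}


def speedup(baseline_ms, accelerated_ms):
    """Latency speedup of the accelerated platform over the baseline."""
    return baseline_ms / accelerated_ms


speedups = {
    name: (speedup(cpu, fpga), speedup(gpu, fpga))
    for name, (cpu, gpu, fpga) in best_model_latency_ms.items()
}
for name, (vs_cpu, vs_gpu) in speedups.items():
    print(f"{name}: {vs_cpu:.1f}x over CPU, {vs_gpu:.1f}x over GPU")
```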

Figure 6: Throughput on CPU, GPU, and FPGA

Table 5: Comparison of classification accuracy

| Dataset | VTR | ResNet18 [30] | ResNet34 [30] | ResNet50 [30] | SS-ViT [20] | Multi-layer GNN [17] |
|---|---|---|---|---|---|---|
| MSTAR | 95.96% | 98.47% | 98.64% | 90.34% | 95.61% | 99.09% |
| SynthWakeSAR | 93.47% | 90.30% | 92.14% | 92.42% | 87.98% | 91.15% |
| GBSAR | 99.46% | 99.30% | 99.99% | 99.53% | 99.04% | 98.67% |
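
Table 6: Comparison of the # parameters in the best performing models with prior work

| Dataset | Model | # Parameters |
|---|---|---|
| MSTAR | VTR | 1.16M |
| MSTAR | Multi-layer GNN | 1.27M |
| SynthWakeSAR | VTR | 1.37M |
| SynthWakeSAR | ResNet50 | 23.5M |
| GBSAR | VTR | 0.59M |
| GBSAR | ResNet34 | 21.3M |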
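
The model-size reductions quoted in the abstract ($1.1\times$, $17\times$ and $36\times$) follow from the parameter counts in Table 6, as the short sketch below shows. The dictionary layout and variable names are assumptions for illustration; the parameter counts (in millions) are taken from Table 6.

```python
# (VTR parameters, prior best model parameters), in millions, from Table 6.
params_millions = {
    "MSTAR": (1.16, 1.27),         # prior work: Multi-layer GNN
    "SynthWakeSAR": (1.37, 23.5),  # prior work: ResNet50
    "GBSAR": (0.59, 21.3),         # prior work: ResNet34
}

size_ratio = {
    name: prior / vtr for name, (vtr, prior) in params_millions.items()
}
for name, ratio in size_ratio.items():
    print(f"{name}: VTR is {ratio:.1f}x smaller")
```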

7 CONCLUSION AND FUTURE WORK

In this paper, we developed VTR, a lightweight ViT model tailored for SAR ATR applications, addressing the challenges of limited training data and model computational complexity. Our experimental results demonstrated the effectiveness of the proposed model across diverse SAR datasets, achieving results better than or comparable to prior work. The proposed FPGA accelerator is suitable for real-time SAR ATR workloads with tight latency constraints, achieving significantly lower latency than state-of-the-art GPU and CPU platforms. In future work, we will explore multi-modal datasets, such as the EO-SAR dataset, for the SAR ATR application and how such multi-modality can be exploited to improve the model performance. Furthermore, we will explore novel hybrid ViT and GNN architectures to overcome the performance limitations of purely ViT-based approaches on MSTAR-like data. To this end, we will study approaches that exploit ViT's global and GNN's local inductive bias.

8 ACKNOWLEDGEMENT

This work is supported by the DEVCOM Army Research Lab (ARL) under grant W911NF2220159 and the National Science Foundation (NSF) under grants SPX-2333009 and SaTC-2104264. Equipment and support by AMD AECG are greatly appreciated.
Distribution Statement A: Approved for public release. Distribution is unlimited.

References

  • [1] Tsokas, A., Rysz, M., Pardalos, P. M., and Dipple, K., “SAR data applications in earth observation: An overview,” Expert Systems with Applications 205, 117342 (2022).
  • [2] Reigber, A., Scheiber, R., Jager, M., Prats-Iraola, P., Hajnsek, I., Jagdhuber, T., Papathanassiou, K. P., Nannini, M., Aguilera, E., Baumgartner, S., Horn, R., Nottensteiner, A., and Moreira, A., “Very-high-resolution airborne synthetic aperture radar imaging: Signal processing and applications,” Proceedings of the IEEE 101(3), 759–783 (2013).
  • [3] Li, J., Yu, Z., Yu, L., Cheng, P., Chen, J., and Chi, C., “A comprehensive survey on SAR ATR in deep-learning era,” Remote Sensing 15(5) (2023).
  • [4] Moreira, A., Prats-Iraola, P., Younis, M., Krieger, G., Hajnsek, I., and Papathanassiou, K. P., “A tutorial on synthetic aperture radar,” IEEE Geoscience and Remote Sensing Magazine 1(1), 6–43 (2013).
  • [5] Ding, J., Chen, B., Liu, H., and Huang, M., “Convolutional neural network with data augmentation for SAR target recognition,” IEEE Geoscience and Remote Sensing Letters 13(3), 364–368 (2016).
  • [6] Chen, S., Wang, H., Xu, F., and Jin, Y.-Q., “Target classification using the deep convolutional networks for SAR images,” IEEE Transactions on Geoscience and Remote Sensing 54(8), 4806–4817 (2016).
  • [7] Zhang, B., Wijeratne, S., Kannan, R., Prasanna, V., and Busart, C., “Graph neural network for accurate and low-complexity SAR ATR,” arXiv preprint arXiv:2305.07119 (2023).
  • [8] Wang, R., Wang, L., Wei, X., Chen, J.-W., and Jiao, L., “Dynamic graph-level neural network for SAR image change detection,” IEEE Geoscience and Remote Sensing Letters 19, 1–5 (2022).
  • [9] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N., “An image is worth 16x16 words: Transformers for image recognition at scale,” (2021).
  • [10] Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S., and Shah, M., “Transformers in vision: A survey,” ACM Computing Surveys 54, 1–41 (Jan. 2022).
  • [11] Chen, L., Luo, R., Xing, J., Li, Z., Yuan, Z., and Cai, X., “Geospatial transformer is what you need for aircraft detection in SAR imagery,” IEEE Transactions on Geoscience and Remote Sensing 60, 1–15 (2022).
  • [12] Liu, X., Wu, Y., Liang, W., Cao, Y., and Li, M., “High resolution SAR image classification using global-local network structure based on vision transformer and CNN,” IEEE Geoscience and Remote Sensing Letters 19, 1–5 (2022).
  • [13] Wang, C., Huang, Y., Liu, X., Pei, J., Zhang, Y., and Yang, J., “Global in local: A convolutional transformer for SAR ATR FSL,” IEEE Geoscience and Remote Sensing Letters 19, 1–5 (2022).
  • [14] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L., “ImageNet: A large-scale hierarchical image database,” in [2009 IEEE Conference on Computer Vision and Pattern Recognition], 248–255 (2009).
  • [15] Dong, H., Zhang, L., and Zou, B., “Exploring vision transformers for polarimetric SAR image classification,” IEEE Transactions on Geoscience and Remote Sensing 60, 1–15 (2022).
  • [16] Zhou, Y., Jiang, X., Xu, G., Yang, X., Liu, X., and Li, Z., “PVT-SAR: An arbitrarily oriented SAR ship detector with pyramid vision transformer,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 16, 291–305 (2023).
  • [17] Zhang, B., Kannan, R., Prasanna, V., and Busart, C., “Accurate, low-latency, efficient SAR automatic target recognition on FPGA,” in [2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)], 1–8 (2022).
  • [18] Zhang, B., Kannan, R., Prasanna, V., and Busart, C., “Accelerating GNN-based SAR automatic target recognition on HBM-enabled FPGA,” in [2023 IEEE High Performance Extreme Computing Conference (HPEC)], 1–7 (2023).
  • [19] Fein-Ashley, J., Ye, T., Wickramasinghe, S., Zhang, B., Kannan, R., and Prasanna, V., “A single graph convolution is all you need: Efficient grayscale image classification,” arXiv preprint arXiv:2402.00564 (2024).
  • [20] Lee, S. H., Lee, S., and Song, B. C., “Vision transformer for small-size datasets,” arXiv preprint arXiv:2112.13492 (2021).
  • [21] “MSTAR dataset.” https://www.sdms.afrl.af.mil/index.php?collection=mstar. Accessed: 2024-03-27.
  • [22] Rizaev, I. G. and Achim, A., “SynthWakeSAR: A synthetic SAR dataset for deep learning classification of ships at sea,” Remote Sensing 14(16), 3999 (2022).
  • [23] Turčinović, F., Kačan, M., Bojanjac, D., and Bosiljevac, M., “Deep learning approach based on GBSAR data for detection of defects in packed objects,” in [2023 17th European Conference on Antennas and Propagation (EuCAP)], 1–4 (2023).
  • [24] Fein-Ashley, J., Ye, T., Kannan, R., Prasanna, V., and Busart, C., “Benchmarking deep learning classifiers for SAR automatic target recognition,” in [2023 IEEE High Performance Extreme Computing Conference (HPEC)], 1–6, IEEE (2023).
  • [25] Morgan, D. A., “Deep convolutional neural networks for ATR from SAR imagery,” in [Algorithms for Synthetic Aperture Radar Imagery XXII], 9475, 116–128, SPIE (2015).
  • [26] Li, S., Lang, P., Fu, X., Jiang, J., Dong, J., and Nie, Z., “Automatic target recognition of SAR images based on transformer,” in [2021 CIE International Conference on Radar (Radar)], 938–941, IEEE (2021).
  • [27] He, Y.-L., Zhang, X.-L., Ao, W., and Huang, J. Z., “Determining the optimal temperature parameter for softmax function in reinforcement learning,” Applied Soft Computing 70, 80–85 (2018).
  • [28] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S., “PyTorch: An imperative style, high-performance deep learning library,” in [Advances in Neural Information Processing Systems], Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., eds., 32, Curran Associates, Inc. (2019).
  • [29] Kingma, D. P. and Ba, J., “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 (2014).
  • [30] He, K., Zhang, X., Ren, S., and Sun, J., “Deep residual learning for image recognition,” in [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition], 770–778 (2016).