Article

SGSAFormer: Spike Gated Self-Attention Transformer and Temporal Attention

1
School of Mechatronic Engineering and Automation, Shanghai University, Shanghai 200444, China
2
School of Future Technology, Shanghai University, Shanghai 200444, China
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(1), 43; https://doi.org/10.3390/electronics14010043
Submission received: 12 November 2024 / Revised: 18 December 2024 / Accepted: 23 December 2024 / Published: 26 December 2024

Abstract

Spiking neural networks (SNNs), a neural network model structure inspired by the human brain, have emerged as a more energy-efficient deep learning paradigm due to their unique spike-based transmission and event-driven characteristics. Combining SNNs with the Transformer model significantly enhances SNN performance while maintaining good energy efficiency. The gating mechanism, which dynamically adjusts input data and controls information flow, plays an important role in artificial neural networks (ANNs). Here, we introduce this gating mechanism into SNNs and propose a novel spike Transformer model, called SGSAFormer, based on the Spikformer network architecture. We introduce the Spike Gated Linear Unit (SGLU) module to improve the multi-layer perceptron (MLP) module in SNNs by adding a gating mechanism to enhance the model’s expressive power. We also incorporate Spike Gated Self-Attention (SGSA) to strengthen the network’s attention mechanism, improving its ability to capture temporal information and dynamic processing. Additionally, we propose a Temporal Attention (TA) module, which filters the input data along the temporal dimension and can substantially reduce energy consumption with only a slight decrease in accuracy. To validate the effectiveness of our approach, we conducted extensive experiments on several neuromorphic datasets. Our model achieves accuracy that matches or exceeds that of other state-of-the-art SNN models on most of these benchmarks.

1. Introduction

Spiking neural networks (SNNs) are inspired by biological brains and are characterized by low power consumption and high biological plausibility. They are considered the third generation of neural networks [1]. SNNs transmit information by simulating the dynamics of biological neurons. Unlike the neurons in traditional artificial neural networks (ANNs), spiking neurons accumulate membrane potential, and when the accumulated potential exceeds the firing threshold, the neuron emits a spike to transmit information. Compared with ANNs, SNNs achieve low power consumption, low latency, and high computational efficiency for two main reasons. First, SNNs transmit information through spikes, which are binary signals. This allows many operations in the network to be performed through addition rather than multiplication, significantly reducing computation by replacing multiply-and-accumulate (MAC) operations with accumulate (AC) operations. Second, the computation in SNNs is asynchronous and event-driven, meaning computation occurs only when a spike is emitted, which avoids processing unnecessary zero-valued data. Additionally, spike activity in the network is sparse, with only a small portion of neurons being active at any given time. SNNs, with their binary spike mechanism and event-driven computation, are particularly suitable for deployment on neuromorphic chips such as TrueNorth [2], Loihi [3], and Tianjic [4]. With these event-driven, bio-inspired, and spike transmission characteristics, SNNs have emerged as a significant direction in brain-like computing.
The Transformer model is a neural network architecture based on the self-attention mechanism [5] for processing sequence-to-sequence tasks. It has demonstrated strong performance in tasks such as natural language processing (NLP) [6,7], image classification [8], object detection [9], and semantic segmentation [10]. The Vision Transformer, which is also based on the attention mechanism [8], replaces the traditional convolutional structure with a Transformer-based structure to perform computer vision tasks, surpassing Convolutional Neural Networks (CNNs) in effectiveness for many tasks. The Transformer architecture has become a mainstream method in the field of artificial intelligence, owing to its self-attention mechanism and parallelism capabilities. Recently, numerous studies have explored the application of Transformer-based models to SNNs in order to leverage the self-attention mechanism and improve SNNs performance. Spikformer proposes a spiking self-attention (SSA) mechanism [11], combining the Transformer structure with SNNs to enhance their performance. Spikformer has achieved a significant performance improvement in the field of SNNs. However, its performance still lags behind that of the state-of-the-art ANN models. On the ImageNet dataset, Spikformer achieves an accuracy of 74.81%, representing an advanced level of performance within the SNNs domain. Nevertheless, Transformer models of comparable scale can achieve accuracies exceeding 80%. Furthermore, while Spikformer avoids the complexity of the softmax operation in implementing the self-attention mechanism, it still requires the processing of sparse spiking-form query, key, and value representations. This introduces a certain level of computational complexity, particularly when dealing with large-scale datasets. This indicates that there remains a gap between spiking Transformers and Transformer architectures in ANNs, suggesting potential for further improvement and development of spiking Transformers. 
However, on the ImageNet test set, Spikformer demonstrates an average theoretical energy consumption of 11.577 mJ per image prediction, whereas a Transformer with the same architecture consumes an average of 38.340 mJ, which is 3.31 times higher than that of Spikformer. This lower energy consumption highlights the advantage of SNNs over ANNs in terms of energy efficiency, owing to the spike-driven nature of SNNs.
In this paper, we propose a Temporal Attention (TA) module that partitions the temporal data of input neuromorphic signals based on the number of events. Using an attention mechanism, this module determines the importance of input data, allowing the model to ignore less significant data during training and inference. This approach significantly reduces computational costs, energy consumption, and latency while causing only a slight drop in accuracy. The TA module serves as a preprocessing module that weights input data temporally and can be integrated with any network model to mitigate the impact of less important input data. Furthermore, we integrate a gating mechanism with Spikformer and propose a novel spiking Transformer architecture, termed SGSAFormer. The core component of Spikformer is SSA, which generates three spiking matrices: query, key, and value. These matrices are used to compute the attention matrix through dot products or similar operations. Because spiking matrices are sparse and contain only non-negative elements (0 or 1), the softmax function is not required to ensure the non-negativity of the attention matrix. SGSAFormer introduces a Spike Gated Self-Attention (SGSA) mechanism, which enhances attention control by introducing an additional gate spiking matrix that is multiplied element-wise with the generated attention matrix to provide gated enhancement of attention. Additionally, SGSAFormer replaces the multi-layer perceptron (MLP) module in Spikformer with a Spike Gated Linear Unit (SGLU) to further enhance data mapping and feature representation through gating mechanisms. We evaluate SGSAFormer on multiple neuromorphic datasets, and the experimental results demonstrate that SGSAFormer outperforms Spikformer across various benchmarks. Specifically, on the CIFAR10-DVS dataset, the model achieves an accuracy of 85.0%, a 4.1% improvement over the baseline. On the DVS128Gesture dataset, it achieves an accuracy of 99.0%, outperforming the baseline by 0.7%.
The main contributions of this paper are as follows:
  • We propose a TA module that offers low latency and low power consumption during training and inference, and apply it to SNNs.
  • We combine the gating mechanism with SNNs by proposing SGLU and SGSA, which introduce gating into the SNN domain.
  • Based on SGLU and SGSA, we propose the SGSAFormer model, an improvement of Spikformer that combines the gating mechanism with the spiking Transformer to improve information control in the network, and we validate it on a variety of neuromorphic datasets, where it achieves advanced performance.

2. Related Works

2.1. Learning Methods

The learning methods of SNNs can be broadly classified into two categories: conversion-based methods [12,13] and direct training methods. The ANN to SNN conversion training method involves pre-training an ANN to obtain the weights, and then converting these weights and the neuron activation functions into the spike-firing and time-encoding mechanisms used in SNNs. This method achieves an accuracy almost identical to that of ANNs, but it suffers from longer inference delays and has limited adaptability to complex datasets. Spike-Timing-Dependent Plasticity (STDP) is a biologically inspired unsupervised learning algorithm [14,15] that mimics the learning process of biological neurons. The STDP algorithm adjusts synaptic strengths based on the timing difference between pre- and post-synaptic spikes, with a focus on the connections between adjacent layers of the network. As a result, this method is suitable only for small-scale networks and faces accuracy issues in multi-layer networks. Additionally, it is more susceptible to noise interference. The Backpropagation Through Time (BPTT) algorithm [16,17] is a supervised learning method that addresses the non-differentiability issue in the backpropagation of SNNs by optimizing with surrogate gradients. This improves the efficiency of SNNs training and enhances network performance, allowing many ANN-based methods to be applied to SNNs. This approach can achieve accuracy close to that of ANNs. However, for input data with long time steps, BPTT requires unfolding the entire time sequence and propagating errors step by step, which increases computational complexity. Additionally, in learning long sequences, BPTT may encounter issues such as vanishing or exploding gradients. The training algorithm we used was the BPTT algorithm.

2.2. Gating Mechanism

In biological neural networks, the synaptic strength between neurons can be dynamically adjusted during learning and memory processes. The gating mechanism is inspired by the information processing mechanism of the brain, enhancing neural networks’ ability to handle complex tasks. In 1997, Hochreiter and Schmidhuber introduced the long short-term memory (LSTM) network [18], the first to incorporate a gating mechanism into neural networks. LSTM uses three types of gates: the forget gate, the input gate, and the output gate. The forget gate controls how much previous information is retained, the input gate regulates the introduction of new information, and the output gate determines the output based on the cell state. By using these three gates, LSTM controls the flow of information, enabling the network to memorize long-term dependencies. In 2014, Cho et al. proposed the gated recurrent unit (GRU) [19], which simplifies the LSTM gating structure. GRU combines the forget and input gates of LSTM into a single update gate and uses a reset gate to control the effect of the previous state. By reducing the number of gates from three to two, GRU maintains the functionality of the gating mechanism while effectively reducing computational complexity. In 2016, Dauphin et al. proposed the gated linear unit (GLU) [20] in deep convolutional networks. GLU combines the gating mechanism with a linear transformation by passing one part of the input through a linear transformation and the other part through a gating function. The outputs of these two parts are then multiplied to produce the final output. GLU controls the flow of input information through the gating mechanism, improving both the expressiveness and computational efficiency of the model, and has shown strong performance in text and sequential data tasks. Qiu et al. proposed gated attention coding (GAC) [21], which utilizes a multidimensional gated attention unit to encode the input of SNNs, improving the temporal dynamics and encoding efficiency of SNNs. However, GAC is only used as a preprocessing layer, responsible for re-encoding the input data of the SNN, replacing the direct encoding commonly used in most deep SNNs. GAC cannot be integrated into the internal structure of SNNs, and therefore, in the SNN model it uses, the gating mechanism is not utilized to control and process the internal data flow within the network. Our proposed method integrates the gating mechanism into the internal structure of the SNN, enabling gated processing during its computation.

2.3. Attention Spiking Neural Networks

The impressive performance of deep neural networks (DNNs) in tasks such as natural language processing [7] and computer vision [22], particularly those based on the attention mechanism, has sparked interest in integrating the attention mechanism with SNNs. Yao et al. [23] introduced a TA mechanism into SNNs by assigning attention weights to spike sequences along the temporal dimension. In subsequent research, Yao et al. [24] proposed a multidimensional attention module based on the channel and spatial attention mechanisms of the Convolutional Block Attention Module (CBAM). This module assigns attention weights to spike sequences across the temporal, channel, and spatial dimensions, using these weights to optimize the membrane potential and adjust the spike response, resulting in significant improvements in both performance and energy efficiency. These methods improve the performance of SNNs through the attention mechanism, but their performance still lags behind that of Transformer architectures, and the computation of these attention mechanisms is relatively complex.
Zhou et al. [11] proposed SSA, which calculates the attention matrix using spike matrices without the need for a softmax function to ensure non-negativity of the attention matrix, thus saving computational cost in the attention mechanism. Based on this, they developed the Spikformer model, which greatly enhanced the performance of SNNs. Building on Spikformer, Zhou et al. [25] further developed the Spikingformer model, introducing a spike-driven residual structure. In Spikingformer, the residual structure from Spikformer was reordered by placing the neuron layers before the convolutional layers, reducing the non-spiking computations in Spikformer and thus lowering energy consumption. Yao et al. [26] proposed the Spike-Driven Transformer, which introduces a novel Spike-Driven Self-Attention (SDSA) mechanism. This mechanism uses only masking and addition operations, eliminating multiplication and significantly reducing computational energy consumption. In SDSA, matrix multiplications related to the spike matrices are transformed into sparse additions, and matrix multiplication between queries, keys, and values is performed using masking and addition. Additionally, modifications were made to the residual structure, effectively reducing computational energy compared to standard self-attention. Despite the continuous improvements in the performance of the Spike-Driven Transformer, there remains a performance gap compared to ANNs. On the large-scale ImageNet dataset, Spikformer achieves an accuracy of 74.81%, and the accuracy of the Spike-Driven Transformer increases to 77.07%, but this still falls short of the accuracy of Transformers of a similar scale. This indicates that there is still room for further improvements in the Spike-Driven Transformer. Zhou et al. proposed QKFormer [27] and developed a novel Q-K attention mechanism, specifically designed for the spatiotemporal patterns of SNNs, which can easily model the importance of tokens or channel dimensions using binary values. The Q-K attention mechanism has linear complexity with respect to the number of tokens (or channels) and uses only two spike-based components: query and key. QKFormer achieves performance surpassing existing SNNs on various static and neuromorphic datasets with lower power consumption. In attention-based SNNs, increasing the scale of the network can enhance its performance, but this approach also leads to increased computational complexity and energy consumption. Currently, many researchers are optimizing attention-based SNNs by improving the attention computation methods. Both the Spike-Driven Transformer and QKFormer achieve performance improvements while reducing power consumption by enhancing spike-based attention. These spike-based Transformer methods have made significant progress in both performance and energy consumption, but they have not effectively controlled and filtered the data flow within the network.
The SGSAFormer model we propose combines the self-attention mechanism with a gating mechanism, enhancing the attention mechanism’s ability to filter and control temporal and spatial information within the data. Compared to existing methods, this is the first time that a gating mechanism has been applied within an SNNs model to control the data flow. Our approach enables control over the attention within the SNNs, advancing the further development of the spike attention mechanism. Through gating enhancement, the proposed SGSAFormer achieves improved performance over current attention-based SNNs.

3. Method

Based on Spikformer, we propose SGSAFormer, which incorporates the gating mechanism into the SNN architecture and controls the transfer of information within the neural network through this gating mechanism. The general framework of SGSAFormer is shown in Figure 1. SGSAFormer primarily consists of a Spiking Patch Splitting (SPS) module, an SGSA module, an SGLU module, and a linear classification head. The spiking neurons in SGSAFormer are Leaky Integrate-and-Fire (LIF) neurons, and the LIF neuron model is briefly introduced in Section 3.1. In Section 3.2, we design the TA module to select relevant input information in the time dimension. Section 3.3 and Section 3.4 describe the design of the SGLU and SGSA modules, respectively, which combine the gating mechanism with the SNN architecture.

3.1. Leaky Integrate-and-Fire Neuron Model

The spiking neuron model is a simplification of the biological neuron model. The LIF neuron model [28] is widely used in SNNs [29] by virtue of its simplicity and computational efficiency, which make it easier to implement on computers. The neuron dynamics of the LIF model are defined by the following equations:
$$H[t] = V[t-1] + \frac{1}{\tau}\left(X[t] - \left(V[t-1] - V_{reset}\right)\right)$$
$$S[t] = \Theta\left(H[t] - V_{th}\right)$$
$$V[t] = V_{reset}\, S[t] + H[t]\left(1 - S[t]\right)$$
where $\tau$ is the membrane time constant (the time decay factor of the membrane potential), and $X[t]$ is the input current at time step $t$. $H[t]$ denotes the membrane potential accumulated at time step $t$, and $V[t]$ denotes the membrane potential after the spike is emitted at time step $t$. $\Theta(v)$ is the Heaviside step function, with $\Theta(v) = 1$ if $v \geq 0$ and $\Theta(v) = 0$ if $v < 0$. $V_{th}$ denotes the firing threshold, and $S[t]$ is the neuron output at time step $t$. Its value is either 1 or 0, indicating whether or not a spike is generated: $S[t]$ is 1 when $H[t] \geq V_{th}$, and 0 otherwise. $V_{reset}$ represents the reset potential after the neuron emits a spike. Researchers have made various improvements to the LIF neuron model to improve performance. The Parametric Leaky Integrate-and-Fire (PLIF) [30] model introduces a learnable time decay factor, and the k-based Leaky Integrate-and-Fire (KLIF) [31] model proposes a learnable neuron activation threshold. In this paper, we use the basic LIF model as the neuron model.
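As a minimal sketch of the three update equations (charge, fire, reset), the following plain-Python loop simulates a single LIF neuron. The function name `lif_simulate`, the parameter values (`tau=2.0`, `v_th=1.0`, `v_reset=0.0`), and the choice of starting the membrane potential at the reset value are illustrative assumptions, not settings taken from the paper.

```python
def lif_simulate(inputs, tau=2.0, v_th=1.0, v_reset=0.0):
    """Simulate one LIF neuron over a sequence of input currents X[t].

    Parameter defaults are illustrative assumptions.
    """
    v = v_reset  # V[t-1]; assumed to start at the reset potential
    spikes = []
    for x in inputs:
        h = v + (x - (v - v_reset)) / tau  # charge: H[t]
        s = 1 if h >= v_th else 0          # fire:   S[t] = Theta(H[t] - V_th)
        v = v_reset if s else h            # reset:  V[t]
        spikes.append(s)
    return spikes


spikes = lif_simulate([1.5, 1.5, 1.5, 0.0, 3.0])
print(spikes)  # [0, 1, 0, 0, 1]
```

The neuron needs two consecutive inputs of 1.5 to cross the threshold, is reset to zero after firing, and leaks toward the reset potential when the input is zero.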

3.2. Temporal Attention

Dynamic vision sensors (DVS) differ from conventional cameras in that they do not capture static images but generate a stream of events by monitoring the change in brightness of each pixel. With their high temporal resolution, low latency, and low power consumption, DVS cameras excel in fields such as autonomous driving and motion recognition. Owing to their asynchronous, event-driven nature, SNNs have an advantage over ANNs in processing DVS data. To process the data from a DVS event camera, SNNs first integrate the events over time to form a stream of frames along the timeline. Because a DVS camera generates events only where changes occur, the more events a given frame contains, the more change occurred at that moment. Biologically, the human eye tends to focus on actions or events involving more change in the field of view. We therefore propose a TA module that assigns weights to time frames based on the number of events they contain, allowing the model to focus on the frames with a greater number of events. The structure of the TA module is shown in Figure 2.
As can be seen in Figure 2, data frames with a higher number of events contain more information; more weight is assigned to these data frames, and their importance is higher. On the contrary, frames with a lower number of events contain less information. The way the weights are assigned to each frame is defined by the following equation:
$$W_t = \mathrm{Sigmoid}\left(s \cdot \frac{n_t}{n_{frame}}\right)$$
where $W_t$ represents the weight of frame $t$, $s$ denotes the scaling factor, $n_t$ denotes the rank of frame $t$ when the input data frames are sorted in ascending order of their number of events, and $n_{frame}$ denotes the number of frames in the input data. Figure 3 shows how the TA module assigns weights.
The sigmoid function represents a nonlinear relationship in biology that gradually changes towards saturation, a characteristic associated with regulation and feedback in many biological phenomena. The human eye tends to focus more on actions or events with significant changes in the field of view, while less attention is paid to static or low-change actions or events, which aligns with the mathematical properties of the sigmoid function. Additionally, the sigmoid function for weight calculation concentrates the weights at the two ends (0 and 1), allowing the network model to focus more on the time frames with a higher number of events, thus enabling attention allocation along the temporal dimension of the input data.
In the training and inference process of the network, the TA module is placed after the input layer to assign weights to the input data frame stream, and a frame with a smaller weight value indicates that the frame contains less information. The human eye tends to focus its attention on the scene in the field of view with large dynamic changes. Based on this principle, we add a threshold value to the TA module to filter the input frames with weights less than the threshold value and retain the data frames with weights greater than the threshold value.
$$W_t = \begin{cases} W_t, & \text{if } W_t \geq W_{th} \\ 0, & \text{if } W_t < W_{th} \end{cases}$$
where $W_{th}$ represents the weight threshold. This effectively reduces the number of time steps of the SNN and greatly reduces the amount of computation, while the accuracy does not drop dramatically because the main information is retained.
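The weighting and thresholding steps of the TA module can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, ranks are assumed to start at 1 for the frame with the fewest events, ties are broken by frame order, and the event counts and `scale=2.0` are made-up example values. Only the threshold 0.7 matches the value the experiments later select.

```python
import math


def temporal_attention_weights(event_counts, scale=1.0):
    """W_t = sigmoid(scale * n_t / n_frame), with n_t the ascending rank of frame t."""
    n_frame = len(event_counts)
    # Frame indices sorted by event count, fewest events first (stable for ties).
    order = sorted(range(n_frame), key=lambda t: event_counts[t])
    rank = [0] * n_frame
    for position, t in enumerate(order, start=1):  # assumed: ranks start at 1
        rank[t] = position
    return [1.0 / (1.0 + math.exp(-scale * rank[t] / n_frame))
            for t in range(n_frame)]


def filter_frames(weights, w_th):
    """Keep weights that reach the threshold and zero out the rest."""
    return [w if w >= w_th else 0.0 for w in weights]


event_counts = [120, 40, 300, 10]  # hypothetical events per frame
weights = temporal_attention_weights(event_counts, scale=2.0)
kept = filter_frames(weights, w_th=0.7)
active_frames = [t for t, w in enumerate(kept) if w > 0.0]
print(active_frames)  # [0, 1, 2]
```

In this example the frame with only 10 events receives the lowest weight (about 0.62) and is dropped, so a network consuming `kept` would only need to process three of the four frames.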

3.3. Spike Gated Linear Unit

The MLP is a basic structure of neural networks and is widely used in a variety of neural network models. Spikformer implements a spike MLP by placing a layer of LIF neurons after a linear layer in place of an activation function. The gating mechanism can dynamically regulate the information flow and enhance the ability of the model to handle complex tasks. We propose an SGLU by integrating the gating mechanism into the MLP structure in SNNs. The structure of the SGLU is shown in Figure 4. SGLU feeds the input data into two branches, each of which applies a fully connected (linear) layer. The outputs of the two branches are then passed through a spiking neuron layer to generate spike vectors. The two spike vectors are multiplied element-wise, acting as a filter to obtain the filtered information. The filtered information is then subjected to another linear transformation and passed through a spiking neuron layer to generate spike vectors for the next section of the network. In addition, SGLU uses a time-dimension batch normalization layer instead of the standard batch normalization layer. The time-dimension batch normalization layer normalizes the data along the time dimension, which better preserves the time-domain information in the data. The formulas of the SGLU are as follows:
$$U = \mathrm{SN}(\mathrm{TBN}(\mathrm{Linear}(X)))$$
$$V = \mathrm{SN}(\mathrm{TBN}(\mathrm{Linear}(X)))$$
$$O = \mathrm{SN}(\mathrm{TBN}(\mathrm{Linear}(U \odot V)))$$
where Linear is the fully connected layer, TBN is the time-dimension batch normalization layer, and SN is the spike neuron layer. SGLU is an improved spike MLP with the addition of a gating mechanism that improves the expressive power of the model.
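The data flow of the SGLU can be illustrated for a single token at a single time step. The sketch below makes several simplifying assumptions that are not from the paper: the TBN layer is omitted (treated as identity), the spiking neuron layer is replaced by a stateless threshold, the two branches use separate weight matrices, and all names and weight values are hypothetical.

```python
def linear(x, weight):
    """y = W x for a vector x and a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def spike(x, v_th=1.0):
    """Stateless stand-in for the spiking neuron layer: fire where x >= v_th."""
    return [1 if xi >= v_th else 0 for xi in x]


def sglu(x, w_u, w_v, w_o):
    """Gated spike MLP for one token; TBN is omitted in this sketch."""
    u = spike(linear(x, w_u))                   # first spike branch U
    v = spike(linear(x, w_v))                   # second spike branch V
    gated = [ui * vi for ui, vi in zip(u, v)]   # element-wise product of U and V
    return spike(linear(gated, w_o))            # output spikes O


x = [1.0, 0.5]
w_u = [[1, 0], [0, 1], [1, 1]]   # hypothetical weights, 2 -> 3
w_v = [[1, 1], [2, 0], [0, 1]]   # hypothetical weights, 2 -> 3
w_o = [[1, 1, 1], [0, 1, 1]]     # hypothetical weights, 3 -> 2
out = sglu(x, w_u, w_v, w_o)
print(out)  # [1, 0]
```

Because both branches emit binary spikes, their element-wise product is 1 only where both fire, so one branch acts as a mask on the other: here U = [1, 0, 1] and V = [1, 1, 0] leave only the first hidden unit active.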

3.4. Spike Gated Self-Attention

Having observed with SGLU that the gating mechanism can play an important role in SNNs, we propose SGSA as the base component of SGSAFormer. This module combines the gating mechanism with spiking self-attention (SSA) to enhance the ability of the attention mechanism to filter and control information, which helps the model focus on more important features and ignore irrelevant information when processing data. As in SGLU, the TBN layer is used in SGSA instead of the BN layer. The structure of SGSA is shown in Figure 1. SGSA adds a gating mechanism to the SSA. SGSA generates four spike matrices: query (Q), key (K), value (V), and gate (G). The attention matrix is computed by performing a dot product on Q, K, and V, and the resulting attention matrix is then multiplied element-wise with the spike matrix G. Through the G matrix, the attention matrix generated by Q, K, and V is filtered and controlled to obtain the gated attention matrix:
$$Q = \mathrm{SN}(\mathrm{TBN}(\mathrm{Linear}(X)))$$
$$K = \mathrm{SN}(\mathrm{TBN}(\mathrm{Linear}(X)))$$
$$V = \mathrm{SN}(\mathrm{TBN}(\mathrm{Linear}(X)))$$
where $Q, K, V \in \mathbb{R}^{T \times N \times D}$. The generated spike matrices Q, K, and V are combined by dot products and scaled by a factor $s$ to generate the attention matrix:
$$\mathrm{SSA}(Q, K, V) = \mathrm{SN}(Q K^{T} V \cdot s)$$
$$G = \mathrm{SN}(\mathrm{TBN}(\mathrm{Linear}(X)))$$
$$\mathrm{SGSA} = \mathrm{SN}(\mathrm{TBN}(\mathrm{Linear}(G \odot \mathrm{SSA})))$$
The gating matrix G and SSA are multiplied element-wise and then passed through a linear layer, a TBN layer, and a spike neuron layer to obtain SGSA. In this way, SGSA uses gating to enhance the attention mechanism.
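To make the order of operations concrete, the following sketch evaluates the gated attention for one time step and one head, starting from spike matrices Q, K, V, and G that are assumed to have been produced already. As in any toy version, several things are assumptions rather than the paper's settings: TBN is omitted, the spiking neuron layer is a stateless threshold, the output projection `w_out` is an identity matrix, the scaling factor is set to `s=1.0`, and all helper names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def spike(m, v_th=1.0):
    """Stateless stand-in for the spiking neuron layer."""
    return [[1 if x >= v_th else 0 for x in row] for row in m]


def sgsa(q, k, v, g, w_out, s=1.0):
    """Gated spiking self-attention for one time step (TBN omitted)."""
    attn = matmul(matmul(q, transpose(k)), v)              # Q K^T V, shape N x D
    ssa = spike([[x * s for x in row] for row in attn])    # SN(Q K^T V * s)
    gated = [[gi * ai for gi, ai in zip(g_row, a_row)]     # element-wise gate
             for g_row, a_row in zip(g, ssa)]
    return spike(matmul(gated, w_out))                     # Linear, then SN


# N = 2 tokens, D = 2 channels, binary spike inputs (made-up values).
q = [[1, 0], [1, 1]]
k = [[1, 1], [0, 1]]
v = [[1, 0], [1, 1]]
g = [[1, 1], [0, 1]]
w_out = [[1, 0], [0, 1]]
out = sgsa(q, k, v, g, w_out)
print(out)  # [[1, 0], [0, 1]]
```

No softmax is needed because every entry of Q K^T V is a non-negative count. In this example the ungated attention spikes are [[1, 0], [1, 1]], and the zero in G suppresses the first channel of the second token.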

4. Results

In this section, we evaluate the application of the TA module in the spiking convolutional neural network (SCNN) model and the Spikformer model to assess the generality and performance of the TA module. Additionally, we evaluate the performance of the SGSAFormer model on the neuromorphic datasets CIFAR10DVS [32], DVS128Gesture [33], and N-Caltech101 [34]. We retrain SGSAFormer and compare its performance with existing SNNs models. The results demonstrate that SGSAFormer achieves state-of-the-art performance on the neuromorphic datasets. Furthermore, we conduct ablation experiments on the CIFAR10DVS dataset to investigate the impact of different components in SGSAFormer and evaluate its performance at various time steps and with different network parameters. The deep learning frameworks used are PyTorch version 1.13.1 and SpikingJelly version 0.0.0.0.12 [35]. The GPU used for experiments is the NVIDIA GeForce RTX 3090. The optimizer used is AdamW, with an initial learning rate of 0.01 and 200 epochs.
The number of floating-point operations (FLOPs) is often used to estimate the computational cost of ANNs. If the FLOPs of an ANN are known, the computational cost of the corresponding SNN can be estimated by combining the FLOPs with the spike-firing rate (SFR) and the number of time steps. Assuming that the MAC and AC operations are performed on 45 nm hardware [36], with $E_{MAC}$ = 4.6 pJ and $E_{AC}$ = 0.9 pJ, the energy consumption of the network can be estimated. We use this method to calculate the energy consumption of our approach and model.
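One common way to turn this description into numbers is to count synaptic operations as FLOPs × SFR × time steps and charge each one at the AC cost. The sketch below follows that convention as an assumption, since the exact per-layer formula is not spelled out here. It also simplifies by using a single network-wide SFR and treating every operation in the SNN as an AC. The function names and the example network (1 GFLOP, SFR of 0.1, 4 time steps) are hypothetical.

```python
E_MAC = 4.6e-12  # joules per multiply-and-accumulate (45 nm, from the text)
E_AC = 0.9e-12   # joules per accumulate (45 nm, from the text)


def ann_energy(flops):
    """Energy of an ANN forward pass, charging every operation as a MAC."""
    return flops * E_MAC


def snn_energy(flops, sfr, time_steps):
    """Assumed estimate: synaptic operations = FLOPs * SFR * T, each an AC."""
    synaptic_ops = flops * sfr * time_steps
    return synaptic_ops * E_AC


ann_mj = ann_energy(1e9) * 1e3                          # about 4.6 mJ
snn_mj = snn_energy(1e9, sfr=0.1, time_steps=4) * 1e3   # about 0.36 mJ
print(f"ANN: {ann_mj:.2f} mJ, SNN: {snn_mj:.2f} mJ")
```

Under these assumptions the SNN is cheaper whenever SFR × T × 0.9 is below 4.6, which is why both a lower firing rate and fewer time steps translate directly into energy savings.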

4.1. Datasets

DVS128Gesture is a neuromorphic dataset used for gesture recognition, consisting of 11 different gesture categories. These gestures were collected from 29 individuals under three distinct lighting conditions. The dataset includes a total of 1464 samples, with 1176 samples designated as the training set and 288 samples as the test set.
CIFAR10DVS is one of the larger visual neuromorphic datasets, containing 10,000 event streams, which were converted from 10,000 image frames using DVS. It includes 10 categories, with 1000 samples per category. Typically, researchers split the first 900 samples of each category for the training set and use the remaining 100 samples for the test set. In our experiments, we also adopted a 9:1 split ratio.
N-Caltech101 is a neuromorphic version of the original Caltech101 dataset, specifically designed for SNNs and other neuromorphic computing systems. N-Caltech101 contains images from 101 object categories, ranging from animals to everyday objects, captured using a DVS, making it suitable for classification tasks.

4.2. Temporal Attention Experimental Result

We add the TA module in front of two models, SCNN and Spikformer, and conduct experiments on the DVS128Gesture dataset. The SCNN model consists of five convolutional layers and three fully connected layers, with classification performed by the fully connected layers after multiple convolutional and pooling operations. Each layer is followed by a spiking neuron layer, which replaces the activation function using the LIF neuron model. The Spikformer model uses two encoder modules with an embedding dimension of 256.
We first tested the impact of the TA module with different thresholds on the SCNN network, and the experimental results are shown in Figure 5. The experiment shows that as the threshold increases, more input data are filtered, resulting in lower energy consumption for the entire network. However, when the threshold becomes too large, it filters out valid information, causing the network’s performance to degrade. Therefore, considering both performance and energy consumption, a threshold of 0.7 is selected for the experiment. The experimental results are shown in Table 1.
The addition of the TA module to the SCNN model results in a 0.69% decrease in accuracy. However, the TA module enables the SNNs to filter the input data frames, reducing the number of data frames for which computation is performed. This corresponds to a reduction in the time step of the network from 16 to 5, leading to significant savings in computation and power consumption compared to the original model. When compared to the SCNN model with five time steps, the accuracy of the SCNN model with the TA module is improved by 3.78%. The same trend is observed in the Spikformer model. The addition of the TA module reduces the accuracy of the Spikformer model with 16 time steps by 1.77%, but when compared to the Spikformer model with 5 time steps, the accuracy improves by 1.74%. In addition, we estimated the energy consumption of the network for testing a single image on the DVS128Gesture dataset using the method described in Section 4. The results show that the use of the TA module effectively reduces the network’s energy consumption. Specifically, using the TA module in SCNN reduced the energy consumption by 53%, while in Spikformer, the energy consumption was reduced by 59%. We calculate the spike-firing rate (SFR) before and after the addition of the TA module in both models. The SFR is given by
$$SFR^{t} = \frac{n_{fire}^{t}}{N_{neuron}}$$
where $SFR^{t}$ denotes the SFR at model time step $t$, $n_{fire}^{t}$ denotes the number of neurons firing a spike at time step $t$, and $N_{neuron}$ denotes the total number of neurons in the model. The experimental results are shown in Figure 6. The SFR decreases when the TA module is added to both models. This is due to the fact that the TA module reduces the input of unimportant data. The decrease in SFR leads to a reduction in the computational effort of the model.
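As a small illustration of the SFR definition, the following sketch computes the rate per time step from a binary spike record. The layout `spikes[t][i]` (1 if neuron i fired at time step t) and the toy values are assumptions made for this example only.

```python
from typing import List, Sequence


def spike_firing_rate(spikes: Sequence[Sequence[int]]) -> List[float]:
    """Return the SFR for every time step.

    spikes[t][i] is 1 if neuron i fired at time step t, otherwise 0.
    """
    rates = []
    for step in spikes:
        n_neuron = len(step)   # total number of neurons in the model
        n_fire = sum(step)     # number of neurons that fired at this time step
        rates.append(n_fire / n_neuron)
    return rates


# Toy record: 3 time steps, 8 neurons (illustrative values only).
toy_spikes = [
    [1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1],
    [1, 1, 0, 1, 0, 1, 0, 0],
]
print(spike_firing_rate(toy_spikes))  # [0.25, 0.125, 0.5]
```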
The experiments demonstrate that the TA module achieves significant computational savings with only a slight reduction in accuracy, while outperforming the same model run with a correspondingly reduced number of time steps. Moreover, the results for the two different SNNs are consistent, indicating that the TA module is applicable to different SNN models and can serve as a general-purpose network module.
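The accuracy differences and energy savings quoted in this section can be re-derived from the Table 1 values. The sketch below does only this arithmetic; the dictionary layout and the helper name `relative_reduction` are hypothetical.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# (accuracy in %, energy in mJ), copied from Table 1.
table1 = {
    "SCNN, T=16": (94.79, 1.272),
    "SCNN, T=5": (90.32, 0.486),
    "SCNN-TA": (94.10, 0.601),
    "Spikformer, T=16": (98.30, 3.055),
    "Spikformer, T=5": (94.79, 0.954),
    "Spikformer-TA": (96.53, 1.253),
}

for base in ("SCNN", "Spikformer"):
    acc_16, energy_16 = table1[f"{base}, T=16"]
    acc_5, _ = table1[f"{base}, T=5"]
    acc_ta, energy_ta = table1[f"{base}-TA"]
    print(base)
    print(f"  accuracy drop vs. T=16: {acc_16 - acc_ta:.2f} points")
    print(f"  accuracy gain vs. T=5:  {acc_ta - acc_5:.2f} points")
    print(f"  energy saving vs. T=16: {relative_reduction(energy_16, energy_ta):.0f}%")
```

Note that the accuracy figures are differences in percentage points, whereas the energy figures are relative reductions.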

4.3. SGSAFormer Experimental Result

We tested SGSAFormer on the DVS128Gesture, CIFAR10DVS, and N-Caltech101 datasets. For the DVS128Gesture and N-Caltech101 datasets, the number of encoders was set to two and the patch embedding dimension to 256. For the CIFAR10DVS dataset, the number of encoders was set to one and the patch embedding dimension to 512. The experimental model used four SPSs and 16 attention heads. The notation SGSAFormer-L-D encodes the model hyperparameters, where L is the number of Transformer encoder blocks and D is the patch embedding dimension. All experiments were performed with 16 time steps. The results are shown in Table 2.
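The L-D naming convention can be made concrete with a short sketch that splits an architecture tag into its parts. The class `ModelConfig` and the function `parse_model_tag` are hypothetical helpers for illustration and are not part of any released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    name: str          # model family, e.g. "SGSAFormer"
    num_encoders: int  # L: number of Transformer encoder blocks
    embed_dim: int     # D: patch embedding dimension


def parse_model_tag(tag: str) -> ModelConfig:
    """Split a tag of the form <name>-<L>-<D> into its components."""
    name, num_encoders, embed_dim = tag.rsplit("-", 2)
    return ModelConfig(name, int(num_encoders), int(embed_dim))


print(parse_model_tag("SGSAFormer-2-256"))  # setting for DVS128Gesture and N-Caltech101
print(parse_model_tag("SGSAFormer-1-512"))  # setting for CIFAR10DVS
```

Splitting from the right keeps family names that themselves contain hyphens, such as the ablation variant with SGLU, in one piece.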
As shown in Table 2, on the CIFAR10DVS dataset, SGSAFormer-1-512 achieves 85.0% accuracy at 16 time steps, which is 2.4 percentage points higher than SGLFormer, 5.1 points higher than STSformer, and 5.0 points higher than S-Transformer at the same number of time steps. On the DVS128Gesture dataset, SGSAFormer-2-256 keeps the remaining hyperparameters the same as on the CIFAR10DVS dataset. For the same number of time steps, SGSAFormer-2-256 is 0.7 points more accurate than Spikformer and Spikingformer, 0.3 points more accurate than STSformer, and 0.4 points more accurate than SGLFormer. Its accuracy is slightly lower than that of S-Transformer (99.0% versus 99.3%). On the N-Caltech101 dataset, SGSAFormer outperforms TCJA-SNN by 1.4 points, but does not surpass the 87.1% accuracy of the ANN-based MVF-Net. These results show that the proposed SGSAFormer model achieves strong performance on neuromorphic datasets and that the gating mechanism is effective in SNNs.
In addition, we estimated the energy consumption of SGSAFormer using the method described in Section 4. The results are shown in Table 3. SGSAFormer consumes 1.1 mJ more energy than a Spikformer of the same scale (12.68 mJ versus 11.58 mJ), which is due to the added gating computation. However, its energy consumption is 67% lower than that of the ANN Transformer (38.34 mJ). Thus, although SGSAFormer requires more energy and computation than Spikformer, it achieves higher accuracy.
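For orientation, the sketch below shows an energy model that is widely used for SNNs. It is an assumption here, since the estimation method of Section 4 is not reproduced in this part of the article. In this model, an ANN layer is charged one multiply-accumulate (MAC) per FLOP, and a spiking layer one accumulate (AC) per synaptic operation, with synaptic operations = firing rate × time steps × FLOPs. The per-operation energies of 4.6 pJ (MAC) and 0.9 pJ (AC) are the 45 nm figures commonly quoted from [36]. The function names and the example layer are hypothetical.

```python
E_MAC_PJ = 4.6  # assumed energy per multiply-accumulate, in pJ
E_AC_PJ = 0.9   # assumed energy per accumulate, in pJ
PJ_TO_MJ = 1e-9  # 1 pJ = 1e-12 J = 1e-9 mJ


def ann_layer_energy_mj(flops: float) -> float:
    """Energy of an ANN layer: one MAC per FLOP."""
    return flops * E_MAC_PJ * PJ_TO_MJ


def snn_layer_energy_mj(flops: float, firing_rate: float, time_steps: int) -> float:
    """Energy of a spiking layer: one AC per synaptic operation."""
    synaptic_ops = firing_rate * time_steps * flops
    return synaptic_ops * E_AC_PJ * PJ_TO_MJ


# Hypothetical layer with 1e9 FLOPs, a firing rate of 0.2, and 4 time steps.
print(f"ANN layer: {ann_layer_energy_mj(1e9):.2f} mJ")
print(f"SNN layer: {snn_layer_energy_mj(1e9, 0.2, 4):.2f} mJ")
```

Under this model, the energy of a spiking layer scales with both the firing rate and the number of time steps, which is consistent with the observation that extra gating computation raises energy consumption while sparse spikes keep it well below the ANN figure.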

4.4. Ablation Experiment Result

To verify the effectiveness of the proposed method, we performed ablation experiments on the CIFAR10DVS dataset, because the accuracy on the DVS128Gesture dataset is already close to 100%, which leaves little room for further improvement. We compared three models: Spikformer as the baseline; Spikformer-SGLU, in which the MLP in the Transformer encoder of the baseline is replaced with SGLU; and SGSAFormer, which further adds the SGSA module to Spikformer-SGLU. We tested two different hyperparameter settings. As in Section 4.3, the notation SGSAFormer-L-D encodes the model hyperparameters, where L is the number of Transformer encoders and D is the patch embedding dimension. All other hyperparameters remained the same across the models. Additionally, we tested three different numbers of time steps. The results are shown in Table 4.
The results show that both the SGLU and SGSA modules improve the performance of the baseline model at all tested time steps, which confirms the effectiveness of the proposed modules. Figure 7 shows example attention maps of the last encoder block in SGSAFormer and Spikformer. SGSAFormer already generates attention at the first time step, whereas Spikformer fails to do so. Additionally, SGSAFormer captures the image regions relevant to classification more effectively than Spikformer, demonstrating that the gating mechanism strengthens the attention mechanism.
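The difference between the spike MLP and SGLU can be sketched for a single time step, following the description given in the caption of Figure 4: two different linear transformations of the input produce spike vectors that are multiplied element-wise. This is a simplified illustration under stated assumptions, not the implementation used in the experiments. A Heaviside threshold stands in for the LIF neuron, the output projection is assumed, and all weights are made-up values.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(weight: Matrix, x: Vector) -> Vector:
    """Matrix-vector product (bias omitted for brevity)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def spike(x: Vector, v_th: float = 1.0) -> Vector:
    """Heaviside firing rule, a single-step stand-in for a LIF neuron."""
    return [1.0 if v >= v_th else 0.0 for v in x]


def spike_mlp(x: Vector, w_in: Matrix, w_out: Matrix) -> Vector:
    """Spike MLP: one hidden spike vector, then an output projection."""
    return linear(w_out, spike(linear(w_in, x)))


def sglu(x: Vector, w_value: Matrix, w_gate: Matrix, w_out: Matrix) -> Vector:
    """SGLU sketch: the gate branch masks the value branch element-wise."""
    value = spike(linear(w_value, x))
    gate = spike(linear(w_gate, x))
    gated = [v * g for v, g in zip(value, gate)]
    return linear(w_out, gated)


# Made-up input spikes and weights (illustrative values only).
x = [1.0, 0.0, 1.0]
w_value = [[0.6, 0.2, 0.5], [0.3, 0.9, 0.2], [0.7, 0.1, 0.4]]
w_gate = [[0.8, 0.1, 0.3], [0.9, 0.4, 0.6], [0.2, 0.5, 0.1]]
w_out = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

print(spike_mlp(x, w_value, w_out))    # [4.0, 10.0]
print(sglu(x, w_value, w_gate, w_out))  # [1.0, 4.0]
```

In this toy case, the value branch fires on hidden units 0 and 2, but the gate branch fires only on units 0 and 1, so only unit 0 passes through. Because both branches are binary spikes, the element-wise product acts as a logical AND that depends on the input.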

5. Conclusions

In this study, we propose a TA module, based on the biological mechanism of the human eye, that drastically reduces computation and energy consumption while minimizing the impact on model performance. We introduce the gating mechanism from ANNs into SNNs, aiming to use it to control and filter the complex spatio-temporal data in SNNs. We propose the SGLU to improve the expressive power of the model. We design the SGSA module on the basis of the SSA module; through the gating mechanism, it strengthens the control of the spike attention mechanism over temporal data and thus helps the model extract temporal information. By combining SGLU and SGSA, we obtain the SGSAFormer model and evaluate it on neuromorphic datasets.
SGSAFormer has demonstrated excellent performance on multiple neuromorphic datasets, validating the effectiveness of the gating mechanism in the SNN domain. However, the gating mechanism used in SGSAFormer is relatively basic. In the field of ANNs, there are many more advanced gating mechanisms, and integrating them with SNNs could further enhance the performance and efficiency of SNNs. Spiking attention mechanisms are another important research direction, which could drive improvements in both the performance and the energy efficiency of spiking Transformer architectures. Compared to Spikformer, SGSAFormer achieves higher performance. However, the gating mechanism requires additional computation in the SNN, which is a limitation of our current approach. We hope that future research will focus on optimizing the gating mechanism to reduce this computational cost. Nonetheless, SGSAFormer can be deployed on neuromorphic chips, which mimic the event-driven and asynchronous computing characteristics of biological neural systems, to achieve low-power, high-performance edge computing. Looking ahead, the methods proposed in SGSAFormer have the potential to be applied to large-scale neural networks, making them suitable for more complex applications and advancing the development of intelligent systems.

Author Contributions

Conceptualization, S.G. and Y.Q.; methodology, S.G. and Y.Q.; validation, Y.Q. and Z.Z. (Zirui Zhao); formal analysis, Y.Q. and R.Z.; investigation, R.Z. and Z.Z. (Zirui Zhao); writing—original draft preparation, Y.Q.; writing—review and editing, H.Z. and Z.Z. (Zihao Zhu); visualization, H.Z. and Z.Z. (Zihao Zhu); supervision, S.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China, grant numbers 2022YFF1202500 and 2022YFF1202504.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Maass, W. Networks of spiking neurons: The third generation of neural network models. Neural Netw. 1997, 10, 1659–1671.
  2. Merolla, P.A.; Arthur, J.V.; Alvarez-Icaza, R.; Cassidy, A.S.; Sawada, J.; Akopyan, F.; Jackson, B.L.; Imam, N.; Guo, C.; Nakamura, Y.; et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science 2014, 345, 668–673.
  3. Davies, M.; Srinivasa, N.; Lin, T.H.; Chinya, G.; Cao, Y.; Choday, S.H.; Dimou, G.; Joshi, P.; Imam, N.; Jain, S.; et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 2018, 38, 82–99.
  4. Pei, J.; Deng, L.; Song, S.; Zhao, M.; Zhang, Y.; Wu, S.; Wang, G.; Zou, Z.; Wu, Z.; He, W.; et al. Towards artificial general intelligence with hybrid Tianjic chip architecture. Nature 2019, 572, 106–111.
  5. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762.
  6. Brown, T.B. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165.
  7. Devlin, J. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
  8. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
  9. Perrett, T.; Masullo, A.; Burghardt, T.; Mirmehdi, M.; Damen, D. Temporal-relational crosstransformers for few-shot action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 475–484.
  10. He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin transformer embedding UNet for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15.
  11. Zhou, Z.; Zhu, Y.; He, C.; Wang, Y.; Yan, S.; Tian, Y.; Yuan, L. Spikformer: When spiking neural network meets transformer. arXiv 2022, arXiv:2209.15425.
  12. Diehl, P.U.; Neil, D.; Binas, J.; Cook, M.; Liu, S.C.; Pfeiffer, M. Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN), Killarney, Ireland, 12–17 July 2015; IEEE: Piscataway Township, NJ, USA, 2015; pp. 1–8.
  13. Wang, Y.; Zhang, M.; Chen, Y.; Qu, H. Signed Neuron with Memory: Towards Simple, Accurate and High-Efficient ANN-SNN Conversion. In Proceedings of the IJCAI, Vienna, Austria, 23–29 July 2022; pp. 2501–2508.
  14. Diehl, P.U.; Cook, M. Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Front. Comput. Neurosci. 2015, 9, 99.
  15. Kheradpisheh, S.R.; Ganjtabesh, M.; Thorpe, S.J.; Masquelier, T. STDP-based spiking deep convolutional neural networks for object recognition. Neural Netw. 2018, 99, 56–67.
  16. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536.
  17. Kim, Y.; Panda, P. Optimizing deeper spiking neural networks for dynamic vision sensing. Neural Netw. 2021, 144, 686–698.
  18. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
  19. Cho, K. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078.
  20. Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language modeling with gated convolutional networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 933–941.
  21. Qiu, X.; Zhu, R.J.; Chou, Y.; Wang, Z.; Deng, L.J.; Li, G. Gated attention coding for training high-performance and efficient spiking neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 601–610.
  22. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
  23. Yao, M.; Gao, H.; Zhao, G.; Wang, D.; Lin, Y.; Yang, Z.; Li, G. Temporal-wise attention spiking neural networks for event streams classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10221–10230.
  24. Yao, M.; Zhao, G.; Zhang, H.; Hu, Y.; Deng, L.; Tian, Y.; Xu, B.; Li, G. Attention spiking neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9393–9410.
  25. Zhou, C.; Yu, L.; Zhou, Z.; Ma, Z.; Zhang, H.; Zhou, H.; Tian, Y. Spikingformer: Spike-driven residual learning for transformer-based spiking neural network. arXiv 2023, arXiv:2304.11954.
  26. Yao, M.; Hu, J.; Zhou, Z.; Yuan, L.; Tian, Y.; Xu, B.; Li, G. Spike-driven transformer. arXiv 2024, arXiv:2307.01694.
  27. Zhou, C.; Zhang, H.; Zhou, Z.; Yu, L.; Huang, L.; Fan, X.; Yuan, L.; Ma, Z.; Zhou, H.; Tian, Y. QKFormer: Hierarchical Spiking Transformer using QK Attention. arXiv 2024, arXiv:2403.16552.
  28. Izhikevich, E.M. Simple model of spiking neurons. IEEE Trans. Neural Netw. 2003, 14, 1569–1572.
  29. Deng, L.; Wu, Y.; Hu, X.; Liang, L.; Ding, Y.; Li, G.; Zhao, G.; Li, P.; Xie, Y. Rethinking the performance comparison between SNNs and ANNs. Neural Netw. 2020, 121, 294–307.
  30. Fang, W.; Yu, Z.; Chen, Y.; Masquelier, T.; Huang, T.; Tian, Y. Incorporating learnable membrane time constant to enhance learning of spiking neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2661–2671.
  31. Jiang, C.; Zhang, Y. KLIF: An optimized spiking neuron unit for tuning surrogate gradient slope and membrane potential. arXiv 2023, arXiv:2302.09238.
  32. Li, H.; Liu, H.; Ji, X.; Li, G.; Shi, L. CIFAR10-DVS: An event-stream dataset for object classification. Front. Neurosci. 2017, 11, 309.
  33. Amir, A.; Taba, B.; Berg, D.; Melano, T.; McKinstry, J.; Di Nolfo, C.; Nayak, T.; Andreopoulos, A.; Garreau, G.; Mendoza, M.; et al. A low power, fully event-based gesture recognition system. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7243–7252.
  34. Orchard, G.; Jayawant, A.; Cohen, G.K.; Thakor, N. Converting static image datasets to spiking neuromorphic datasets using saccades. Front. Neurosci. 2015, 9, 437.
  35. Fang, W.; Chen, Y.; Ding, J.; Yu, Z.; Masquelier, T.; Chen, D.; Huang, L.; Zhou, H.; Li, G.; Tian, Y. SpikingJelly: An open-source machine learning infrastructure platform for spike-based intelligence. Sci. Adv. 2023, 9, eadi1480.
  36. Horowitz, M. 1.1 Computing’s energy problem (and what we can do about it). In Proceedings of the 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), San Francisco, CA, USA, 9–13 February 2014; IEEE: Piscataway Township, NJ, USA, 2014; pp. 10–14.
  37. Bi, Y.; Chadha, A.; Abbas, A.; Bourtsoulatze, E.; Andreopoulos, Y. Graph-based spatio-temporal feature learning for neuromorphic vision sensing. IEEE Trans. Image Process. 2020, 29, 9084–9098.
  38. Gao, S.; Guo, G.; Huang, H.; Cheng, X.; Chen, C.P. An end-to-end broad learning system for event-based object classification. IEEE Access 2020, 8, 45974–45984.
  39. Wu, Z.; Zhang, H.; Lin, Y.; Li, G.; Wang, M.; Tang, Y. LIAF-Net: Leaky integrate and analog fire network for lightweight and efficient spatiotemporal information processing. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6249–6262.
  40. Deng, Y.; Chen, H.; Li, Y. MVF-Net: A multi-view fusion network for event-based object classification. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 8275–8284.
  41. Zheng, H.; Wu, Y.; Deng, L.; Hu, Y.; Li, G. Going deeper with directly-trained larger spiking neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 11062–11070.
  42. Fang, W.; Yu, Z.; Chen, Y.; Huang, T.; Masquelier, T.; Tian, Y. Deep residual learning in spiking neural networks. Adv. Neural Inf. Process. Syst. 2021, 34, 21056–21069.
  43. Wang, Y.; Shi, K.; Lu, C.; Liu, Y.; Zhang, M.; Qu, H. Spatial-Temporal Self-Attention for Asynchronous Spiking Neural Networks. In Proceedings of the IJCAI, Macao, China, 19–25 August 2023; pp. 3085–3093.
  44. Zhang, H.; Zhou, C.; Yu, L.; Huang, L.; Ma, Z.; Fan, X.; Zhou, H.; Tian, Y. SGLFormer: Spiking Global-Local-Fusion Transformer with high performance. Front. Neurosci. 2024, 18, 1371290.
  45. Kaiser, J.; Mostafa, H.; Neftci, E. Synaptic plasticity dynamics for deep continuous local learning (DECOLLE). Front. Neurosci. 2020, 14, 424.
  46. Jiang, B.; Li, Z.; Asif, M.S.; Cao, X.; Ma, Z. Event transformer. arXiv 2022, arXiv:2204.05172.
  47. Wu, X.; Song, Y.; Zhou, Y.; Jiang, Y.; Bai, Y.; Li, X.; Yang, X. STCA-SNN: Self-attention-based temporal-channel joint attention for spiking neural networks. Front. Neurosci. 2023, 17, 1261543.
  48. Zhu, R.J.; Zhang, M.; Zhao, Q.; Deng, H.; Duan, Y.; Deng, L.J. TCJA-SNN: Temporal-channel joint attention for spiking neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2024.
Figure 1. Overview of the SGSAFormer network. The overall architecture is similar to that of Spikformer, with two changes. First, we design the Spike Gated Linear Unit (SGLU) to replace the multi-layer perceptron (MLP) structure in the Spikformer encoder. Second, we propose the Spike Gated Self-Attention (SGSA) module to replace the original Spike Self-Attention (SSA) module. The gating mechanism is introduced to control the complex spatio-temporal information in spiking neural networks (SNNs).
Figure 2. Structure of the Temporal Attention (TA) module. The TA module assigns weights to the temporal frame streams of the dynamic vision sensor (DVS) data, enabling the network to focus on the more important parts of the temporal dimension and to ignore the unimportant parts when processing the data. The TA module helps the network utilize temporal information more efficiently and reduces the computational effort of the model.
Figure 3. Visualization of the TA module’s weight assignment. The top shows the input data integrated into temporal frames, and the bottom shows the weights assigned to each temporal frame. Frames containing more events are assigned higher weights, while frames with fewer events are assigned lower weights.
Figure 4. Structure of the spike MLP and SGLU modules. SGLU adds a gating mechanism to the MLP. In SGLU, the spike vectors generated by two different linear transformations of the input vectors are multiplied element-wise, which allows the module to dynamically adjust the input information. Compared with the MLP, SGLU improves the nonlinear expressive ability of the model.
Figure 5. Testing the impact of using TA modules with different thresholds in the spiking convolutional neural network (SCNN). By balancing the effects of the TA threshold on the network’s performance and energy consumption, an appropriate threshold is selected.
Figure 6. Comparison of spike-firing rate (SFR) before and after the addition of the TA module to SCNN and Spikformer. The addition of the TA module leads to a decrease in the SFR at different time steps of the model.
Figure 7. Attention maps for the last encoder block of SGSAFormer and Spikformer. The left figure averages the attention maps over all time steps and overlays them onto the original maps, with the black parts indicating regions of weaker attention. The right figure shows the attention maps of the last encoder block at different time steps.
Table 1. Performance of the TA module in two SNN models, compared with the same models at different time steps.

| Model | Architecture | T | Top-1 Acc (%) | Energy (mJ) |
|---|---|---|---|---|
| SCNN | 7-layer SCNN | 16 | 94.79 | 1.272 |
| SCNN | 7-layer SCNN | 5 | 90.32 | 0.486 |
| SCNN-TA | 7-layer SCNN | 16 | 94.1 | 0.601 |
| Spikformer | Spikformer-2-256 | 16 | 98.3 | 3.055 |
| Spikformer | Spikformer-2-256 | 5 | 94.79 | 0.954 |
| Spikformer-TA | Spikformer-2-256 | 16 | 96.53 | 1.253 |
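Table 2. The performance of our proposed model on the CIFAR10DVS, DVS128Gesture, and N-Caltech101 datasets, and a comparison with various methods using ANNs and SNNs.

| Dataset | Model | Method | Architecture | T | Top-1 Acc (%) |
|---|---|---|---|---|---|
| CIFAR10-DVS | RG-CNNs [37] | ANN | - | - | 54.0 |
| CIFAR10-DVS | MLS [38] | ANN | - | - | 58.8 |
| CIFAR10-DVS | LIAF-Net [39] | ANN | 7-layer CNN | 10 | 70.4 |
| CIFAR10-DVS | MVF-Net [40] | ANN | ResNet-34 | - | 76.2 |
| CIFAR10-DVS | tdBN [41] | SNN | ResNet-17 | 10 | 67.8 |
| CIFAR10-DVS | SEW-ResNet [42] | SNN | 9-layer SCNN | 16 | 74.4 |
| CIFAR10-DVS | Spikformer [11] | SNN | Spikformer-2-256 | 16 | 80.9 |
| CIFAR10-DVS | Spikingformer [25] | SNN | Spikingformer-2-256 | 16 | 81.3 |
| CIFAR10-DVS | S-Transformer [26] | SNN | S-Transformer-2-256 | 16 | 80.0 |
| CIFAR10-DVS | STSA [43] | SNN | STSformer-2-256 | 16 | 79.9 |
| CIFAR10-DVS | SGLFormer [44] | SNN | SGLFormer-2-256 | 16 | 82.6 |
| CIFAR10-DVS | This work | SNN | SGSAFormer-1-512 | 16 | 85.0 ± 0.5 |
| DVS128-Gesture | LIAF-Net [39] | ANN | 7-layer CNN | 60 | 97.4 |
| DVS128-Gesture | DECOLLE [45] | SNN | 8-layer SCNN | 500 | 95.5 |
| DVS128-Gesture | tdBN [41] | SNN | ResNet-17 | 10 | 96.9 |
| DVS128-Gesture | SEW-ResNet [42] | SNN | 9-layer SCNN | 16 | 97.9 |
| DVS128-Gesture | Spikformer [11] | SNN | Spikformer-2-256 | 16 | 98.3 |
| DVS128-Gesture | Spikingformer [25] | SNN | Spikingformer-2-256 | 16 | 98.3 |
| DVS128-Gesture | S-Transformer [26] | SNN | S-Transformer-2-256 | 16 | 99.3 |
| DVS128-Gesture | STSA [43] | SNN | STSformer-2-256 | 16 | 98.7 |
| DVS128-Gesture | SGLFormer [44] | SNN | SGLFormer-2-256 | 16 | 98.6 |
| DVS128-Gesture | This work | SNN | SGSAFormer-2-256 | 16 | 99.0 ± 0.03 |
| N-Caltech101 | RG-CNNs [37] | ANN | - | - | 65.7 |
| N-Caltech101 | MLS [38] | ANN | - | - | 72.7 |
| N-Caltech101 | MVF-Net [40] | ANN | ResNet-34 | - | 87.1 |
| N-Caltech101 | EventTransformer [46] | SNN | Event Transformer | 16 | 78.9 |
| N-Caltech101 | STCA-SNN [47] | SNN | VGG11 | 16 | 80.88 |
| N-Caltech101 | TCJA-SNN [48] | SNN | SCNN | 16 | 82.5 |
| N-Caltech101 | This work | SNN | SGSAFormer-2-256 | 16 | 83.9 ± 0.2 |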
Table 3. Comparison of energy consumption in different methods.

| Method | Architecture | Time Step | Energy (mJ) |
|---|---|---|---|
| ANN | Transformer-8-512 | 4 | 38.34 |
| Spikformer | Spikformer-8-512 | 4 | 11.58 |
| SGSAFormer | SGSAFormer-8-512 | 4 | 12.68 |
Table 4. Ablation experiments validating the contribution of each module of SGSAFormer. Performance on the CIFAR10DVS dataset using different hyperparameters and different time steps.

| Model | Architecture | Top-1 Acc (%), T = 4 | Top-1 Acc (%), T = 16 | Top-1 Acc (%), T = 20 |
|---|---|---|---|---|
| Spikformer | Spikformer-2-256 | 76.1 | 79.4 | 81.2 |
| Spikformer-SGLU | Spikformer-SGLU-2-256 | 76.3 | 80.4 | 82.0 |
| SGSAFormer | SGSAFormer-2-256 | 77.0 | 83.6 | 84.8 |
| Spikformer | Spikformer-1-512 | 77.5 | 80.0 | 80.6 |
| Spikformer-SGLU | Spikformer-SGLU-1-512 | 77.6 | 84.6 | 84.8 |
| SGSAFormer | SGSAFormer-1-512 | 78.1 | 85.5 | 85.3 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Citation

Gao, S.; Qin, Y.; Zhu, R.; Zhao, Z.; Zhou, H.; Zhu, Z. SGSAFormer: Spike Gated Self-Attention Transformer and Temporal Attention. Electronics 2025, 14, 43. https://doi.org/10.3390/electronics14010043
