1 The Hong Kong University of Science and Technology (Guangzhou), China
2 Northwestern University, IL, USA
Email: renjingxu@hkust-gz.edu.cn

Spiking Wavelet Transformer

Yuetong Fang (ORCID 0000-0003-0228-9082) 1, Ziqing Wang (ORCID 0009-0004-8940-0461) 1,2, Lingfeng Zhang (ORCID 0009-0006-0696-4363) 1, Jiahang Cao 1, Honglei Chen 1, Renjing Xu 🖂 (ORCID 0000-0002-0792-8974) 1
Equal Contribution; 🖂 Corresponding Author.
Abstract

Spiking neural networks (SNNs) offer an energy-efficient alternative to conventional deep learning by emulating the brain's event-driven processing. Incorporating Transformers with SNNs has shown promise for accuracy. However, such Spiking Transformers struggle to learn high-frequency patterns, such as moving edges and pixel-level brightness changes, because they rely on the global self-attention mechanism. Learning these high-frequency representations is challenging but essential for SNN-based event-driven vision. To address this issue, we propose the Spiking Wavelet Transformer (SWformer), an attention-free architecture that effectively learns comprehensive spatial-frequency features in a spike-driven manner by leveraging the sparse wavelet transform. The critical component is a Frequency-Aware Token Mixer (FATM) with three branches: 1) a spiking wavelet learner for spatial-frequency domain learning, 2) a convolution-based learner for spatial feature extraction, and 3) a spiking pointwise convolution for cross-channel information aggregation, with negative spike dynamics incorporated in 1) to enhance frequency representation. The FATM enables the SWformer to outperform vanilla Spiking Transformers in capturing high-frequency visual components, as evidenced by our empirical results. Experiments on both static and neuromorphic datasets demonstrate SWformer's effectiveness in capturing spatial-frequency patterns in a multiplication-free and event-driven fashion, outperforming state-of-the-art SNNs. SWformer achieves a 22.03% reduction in parameter count and a 2.52% performance improvement on the ImageNet dataset compared to vanilla Spiking Transformers. The code is available at: https://github.com/bic-L/Spiking-Wavelet-Transformer.

Keywords: Spiking Neural Networks · Wavelet Transform · Vision Transformer · Event-based Vision

1 Introduction

Spiking neural networks (SNNs) have gained considerable interest as a promising alternative to standard artificial neural networks (ANNs) [25, 78, 71, 26]. Inspired by biological neurons, SNNs process information via binary events called spikes. Neurons transmit spikes only when their accumulated membrane potential exceeds a firing threshold and otherwise remain inactive [58]. This sparse, event-driven processing can offer orders-of-magnitude gains in efficiency and performance over conventional computing paradigms, especially on low-power neuromorphic chips such as Loihi [12], TrueNorth [50], and Tianjic [54], which process spikes asynchronously. With these advantages, a growing body of research applies SNNs to tasks such as classification [48, 70, 83], object detection [8, 65], autonomous driving [84], and tracking [73, 33]. Despite their energy efficiency, SNNs still lag behind ANNs in accuracy, which remains a major challenge.

Figure 1: (a) Performance of SWformer and other SOTA SNN models in top-1 accuracy and energy consumption (details in the supplementary material), with marker size reflecting model size. (b) Fourier spectrum comparison between the Spiking Transformer with global attention [75] (top) and SWformer (bottom). Brighter colors indicate higher magnitudes. (c) Corresponding relative log amplitudes of Fourier-transformed feature maps. Panels (b-c) show that SWformer captures more high-frequency signals, leading to better performance.

To get the best of both worlds, a line of work focuses on combining advanced ANN architectures with the unique spiking mechanism of SNNs. This has led to notable developments. The introduction of residual learning into SNNs has facilitated deeper network architectures and thus enhanced their performance [18, 80, 30]. More recently, integrating attention mechanisms has improved the ability of SNNs to capture global information, strengthening their capability to handle intricate patterns [74, 79]. This success has motivated researchers to explore the potential of combining the powerful Transformer architecture with energy-efficient SNNs [83, 75]. While there has been some research in this direction, existing works mostly inherit the architecture of the Vision Transformer [67], which is known to function solely in the spatial domain and to exhibit characteristics similar to low-pass filters [53, 9].

As a representative branch of neuromorphic computing, SNNs mimic biological vision by continuously sampling the input data and independently generating spikes in response to changes in the visual scene [29], conveying abundant local information. Specifically, neuromorphic data captures only brightness changes, primarily moving edges, which represent high-frequency patterns [41]. As supported by the empirical comparisons in Fig. 1(b-c), Spiking Transformers are highly capable of handling low-frequency components, such as global shapes and structures, but are less effective at learning high-frequency information, which mainly consists of abrupt changes in images such as local edges and textures [62, 10]. This is intuitive, since self-attention, their primary mechanism, is a global operation that aggregates information across non-overlapping image patches. Porting frequency information into SNNs is a natural and appealing idea; however, it has been non-trivial due to the spike-driven nature of SNNs.

Frequency analysis methods, like the Fourier transform, rely on precise matrix multiplications [5], while SNNs use sparse, binary signaling with only a portion of neurons activated at any given time [12, 69]. This sparse, binary signaling mechanism presents a significant obstacle in devising a spiking equivalent to measure the frequency features accurately. To reduce the information loss, existing works have investigated the adoption of precise data encoding like time-to-value mapping [44, 45], though at the expense of high latency. We argue that time-frequency decomposition can be a more effective and efficient representation space for SNNs, considering their sparse and robust properties [40, 24]. In fact, the human visual system discerns elementary features through time-frequency components [21, 38]: it is found that the human visual system analyzes images in a way similar to the multi-resolution breakdown by the wavelet functions.

We propose the Spiking Wavelet Transformer (SWformer) to effectively capture time-frequency information in an event-driven manner. SWformer integrates the robustness of wavelet transforms with the energy efficiency of Spiking Transformers. As shown in Fig. 1, SWformer captures more high-frequency information than Spiking Transformers with global attention, significantly enhancing performance. It processes data in a multiplication-free, event-driven way compatible with neuromorphic hardware while effectively capturing spatial-frequency information. The main contributions of this paper are:

  • We propose SWformer, a novel attention-free architecture that integrates time-frequency information with Spiking Transformers, enabling feature perception across a wide frequency range in an event-driven manner.

  • A key component of SWformer is the Frequency-Aware Token Mixer (FATM), which processes input in three branches to learn spatial, frequency, and cross-channel representations, allowing it to capture more high-frequency visual information than vanilla Spiking Transformers.

  • We incorporate negative spike dynamics, a simple yet effective method supported by theoretical and experimental observations, to provide robust frequency representation in SNNs.

  • Extensive experiments show that our model significantly outperforms SOTA SNNs, achieving a 2.95% improvement on static datasets like ImageNet and a 4% increase on neuromorphic datasets such as CIFAR10-DVS.

2 Preliminary

2.1 Bio-inspired Spiking Neural Networks

SNNs are variants of ANNs that mimic the spatial-temporal dynamics and binary spike activations found in biological neurons [58, 75]. This spike-based temporal processing paradigm allows sparse yet efficient information transfer. However, the non-differentiable spike function hinders the use of gradient-based backpropagation to train SNNs effectively. Two main solutions exist: ANN-to-SNN conversion [6, 16] and direct training [80, 49]. The ANN-to-SNN conversion method bridges the continuous activation values of ANNs and the firing rates of SNNs through neuron equivalence [6, 16]. It borrows backpropagation to achieve high performance but requires long simulation timesteps and incurs high energy consumption. In this work, we employ the direct training method to fully leverage the low-power, sparse, event-driven computing of SNNs.

2.2 Neuromorphic Chips

Neuromorphic chips, inspired by the brain, merge processing and memory units, using spiking neurons and synapses as fundamental elements [58, 59]. As shown in Fig. 2, the synapse block processes incoming spikes, retrieves synaptic weights from memory and generates spike messages to be routed to other cores [50, 12, 54]. This replaces energy-consuming Conv and MLP operations with energy-efficient routing and sparse addition algorithms [13, 77], though self-attention is not yet supported. The spike-based computation grants neuromorphic chips high parallelism, scalability, and low power consumption (tens to hundreds of milliwatts) [2]. Our SWformer design adheres to the spike-driven paradigm, making it well-suited for implementation on neuromorphic chips.

Figure 2: Processing flow of a synapse block. Neuromorphic chips follow a spike-based computation paradigm, where both inputs and outputs are in spike form. [12]

2.3 Spiking Vision Transformers

Recent advancements in SNN architectures, inspired by deep learning [58, 76, 72] and brain-like processes such as long short-term memory and attention [55, 3], have significantly improved their performance while retaining the benefits of spike-driven processing. This progress has led to the creation of Spiking Transformers [83, 75, 82], which merge the effectiveness of Transformers with the energy efficiency of SNNs, providing a solution for energy-sensitive scenarios [2]. However, previous works have directly inherited the Vision Transformer architecture [67], whose core self-attention mechanism primarily focuses on low-frequency information through global exchange among non-overlapping patch tokens while neglecting high-frequency components such as fine details, local edges, and abrupt pixel-level changes [62]. Our work emphasizes the importance of high frequencies for SNNs, which is expected given that the independent and sparse spike generation of SNNs yields abundant high-frequency data across layers.

2.4 Learning in the Frequency Domain

Integrating frequency representation into SNNs is particularly important, considering that neuromorphic data reflects brightness changes corresponding to high frequencies. Additionally, while static images do not inherently contain high-frequency information, spiking neurons can encode inputs into pixel-level brightness changes, enriching the images processed by the spiking layer with frequency information (Sec 4.4). However, only a few works have applied frequency representation to SNNs, for example by devising spiking band-pass filters [34] or neurons that spike at specific frequencies [1], and these approaches struggle to capture the full frequency spectrum. More recently, Lopez et al. [44, 45] adopted time-to-value mapping for an accurate Fourier transform, but at the cost of high latency (~1024 timesteps). In this work, we combine the sparsity of the wavelet transform with the binary and sparse signaling of SNNs to provide a robust frequency representation.

3 Spiking Wavelet Transformer

We devise the SWformer, a novel attention-free architecture that combines time-frequency information with Spiking Transformers. This allows for efficient feature perception across a wide frequency range without multiplication and in an event-driven manner. We first briefly introduce the spiking neuron layer, followed by an overview of SWformer and its components.

The spiking neuron layer encodes spatio-temporal information into membrane potentials, converts them into binary spikes, and passes them on to the next layer for continued spike-based computation. Throughout this work, we consistently use the Leaky Integrate-and-Fire (LIF) neuron model [46], as it efficiently simulates biological neuron dynamics. The following equations govern the dynamics of the LIF layer:

\begin{align}
U[n] &= V[n-1] + I[n], \tag{1} \\
s[n] &= H(U[n-1] - V_{\text{th}}), \tag{2} \\
V[n] &= V_{\text{reset}}\, s[n] + (\beta U[n])(1 - s[n]), \tag{3}
\end{align}

At each timestep $n$, the current membrane potential $U[n]$ is generated by integrating the spatial input $I[n]$, which comes from input data or intermediate operations like Conv and MLP, with the temporal dynamics $V[n]$, which track the membrane potential over time. If $U[n]$ exceeds the threshold $V_{\text{th}}$, the neuron fires a spike ($s[n] = 1$); otherwise it remains inactive ($s[n] = 0$). The Heaviside step function $H(\cdot)$ determines spiking, where $H(x) = 1$ when $x \geq 0$. The temporal output $V[n]$ updates based on the spiking activity and the decay factor $\beta$. If the neuron does not fire, $U[n]$ decays to $V[n]$.
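The LIF update can be illustrated with a minimal NumPy sketch. It follows the prose description (a spike is emitted when $U[n]$ reaches $V_{\text{th}}$); the function name `lif_forward` and the values chosen for `v_th`, `v_reset`, and `beta` are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

def lif_forward(inputs, v_th=1.0, v_reset=0.0, beta=0.5):
    """Simulate a layer of LIF neurons over T timesteps.

    inputs: array of shape (T, ...) holding the spatial input I[n].
    Returns an array of the same shape with spikes in {0, 1}.
    """
    inputs = np.asarray(inputs, dtype=float)
    v = np.zeros_like(inputs[0])        # temporal state V, initialised to zero
    spikes = np.zeros_like(inputs)
    for n in range(inputs.shape[0]):
        u = v + inputs[n]                        # integrate input, as in Eq. (1)
        s = (u >= v_th).astype(float)            # Heaviside step with H(0) = 1
        v = v_reset * s + beta * u * (1.0 - s)   # reset or leak, as in Eq. (3)
        spikes[n] = s
    return spikes

# A constant sub-threshold input accumulates until the neuron fires, then resets.
out = lif_forward(np.full((4, 1), 0.6))
print(out.ravel())
```

With a constant input of 0.6, the membrane potential needs three timesteps to cross the threshold, so the neuron stays silent most of the time. This is the sparsity that spike-driven computation relies on.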

Figure 3: Overview of SWformer. We present two main innovations inspired by [83]. First, FATM improves frequency perception in Spiking Transformers using only Conv and MLP operations, ensuring compatibility with neuromorphic hardware. Second, our Frequency Learner (FL) efficiently captures spectral features through spiking frequency representation and block-diagonal multiplication. ConvBN: a Conv layer followed by a BN layer.

3.1 Overall Architecture

Fig. 3 presents SWformer, which comprises a Spiking Patch Splitting (SPS) module, Spiking Encoder blocks, and a linear classification head. The SPS module, based on the design in [83], includes a Patch Splitting Module (PSM) with the initial four spiking Conv layers. For SNNs, the input sequence is $I \in \mathbb{R}^{T \times C \times H \times W}$, where $T$ is the number of timesteps. In static datasets, each image is repeated $T$ times to create a temporal sequence, while neuromorphic datasets inherently split data into sequences of $T$ frames. For a 2D image sequence $I \in \mathbb{R}^{T \times C \times H \times W}$, the SPS is formulated as:

$$
\begin{array}{ll}
U = \operatorname{PSM}(I), & I \in \mathbb{R}^{T \times C \times H \times W},\ U \in \mathbb{R}^{T \times N \times D} \\
s = \operatorname{Spk}(U), & s \in \mathbb{R}^{T \times N \times D} \\
\operatorname{RPE} = \operatorname{ConvBN}(s), & \operatorname{RPE} \in \mathbb{R}^{T \times N \times D} \\
U_0 = U + \operatorname{RPE}, & U_0 \in \mathbb{R}^{T \times N \times D}
\end{array}
\tag{4}
$$

where $U$ and $U_0$ denote the output membrane potential tensors of the PSM and SPS, respectively, and $\operatorname{Spk}(\cdot)$ denotes the spiking neuron layer. The resulting patches are processed by $l$ Spiking Encoder Blocks, each containing a FATM followed by a Spiking MLP block, with residual connections applied to the output membrane potentials in both blocks. FATM enables multi-scale feature extraction using the proposed spiking frequency representation (Sec 3.3). Lastly, the features from the Spiking Encoders undergo Global Average-Pooling (GAP), producing a $D$-dimensional feature, which is then fed into a fully-connected Classification Head (CH) to generate the final prediction $Y$.

The overall architecture of SWformer is:

$$
\begin{array}{ll}
S_0 = \operatorname{Spk}(U_0), & S_0 \in \mathbb{R}^{T \times N \times D} \\
U_l = \operatorname{FATM}(S_{l-1}) + U_{l-1}, & U_l \in \mathbb{R}^{T \times N \times D},\ l = 1, \ldots, M \\
S_l = \operatorname{Spk}(\operatorname{MLP}(\operatorname{Spk}(U_l)) + U_l), & S_l \in \mathbb{R}^{T \times N \times D},\ l = 1, \ldots, M \\
Y = \operatorname{CH}(\operatorname{GAP}(S_M))
\end{array}
\tag{5}
$$

where $U_l$ and $S_l$ denote the membrane potential and spike output of the FATM at the $l$-th layer, and $M$ is the total number of layers.
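The residual structure of Eq. (5) can be sketched as a short NumPy loop. This is only a data-flow illustration under strong assumptions: `spk` is a stateless threshold rather than a full LIF layer, the FATM, MLP, and classification head are replaced by hypothetical random linear maps (`linear`), and GAP is assumed to average over both the time and token axes.

```python
import numpy as np

def spk(u, v_th=1.0):
    # Stateless stand-in for Spk(.): thresholds each value independently and
    # ignores the temporal membrane dynamics of a real LIF layer.
    return (u >= v_th).astype(float)

def linear(weight):
    # Hypothetical placeholder for a learned block (FATM, MLP, or head).
    return lambda x: x @ weight

def swformer_blocks(u0, token_mixers, mlps, head):
    s = spk(u0)                        # S_0 = Spk(U_0)
    u = u0
    for fatm, mlp in zip(token_mixers, mlps):
        u = fatm(s) + u                # U_l = FATM(S_{l-1}) + U_{l-1}
        s = spk(mlp(spk(u)) + u)       # S_l = Spk(MLP(Spk(U_l)) + U_l)
    feat = s.mean(axis=(0, 1))         # GAP over time and tokens -> shape (D,)
    return head(feat), s

rng = np.random.default_rng(0)
T, N, D, M, num_classes = 4, 16, 8, 2, 10
u0 = rng.normal(size=(T, N, D))
mixers = [linear(0.1 * rng.normal(size=(D, D))) for _ in range(M)]
mlps = [linear(0.1 * rng.normal(size=(D, D))) for _ in range(M)]
head = linear(rng.normal(size=(D, num_classes)))

y, s_last = swformer_blocks(u0, mixers, mlps, head)
print(y.shape, s_last.shape)
```

The point of the sketch is that residual connections act on membrane potentials, while every block consumes spikes, so the inputs to the learned operators stay binary.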

3.2 Frequency-Aware Token Mixer

We propose the FATM, a novel component designed to facilitate the mixing of tokens across a wide frequency range in SNNs, serving as an alternative to self-attention-based token mixers in Spiking Transformers [83, 82, 75]. As shown in Fig. 3, the FATM operates on all channels concurrently through three parallel branches: (1) a Frequency Learner (FL), which uses the spiking wavelet transform for time-frequency domain learning; (2) a Spatial Learner (SL), which adopts a $3 \times 3$ Conv for extracting spatial features; and (3) a Channel Mixer (CM), which uses a spiking point-wise convolution for cross-channel information fusion. This design is inspired by the effectiveness of wavelet neural operators [66] and the local perception capability of Conv operations.

To enhance computational parallelism and parameter efficiency, we employ a block-diagonal structure in the FL by splitting the $d \times d$ weight matrix into $k$ smaller $d/k \times d/k$ matrices (Fig. 3.3). Given $k$ weight blocks, the input feature sequence $S_l \in \mathbb{R}^{T \times N \times D}$ is reshaped into $S_l' \in \mathbb{R}^{Tk \times N/k \times H \times W}$ and processed by the FATM:

$$
\begin{array}{ll}
U_{FL}^{l} = \operatorname{FL}(S_l'), & U_{FL}^{l} \in \mathbb{R}^{Tk \times N/k \times H \times W},\ l = 1, \ldots, M \\
U_{SL}^{l} = \operatorname{ConvBN}(S_l'), & U_{SL}^{l} \in \mathbb{R}^{Tk \times N/k \times H \times W},\ l = 1, \ldots, M \\
U_{CM}^{l} = \operatorname{ConvBN}(S_l'), & U_{CM}^{l} \in \mathbb{R}^{Tk \times N/k \times H \times W},\ l = 1, \ldots, M \\
U_{FATM}^{l} = U_{FL}^{l} + U_{SL}^{l} + U_{CM}^{l}, & U_{FATM}^{l} \in \mathbb{R}^{Tk \times N/k \times H \times W},\ l = 1, \ldots, M
\end{array}
\tag{6}
$$

where $U_{FL}^{l}$, $U_{SL}^{l}$, and $U_{CM}^{l}$ represent the membrane potential outputs of the FL, SL, and CM, respectively. After processing, $U_{FATM}^{l}$ is reshaped back to $\mathbb{R}^{T \times N \times D}$. The SL and CM use $3 \times 3$ and $1 \times 1$ convolutions, respectively, leveraging CNNs' powerful capabilities to enhance local feature learning. Note that while block-diagonal multiplication is used only in the FL, we also reduce channels in the SL and CM, so their parameter counts decrease linearly as the number of weight blocks $k$ grows.
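The block-diagonal structure can be made concrete with a small NumPy sketch. It shows that multiplying by $k$ independent $d/k \times d/k$ blocks matches multiplication by a dense $d \times d$ matrix whose off-diagonal blocks are zero, while storing $d^2/k$ instead of $d^2$ weights. The function name `block_diag_matmul` and the toy sizes are illustrative assumptions.

```python
import numpy as np

def block_diag_matmul(x, blocks):
    """Multiply x (..., d) by a block-diagonal matrix stored as (k, d/k, d/k)."""
    k, b, _ = blocks.shape
    xg = x.reshape(*x.shape[:-1], k, b)              # split channels into k groups
    yg = np.einsum('...kb,kbc->...kc', xg, blocks)   # one small matmul per group
    return yg.reshape(x.shape)

rng = np.random.default_rng(0)
d, k = 8, 4
b = d // k
blocks = rng.normal(size=(k, b, b))
x = rng.normal(size=(3, d))

# Dense reference: place each block on the diagonal of a d x d matrix.
dense = np.zeros((d, d))
for i in range(k):
    dense[i * b:(i + 1) * b, i * b:(i + 1) * b] = blocks[i]

assert np.allclose(block_diag_matmul(x, blocks), x @ dense)
print(blocks.size, dense.size)  # 16 stored weights versus 64 for the dense matrix
```

Because the $k$ groups never interact inside this product, they can be evaluated as a single batched matrix multiplication; cross-channel mixing is then left to the Channel Mixer branch.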

3.3 Frequency Learner

The FL projects features to a transform domain, weighting and passing specific frequency modes. Specifically, as shown in Fig. 3, the FL converts raw inputs to the time-frequency domain, weights them, and then converts them back to the time domain. It incorporates two key designs: a robust spiking frequency representation that links the sparsity of the wavelet transform with the binary and sparse signaling of SNNs, and a modularized weight matrix that enhances parameter efficiency and computational parallelism.


(a) Comparison of the standard Haar transform, binary spiking Haar transform, and ternary spiking Haar transform. Higher Peak Signal-to-Noise Ratio values indicate greater similarity between the images. (b) Schematic of the block-diagonal matrix.

3.3.1 Frequency representation in SNNs

The spiking frequency representation addresses the challenges of incorporating spike-driven frequency information in SNNs. Our approach is driven by two key insights: (1) the spiking output and reset rules of neuromorphic chips (Fig. 2) require converting intermediate results to spikes, complicating the direct use of signal transform algorithms with cascaded matrix multiplications; and (2) the amplitude-dependent response of spiking neurons makes algorithms with complex-number operations, like the Fourier transform, resource-intensive, as they need separate neuron banks for the real and imaginary parts. We create the spiking frequency representation using the Haar wavelet transform. This method captures high-frequency details and low-frequency approximations with a sparse representation, while the wavelet transform's decorrelation property enhances signal robustness. The spiking Haar forward and inverse transforms are formulated as:

\begin{align}
H_{\text{f}} &= \operatorname{Spk}\big(W_{\text{haar}} \cdot \operatorname{Spk}(I \cdot W_{\text{haar}}^{\top})\big) \tag{7} \\
I &= \operatorname{Spk}\big(W_{\text{haar}}^{\top} \cdot \operatorname{Spk}(H_{\text{f}} \cdot W_{\text{haar}})\big) \notag \\
W_{\text{haar}}(n) &=
\begin{cases}
[1], & \text{if } n = 1, \\
\dfrac{1}{\sqrt{2}} \begin{bmatrix} W_{\text{haar}}(n-1) \otimes [1, 1] \\ I_{2^{n-1}} \otimes [1, -1] \end{bmatrix}, & \text{if } n > 1
\end{cases} \tag{8}
\end{align}

where $I$ and $H_{\text{f}}$ represent the raw input and the matrix after the Haar forward transform, respectively, and $W_{\text{haar}}$ denotes the transformation matrix. Ideally, the Haar inverse transform recovers $I$ from $H_{\text{f}}$ without any error.
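As a reference point for the spiking version, the sketch below builds an orthonormal Haar matrix with the Kronecker recursion and checks that the non-spiking forward and inverse transforms reconstruct the input exactly. The helper `haar_matrix` is indexed by matrix size rather than by the level $n$ of Eq. (8), which is an assumption about the intended indexing, and the $\operatorname{Spk}(\cdot)$ steps are deliberately omitted.

```python
import numpy as np

def haar_matrix(size):
    """Orthonormal Haar matrix of shape (size, size); size must be a power of two."""
    if size == 1:
        return np.array([[1.0]])
    half = size // 2
    top = np.kron(haar_matrix(half), [[1.0, 1.0]])   # pairwise sums: low-frequency rows
    bottom = np.kron(np.eye(half), [[1.0, -1.0]])    # pairwise differences: high-frequency rows
    return np.vstack([top, bottom]) / np.sqrt(2.0)

W = haar_matrix(8)
rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))

coeffs = W @ image @ W.T      # forward transform without spiking
restored = W.T @ coeffs @ W   # inverse transform without spiking

assert np.allclose(W @ W.T, np.eye(8))   # W is orthonormal
assert np.allclose(restored, image)      # perfect reconstruction
```

Every nonzero entry of the matrix has one of a few fixed magnitudes with a positive or negative sign, so up to scaling the transform reduces to additions and subtractions. The negative entries are the reason signed spikes become relevant once the intermediate results are quantized.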

Binary SNNs only generate {0, 1} spikes, whereas the Haar transform in Eq. 8 includes negative terms, so a purely binary spiking implementation incurs significant errors. We therefore incorporate negative spike dynamics supported by neuromorphic chips [12, 56], expanding spike values to {-1, 0, 1}. The ternary neuron model is expressed as:

\begin{align}
U[n] &= V[n-1] + I[n], \tag{9} \\
s[n] &= H_{sym}(U[n-1] - V_{\text{th}}), \tag{10} \\
V[n] &= V_{\text{reset}}\, s[n] + U[n]\,(1 - s[n]), \tag{11}
\end{align}

where $H_{sym}(\cdot)$ refers to a symmetric Heaviside step function, defined as $H_{sym}(x) = 1$ when $x \geq 0$, and $H_{sym}(x) = -1$ when $x < 0$. We adopt the integrate-and-fire neuron [7], equivalent to the LIF neuron with $\beta = 1$, for accurate signal transformation. As shown in Fig. 3.3, incorporating negative spike dynamics significantly enhances the quality of the spiking frequency representation. The spiking frequency representation, or spiking wavelet transform, includes forward and inverse transform processes, with negative spike dynamics used only in this part.
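As an illustration of negative spike dynamics, the sketch below simulates a ternary integrate-and-fire neuron in NumPy. It is a simplified reading rather than the exact model: it assumes a two-sided threshold (a +1 spike when the potential reaches $V_{\text{th}}$, a -1 spike when it falls to $-V_{\text{th}}$, and no spike otherwise) and a hard reset to $V_{\text{reset}}$ after any spike. The function name `ternary_if_neuron` and its defaults are hypothetical.

```python
import numpy as np


def ternary_if_neuron(inputs, v_th=1.0, v_reset=0.0):
    """Simulate a ternary integrate-and-fire neuron.

    inputs: array of shape (T, ...) holding the input current per timestep.
    Returns an array of the same shape with spikes in {-1, 0, 1}.
    """
    inputs = np.asarray(inputs, dtype=float)
    v = np.full(inputs.shape[1:], v_reset, dtype=float)
    spikes = np.zeros_like(inputs)
    for t in range(inputs.shape[0]):
        u = v + inputs[t]  # integrate without leak (beta = 1)
        # Assumed two-sided threshold: positive or negative spike, else silent.
        s = np.where(u >= v_th, 1.0, np.where(u <= -v_th, -1.0, 0.0))
        v = np.where(s != 0, v_reset, u)  # hard reset wherever a spike fired
        spikes[t] = s
    return spikes


out = ternary_if_neuron(np.array([0.6, 0.6, -1.5, 0.2]), v_th=1.0)
print(out)
```

In this trace the potential accumulates to 1.2 at the second step (positive spike, then reset), drops to -1.5 at the third step (negative spike, then reset), and stays sub-threshold at the last step.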

3.3.2 Modularized Weight Matrix

The modularized weight matrix enables interpretability, computational parallelization, and parameter efficiency, and its application can be interpreted as batch matrix multiplication [11]. As illustrated in Fig. 3.3(b), the block-diagonal multiplication approach divides the $d \times d$ weight matrix into $k$ smaller weight blocks, each of size $d/k \times d/k$. This reduces the parameter count from $O(d^2)$ to $O(d^2/k)$ and improves parallelism. With this structure in place, the FL processes each weight splitting block independently as follows:

\begin{equation}
\tilde{y}^{\ell}_{m,n} = W^{\ell}_{m,n}\, x^{\ell}_{m,n}, \quad \ell = 1, \ldots, k, \quad (m,n) \in H \times W \tag{12}
\end{equation}

where $m, n$ refers to the spatial coordinates of a token within the input tensor, and $\ell$ denotes the corresponding block id. Each block can be understood as a head in a multi-head self-attention mechanism, projecting the data into a specific subspace. Choosing an appropriate number of blocks is crucial for obtaining a high-dimensional representation, enabling efficient feature extraction in the frequency spectrum.
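The block-diagonal multiplication of Eq. 12 can be sketched as a batched matrix product. The NumPy example below is illustrative only: it assumes the $k$ blocks are shared across all spatial positions, and the names `block_diag_apply` and `to_dense` are hypothetical helpers. It also checks the result against the equivalent dense block-diagonal matrix and confirms that the blocks hold $d^2/k$ parameters.

```python
import numpy as np


def block_diag_apply(x: np.ndarray, blocks: np.ndarray) -> np.ndarray:
    """Apply k independent (d/k x d/k) blocks to the channel groups of x.

    x: (..., d) tokens; blocks: (k, d/k, d/k) weight blocks.
    """
    k, b, _ = blocks.shape
    x_groups = x.reshape(*x.shape[:-1], k, b)  # split channels into k groups
    # Batched matmul: block l only sees channel group l.
    y_groups = np.einsum("kij,...kj->...ki", blocks, x_groups)
    return y_groups.reshape(x.shape)


def to_dense(blocks: np.ndarray) -> np.ndarray:
    """Build the equivalent dense d x d block-diagonal matrix (for checking only)."""
    k, b, _ = blocks.shape
    dense = np.zeros((k * b, k * b))
    for l in range(k):
        dense[l * b:(l + 1) * b, l * b:(l + 1) * b] = blocks[l]
    return dense


rng = np.random.default_rng(0)
d, k = 8, 4
blocks = rng.standard_normal((k, d // k, d // k))
x = rng.standard_normal((5, 5, d))  # a 5 x 5 grid of d-dimensional tokens
y = block_diag_apply(x, blocks)
print(np.allclose(y, x @ to_dense(blocks).T), blocks.size, d * d)
```

With $d = 8$ and $k = 4$, the blocks store 16 weights instead of the 64 needed by a dense $d \times d$ matrix.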

3.4 Membrane Shortcut

Residual learning and shortcuts are crucial for training deep SNNs [28, 81, 19, 32]. These techniques aim to implement identity mapping to prevent degradation in deep networks while maintaining spike-driven computing for hardware compatibility and energy efficiency. As shown in Fig. 3.4, three mainstream shortcut techniques are commonly used in SNNs. The Vanilla Shortcut scheme [81] borrows the shortcut scheme from ANNs [27], connecting membrane potential and spike, but it cannot impose identity mapping [28]. The Spiking Element-Wise Shortcut [19] connects spikes across layers but operates in an "integer-driven" rather than "spike-driven" manner due to integral output spikes. The Membrane Shortcut [32] combines identity mapping with spike-driven computation by connecting membrane potentials between layers, optimizing membrane potential distribution. This method, used in recent Spiking Transformers [75], is also adopted in our SWformer for its biological plausibility and high performance [75, 32].

Fig. 3.4: Mainstream shortcut schemes in SNNs.
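The difference between the three shortcut schemes can be illustrated with a heavily simplified NumPy sketch. It assumes a single timestep, stateless threshold neurons, and a plain matrix product standing in for the convolution and normalization layers; the function names are hypothetical and the code is not the implementation used in any of the cited works.

```python
import numpy as np


def spike(u, v_th=1.0):
    """Stateless threshold neuron: emits 1 where the potential reaches v_th."""
    return (np.asarray(u) >= v_th).astype(float)


def vanilla_shortcut(s_in, w):
    """Input spikes are added to the membrane input before firing."""
    return spike(s_in @ w + s_in)


def sew_shortcut(s_in, w):
    """Spikes are added to spikes, so outputs can exceed 1 (integer-valued)."""
    return spike(s_in @ w) + s_in


def membrane_shortcut(u_in, w):
    """Membrane potentials are connected; weights only ever see binary spikes."""
    return u_in + spike(u_in) @ w


w = np.eye(3)
s = np.array([1.0, 1.0, 0.0])
u = np.array([1.5, 0.2, 0.0])
print(vanilla_shortcut(s, w), sew_shortcut(s, w), membrane_shortcut(u, w))
```

In this toy setting the spike-to-spike addition produces the value 2, which is the integer-valued output mentioned above, whereas the membrane-to-membrane addition keeps the input of the weight layer binary.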

4 Experiment

4.1 Experiment Setup

SNNs transmit spatio-temporal information, which makes them naturally suitable for temporal tasks. For static image classification, it is common practice to input the same image repeatedly at each timestep. While increasing the number of simulation timesteps can improve accuracy, it also increases training time, hardware requirements, and inference energy consumption. Neuromorphic datasets with inherent spatio-temporal dynamics can fully exploit the energy-efficient advantages of SNNs.
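As a small illustration of the static-image practice, the NumPy sketch below repeats one image along a new leading time axis. The function name `repeat_static_image` and the (T, C, H, W) layout are assumptions made for the example.

```python
import numpy as np


def repeat_static_image(image, timesteps: int) -> np.ndarray:
    """Repeat a static image of shape (C, H, W) into a (T, C, H, W) sequence."""
    image = np.asarray(image)
    return np.repeat(image[np.newaxis, ...], timesteps, axis=0)


image = np.random.default_rng(0).random((3, 32, 32))
sequence = repeat_static_image(image, timesteps=4)
print(sequence.shape)
```

Every timestep then carries identical input, so the compute and energy cost grows roughly linearly with the number of timesteps.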

4.1.1 Dataset

We evaluate our approach on a range of datasets, including static datasets such as CIFAR-10 [36], CIFAR-100 [35], and ImageNet [14], as well as neuromorphic datasets such as CIFAR10-DVS [39], N-Caltech101 [52], N-Cars [63], ActionRecognition [51], ASL-DVS [4], and NavGesture [47]. Details on training settings and energy consumption evaluation can be found in the Supplementary Material.

4.2 Performance on Static Datasets

4.2.1 ImageNet

SWformer outperforms the vanilla Spiking Transformer (SpikFormer), the Spike-driven Transformer (SD Transformer), and ResNet-based SNNs on ImageNet in terms of accuracy and efficiency. Experiments with different embedding dimensions, transformer blocks, and spiking wavelet settings demonstrate the importance of precise frequency information, as shown in Table 1. SWformer (Block=2) with the Transformer-6-512 setting achieves 74.98% accuracy, 2.52% higher than SpikFormer and 0.87% higher than SD Transformer. Increasing the number of weight splitting blocks further improves parameter efficiency and power consumption without compromising accuracy. Specifically, SWformer 8-512-block4-Vth-1 reaches 75.29% accuracy with 23.14M parameters, 22.03% fewer than the 29.68M of SpikFormer with Transformer-8-512. It also surpasses MS-ResNet-34 [31] and SEW-ResNet-34 [18], which achieve 69.42% and 67.04% accuracy, respectively, with approximately 21.8M parameters. Note that the baseline using the standard wavelet transform (rows marked -- in Table 1, shown in gray) generally achieves better performance, highlighting the importance of accurate frequency representation in SNNs, although it contradicts the spike-driven computing paradigm. To harness the energy efficiency of neuromorphic computing, exploring hardware designed for better frequency-domain processing is essential. Recently, Loihi 2 added support for the Resonate-and-Fire neuron, which can compute the Short-Time Fourier Transform, another type of time-frequency transform, on-chip [20].

Table 1: Performance comparison between the proposed model and the SOTA models on the ImageNet dataset. Models denoted with an asterisk (*) use an input resolution of 256×256, which is essential for the Haar transform to achieve optimal performance and consistent application throughout the entire input. All models were trained for 310 epochs with identical initial settings for a fair comparison. B denotes the number of weight splitting blocks. --: standard wavelet transform (non-spiking).

| Methods | Architecture | # Param (M) | Power (mJ) | Time Steps | Accuracy (%) |
|---|---|---|---|---|---|
| Hybrid training [57] (ICLR) | ResNet-34 | 21.79 | - | 250 | 61.48 |
| SEW ResNet [19] (NeurIPS) | SEW-ResNet-34 | 21.79 | 4.04 | 4 | 67.04 |
| | SEW-ResNet-50 | 25.56 | 4.89 | 4 | 67.78 |
| | SEW-ResNet-101 | 44.55 | 8.91 | 4 | 68.76 |
| | SEW-ResNet-152 | 60.19 | 12.89 | 4 | 69.26 |
| TET [15] (ICLR) | SEW-ResNet-34 | 21.79 | - | 4 | 68.00 |
| MS ResNet [31] (TNNLS) | MS-ResNet-18 | 11.69 | 4.29 | 4 | 63.10 |
| | MS-ResNet-34 | 21.80 | 5.11 | 4 | 69.42 |
| Spiking ResNet [30] (TNNLS) | ResNet-50 | 25.56 | 70.93 | 350 | 72.75 |
| tdBN [80] (AAAI) | Spiking-ResNet-34 | 21.79 | 6.39 | 6 | 63.72 |
| ANN Transformer* | Transformer-6-512 | 23.37 | 40.72 | - | 80.54 |
| SpikFormer [83] (ICLR) | Transformer-8-384 | 16.81 | 7.73 | 4 | 70.24 |
| | Transformer-6-512 | 23.37 | 9.41 | 4 | 72.46 |
| | Transformer-8-512 | 29.68 | 11.57 | 4 | 73.38 |
| SD Transformer [75] (NeurIPS) | Transformer-8-384 | 16.81 | 3.39 | 4 | 72.28 |
| | Transformer-6-512 | 23.37 | 3.56 | 4 | 74.11 |
| | Transformer-8-512 | 29.68 | 4.50 | 4 | 74.57 |
| SWformer* (B=2), -- | Transformer-6-512 | 21.8 | 4.00 | 4 | 75.09 |
| | Transformer-8-512 | 27.6 | 4.31 | 4 | 75.26 |
| SWformer* (B=2), Vth=0.5 | Transformer-6-512 | 21.8 | 3.58 | 4 | 74.98 |
| | Transformer-8-512 | 27.6 | 4.89 | 4 | 75.18 |
| SWformer* (B=2), Vth=1 | Transformer-6-512 | 21.8 | 3.87 | 4 | 74.84 |
| | Transformer-8-512 | 27.6 | 5.08 | 4 | 75.43 |
| SWformer* (B=4), -- | Transformer-6-512 | 18.46 | 3.51 | 4 | 74.86 |
| | Transformer-8-512 | 23.14 | 4.67 | 4 | 75.33 |
| SWformer* (B=4), Vth=0.5 | Transformer-6-512 | 18.46 | 3.91 | 4 | 74.62 |
| | Transformer-8-512 | 23.14 | 4.98 | 4 | 75.08 |
| SWformer* (B=4), Vth=1 | Transformer-6-512 | 18.46 | 3.75 | 4 | 74.69 |
| | Transformer-8-512 | 23.14 | 4.87 | 4 | 75.29 |
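The headline comparisons against SpikFormer can be reproduced directly from the Table 1 entries. The short Python check below is only an arithmetic illustration; the helper name `relative_reduction` is hypothetical, and all numeric inputs are copied from Table 1.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


# Parameters (M): SpikFormer Transformer-8-512 vs. SWformer (B=4) Transformer-8-512.
param_cut = relative_reduction(29.68, 23.14)

# Accuracy (%): SWformer (B=2, Vth=0.5) vs. SpikFormer, both Transformer-6-512.
acc_gain = 74.98 - 72.46

# Power (mJ): SpikFormer vs. SWformer (B=4) with Vth=0.5 and Vth=1, Transformer-6-512.
power_cut_vth05 = relative_reduction(9.41, 3.91)
power_cut_vth1 = relative_reduction(9.41, 3.75)

print(f"{param_cut:.2f} {acc_gain:.2f} {power_cut_vth05:.1f} {power_cut_vth1:.1f}")
```

The computed values agree with the reported 22.03% parameter reduction, the 2.52% accuracy gain, and the 58.4% to 60.1% power reduction up to rounding in the last digit.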

4.2.2 CIFAR10/ CIFAR100

Table 2 presents a comprehensive comparison of SWformer with current state-of-the-art (SOTA) SNN models on CIFAR-10/100. SWformer outperforms all other models in terms of top-1 accuracy on both datasets with fewer parameters and time steps. Specifically, SWformer outperforms ResNet-based SNN models such as tdBN [80] by 2.0% on CIFAR10 and 5.4% on CIFAR100, with only 59.5% of the parameters. SWformer also surpasses all Transformer-based SNNs in accuracy and parameter efficiency. The superior performance of SWformer can be attributed to the frequency-aware design of its FATM.

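Table 2: Performance comparison on CIFAR10/CIFAR100 and CIFAR10-DVS.

| Method | CIFAR10 # Param (M) | CIFAR10 T | CIFAR10 Acc. (%) | CIFAR100 # Param (M) | CIFAR100 T | CIFAR100 Acc. (%) | CIFAR10-DVS # Param (M) | CIFAR10-DVS T | CIFAR10-DVS Acc. (%) |
|---|---|---|---|---|---|---|---|---|---|
| TET [15] (ICLR) | 12.63 | 6 | 94.50 | 12.63 | 6 | 74.72 | 9.27 | 10 | 83.32 |
| tdBN [80] (AAAI) | 12.63 | 4 | 92.92 | 12.63 | 4 | 70.86 | 12.63 | 10 | 67.8 |
| TEBN [17] (NeurIPS) | 12.63 | 6 | 94.71 | 12.63 | 6 | 76.41 | - | 10 | 75.10 |
| Real Spike [23] (ECCV) | 12.63 | 6 | 95.78 | 39.9 | 10 | 71.24 | 12.63 | 10 | 72.85 |
| DSR [48] (CVPR) | 11.2 | 20 | 95.4 | 11.2 | 20 | 78.5 | 9.48 | 10 | 77.51 |
| SpikFormer [83] (ICLR) | 9.32 | 4 | 95.51 | 9.32 | 4 | 78.21 | 2.59 | 10 | 78.9 |
| | 9.32 | 6 | 95.34 | 9.32 | 4 | 78.61 | 2.59 | 16 | 80.9 |
| SD Transformer [75] (NeurIPS) | 9.32 | 4 | 95.6 | 9.32 | 4 | 78.4 | 2.59 | 16 | 80.0 |
| SWformer | 7.51 | 4 | 96.1 | 7.51 | 4 | 79.3 | 2.05 | 10 | 82.9 |
| | 7.51 | 6 | 96.3 | 7.51 | 6 | 79.6 | 2.05 | 16 | 83.9 |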
Table 3: Performance comparison of SWformer vs. SOTA on neuromorphic datasets.

| Datasets | Methods | T | Acc. (%) |
|---|---|---|---|
| N-CALTECH101 | TIM [61] | 10 | 79.00 |
| | TT-SNN [37] | 6 | 77.00 |
| | NDA [42] | 10 | 83.70 |
| | SWformer | 10 | 88.45 |
| N-CARS | CarSNN [68] | 10 | 86.00 |
| | NDA [42] | 10 | 91.90 |
| | SWformer | 10 | 96.32 |
| Action Recognition | STCA [22] | 10 | 71.20 |
| | Mb-SNN [43] | 10 | 78.10 |
| | SWformer | 10 | 88.88 |
| ASL-DVS | Meta-SNN [64] | 100 | 96.04 |
| | SWformer | 10 | 99.88 |
| NavGesture | KNN [47] | - | 95.90 |
| | SWformer | 10 | 98.49 |
Table 4: Ablation study on FATM.

| Datasets | Models | T | Acc. (%) |
|---|---|---|---|
| CIFAR100 | SWformer-4-384 (NoHaar) | 4 | 78.89 |
| | SWformer-4-384 (MaskDC) | 4 | 79.34 |
| | SWformer-4-384 (NoNeg) | 4 | 78.64 |
| | SWformer-4-384 (NoInv) | 4 | 78.79 |
| | SWformer-4-384 (base) | 4 | 79.31 |
| CIFAR10-DVS | SWformer-2-256 (NoHaar) | 16 | 80.9 |
| | SWformer-2-256 (MaskDC) | 16 | 84.0 |
| | SWformer-2-256 (NoPool) | 16 | 81.8 |
| | SWformer-2-256 (base) | 16 | 83.9 |
Figure 4: (a-b) Fourier analysis of feature maps on ImageNet for the output of SPS (top), the first token-mixer (middle), and the last layer (bottom): (a) global Spiking Self-Attention (gSSA), (b) FATM: (1-3) CM, SL, and FL. FATM captures frequency information more effectively than gSSA, ensuring feature perception across a wide frequency range. (c) Grad-CAM [60] activation map visualization of the gSSA block in Spiking Transformer [75] (top) and FATM in SWformer (bottom).

4.3 Performance on Neuromorphic Datasets

As shown in Table 2 and Table 3, the proposed SWformer outperforms SOTA SNN models on a variety of neuromorphic datasets, including CIFAR10-DVS [39], N-Caltech101 [52], and N-Cars [63], which are derived from static datasets and converted into neuromorphic data using event-based cameras. For CIFAR10-DVS [39], we revise the FATM shortcut: inputs are first processed by FL, which acts as a filter, and then by SL and CM. To further enhance high-frequency information in these datasets, a 1D max-pooling layer is placed at the beginning of the FATM module. SWformer surpasses previous SOTA models by significant margins on all neuromorphic tasks: it outperforms NDA [42] by 4.75% and 4.42% on N-CALTECH101 and N-CARS, Mb-SNN [43] by 10.78% on Action Recognition, Meta-SNN [64] by 3.84% on ASL-DVS while using 10 times fewer timesteps, and KNN [47] by 2.59% on NavGesture.

4.4 Method Analysis

4.4.1 Visualization

Existing Spiking Transformers [83, 82, 75] use global operations to exchange information among non-overlapping patch tokens, while spiking neurons transmit pixel-level brightness changes, enriching images with local information, i.e., high-frequency components. This is supported by Fig. 4(a-b), where data processed by the SPS module, the initial embedding operation before the subsequent Transformer blocks, contains rich high-frequency information. While the Spiking Transformer's gSSA primarily focuses on low frequencies (Fig. 4(a)), SWformer's FATM effectively captures specific frequency information on each channel, enabling comprehensive feature learning in the frequency spectrum and maintaining high-frequency information transmission even in deeper layers (Fig. 4(b)). This enhanced frequency learning capability facilitates more accurate and complete feature extraction, as shown in Fig. 4(c), leading to improved recognition capability.
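One common way to quantify how much high-frequency content a feature map carries is to measure the share of its Fourier power outside a low-frequency disc. The NumPy sketch below is a generic illustration of that idea and not necessarily the procedure used for Fig. 4; the function name `high_freq_ratio` and the `radius_frac` cutoff are assumptions.

```python
import numpy as np


def high_freq_ratio(feature_map: np.ndarray, radius_frac: float = 0.25) -> float:
    """Share of spectral power outside a centered low-frequency disc.

    feature_map: 2D array (H, W); radius_frac: disc radius as a fraction of min(H, W).
    """
    h, w = feature_map.shape
    spectrum = np.fft.fftshift(np.fft.fft2(feature_map))  # move DC to the center
    power = np.abs(spectrum) ** 2
    yy, xx = np.indices((h, w))
    dist = np.hypot(yy - h // 2, xx - w // 2)
    low = dist <= radius_frac * min(h, w)
    total = power.sum()
    return float(power[~low].sum() / total) if total > 0 else 0.0


flat = np.ones((8, 8))  # only a DC component
checker = np.indices((8, 8)).sum(axis=0) % 2 * 2.0 - 1.0  # +/-1 checkerboard
print(high_freq_ratio(flat), high_freq_ratio(checker))
```

A constant map has all of its power at the DC term, while a checkerboard concentrates its power at the highest representable frequency, so the two examples bracket the range of the measure.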

4.4.2 Number of Weight Splitting Blocks

An appropriate number of weight splitting blocks yields a more efficient architecture with a better balance between performance and resource utilization; the number of blocks correlates with the frequency mixing range of the transformed signals. Table 1 shows that increasing the weight splitting blocks from 2 to 4 in SWformer reduces both parameters and power consumption while maintaining high accuracy, in line with the computational scheme in Fig. 3.3(b). Specifically, SWformer (Block=4) has about 16% fewer parameters than SWformer (Block=2) for both Transformer-6-512 and Transformer-8-512 architectures. Despite the reduced resources, SWformer (Block=2) and SWformer (Block=4) achieve accuracy improvements over SpikFormer of 2.52% and 2.16% for Transformer-6-512, and 2.05% and 1.91% for Transformer-8-512. These findings highlight the importance of the number of weight splitting blocks in SWformer for creating an efficient architecture without compromising performance.

4.4.3 Firing Threshold of Spiking Frequency Representation

Spiking neurons act as a form of temporal pruning for inputs, leading to better energy efficiency. As shown in Fig. 3.3 and Table 1, the inherent sparsity and robustness of the wavelet transform, combined with negative spike dynamics, enable a significant reduction in power consumption without compromising accuracy. With the Transformer-6-512 setting, SWformer (Block=4) consumes 58.4% (Vth=0.5) to 60.1% (Vth=1) less power than SpikFormer (3.91 mJ and 3.75 mJ versus 9.41 mJ). The exact saving depends on the specific Vth value and SWformer architecture setting. Besides, the results using the standard wavelet transform without inserted spiking layers (rows marked -- in Table 1, shown with a gray background) generally achieve better performance, emphasizing the importance of precise frequency representation in SNNs, albeit contradicting the spike-driven computing paradigm. Therefore, our spiking frequency representation, which combines the time-frequency transform with SNNs, is crucial for the entire model design, enabling robust signal projection in just 4 timesteps, as shown in Table 1 and Fig. 3.3.

4.4.4 Ablation Study of Frequency-Aware Token Mixer

To better understand the advantages of the FATM, we performed ablation studies (Table 4). Removing the spiking wavelet transforms decreases accuracy from 79.31% to 78.89% on CIFAR100 and from 83.9% to 80.9% on CIFAR10-DVS, highlighting the critical role of frequency feature learning. Interestingly, masking the DC components slightly improves performance (79.34% vs. 79.31% on CIFAR100 and 84.0% vs. 83.9% on CIFAR10-DVS), highlighting the significance of high-frequency information. Furthermore, on the CIFAR10-DVS dataset, removing the 1D max-pooling operation leads to a performance drop from 83.9% to 81.8%. We also assessed the effectiveness of the spiking frequency representation. On CIFAR100, removing the negative spike dynamics or the spiking Haar inverse transform reduces accuracy from 79.31% to 78.64% and 78.79%, respectively. These findings underscore the vital importance of accurate frequency representation in FATM for optimal performance and emphasize the crucial role of high-frequency components in SNNs.
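To illustrate what masking the DC component means in the Haar domain, the sketch below zeroes the lowest-frequency coefficient of a real-valued 2D Haar transform and inverts it. This is an assumed reading of the MaskDC ablation, not the exact procedure: it treats the coefficient at index (0, 0) as the DC term, omits the spiking layers, and uses the hypothetical helpers `haar_matrix` and `mask_dc`.

```python
import numpy as np


def haar_matrix(size: int) -> np.ndarray:
    """Orthonormal Haar matrix; size is assumed to be a power of two."""
    w = np.array([[1.0]])
    while w.shape[0] < size:
        half = w.shape[0]
        top = np.kron(w, np.array([[1.0, 1.0]]))  # low-pass rows
        bottom = np.kron(np.eye(half), np.array([[1.0, -1.0]]))  # high-pass rows
        w = np.vstack([top, bottom]) / np.sqrt(2.0)
    return w


def mask_dc(x: np.ndarray) -> np.ndarray:
    """Zero the (0, 0) Haar coefficient of a 2D map and transform back."""
    w_rows, w_cols = haar_matrix(x.shape[0]), haar_matrix(x.shape[1])
    coeffs = w_rows @ x @ w_cols.T  # forward transform
    coeffs[0, 0] = 0.0  # assumed DC term: the global-average coefficient
    return w_rows.T @ coeffs @ w_cols  # inverse transform


x = np.random.default_rng(0).random((4, 8))
x_masked = mask_dc(x)
print(np.allclose(x_masked, x - x.mean()))
```

Under these assumptions, masking the DC coefficient is equivalent to subtracting the global mean, so only the variations around the mean (the higher-frequency content) are passed on.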

5 Conclusion

In this work, we develop the Spiking Wavelet Transformer (SWformer), a powerful alternative to self-attention-based token mixers, with promising performance and parameter efficiency. The core innovation of SWformer is its Frequency-Aware Token Mixer (FATM), which combines spatial learner (SL), frequency learner (FL), and channel mixing (CM) branches. This unique design enables SWformer to emphasize high-frequency components and enhance the perception capability of Spiking Transformers in the frequency spectrum. Furthermore, we introduce a novel spiking frequency representation that facilitates robust, multiplication-free, and event-driven signal transform. Extensive experiments show that SWformer surpasses representative SNNs on both static and neuromorphic datasets, underscoring the crucial role of frequency learning in spiking neural networks. We believe this study offers the community valuable insights for designing efficient and effective SNN architectures.

Acknowledgements

This work is supported by the Guangzhou-HKUST(GZ) Joint Funding Program (Grant No. 2023A03J0682) and partially supported by a collaborative project with Brain Mind Innovation, inc. Special thanks to Mr. Yijian He.

References

  • [1] Auge, D., Mueller, E.: Resonate-and-fire neurons as frequency selective input encoders for spiking neural networks (2020)
  • [2] Basu, A., Deng, L., Frenkel, C., Zhang, X.: Spiking neural network integrated circuits: A review of trends and future directions. In: 2022 IEEE Custom Integrated Circuits Conference (CICC). pp. 1–8. IEEE (2022)
  • [3] Bellec, G., Salaj, D., Subramoney, A., Legenstein, R., Maass, W.: Long short-term memory and learning-to-learn in networks of spiking neurons. Advances in neural information processing systems 31 (2018)
  • [4] Bi, Y., Chadha, A., Abbas, A., Bourtsoulatze, E., Andreopoulos, Y.: Graph-based object classification for neuromorphic vision sensing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 491–501 (2019)
  • [5] Bochner, S., Chandrasekharan, K.: Fourier transforms. No. 19, Princeton University Press (1949)
  • [6] Bu, T., Fang, W., Ding, J., Dai, P., Yu, Z., Huang, T.: Optimal ANN-SNN Conversion for High-accuracy and Ultra-low-latency Spiking Neural Networks. In: International Conference on Learning Representations (2021)
  • [7] Burkitt, A.N.: A review of the integrate-and-fire neuron model: I. homogeneous synaptic input. Biological cybernetics 95, 1–19 (2006)
  • [8] Cao, Y., Chen, Y., Khosla, D.: Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision 113(1), 54–66 (2015)
  • [9] Chen, S., Ye, T., Bai, J., Chen, E., Shi, J., Zhu, L.: Sparse sampling transformer with uncertainty-driven ranking for unified removal of raindrops and rain streaks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13106–13117 (2023)
  • [10] Chen, S., Ye, T., Liu, Y., Liao, T., Jiang, J., Chen, E., Chen, P.: Msp-former: Multi-scale projection transformer for single image desnowing. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1–5. IEEE (2023)
  • [11] Dao, T., Chen, B., Sohoni, N.S., Desai, A., Poli, M., Grogan, J., Liu, A., Rao, A., Rudra, A., Ré, C.: Monarch: Expressive structured matrices for efficient and accurate training. In: International Conference on Machine Learning. pp. 4690–4721. PMLR (2022)
  • [12] Davies, M., Srinivasa, N., Lin, T.H., Chinya, G., Cao, Y., Choday, S.H., Dimou, G., Joshi, P., Imam, N., Jain, S., et al.: Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 38(1), 82–99 (2018)
  • [13] Davies, M., Wild, A., Orchard, G., Sandamirskaya, Y., Guerra, G.A.F., Joshi, P., Plank, P., Risbud, S.R.: Advancing neuromorphic computing with loihi: A survey of results and outlook. Proceedings of the IEEE 109(5), 911–934 (2021)
  • [14] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009)
  • [15] Deng, S., Li, Y., Zhang, S., Gu, S.: Temporal efficient training of spiking neural network via gradient re-weighting. arXiv preprint arXiv:2202.11946 (2022)
  • [16] Ding, J., Yu, Z., Tian, Y., Huang, T.: Optimal ann-snn conversion for fast and accurate inference in deep spiking neural networks. arXiv preprint arXiv:2105.11654 (2021)
  • [17] Duan, C., Ding, J., Chen, S., Yu, Z., Huang, T.: Temporal effective batch normalization in spiking neural networks. Advances in Neural Information Processing Systems 35, 34377–34390 (2022)
  • [18] Fang, W., Yu, Z., Chen, Y., Huang, T., Masquelier, T., Tian, Y.: Deep residual learning in spiking neural networks. Advances in Neural Information Processing Systems 34, 21056–21069 (2021)
  • [19] Fang, W., Yu, Z., Chen, Y., Huang, T., Masquelier, T., Tian, Y.: Deep residual learning in spiking neural networks. Advances in Neural Information Processing Systems 34, 21056–21069 (2021)
  • [20] Frady, E.P., Sanborn, S., Shrestha, S.B., Rubin, D.B.D., Orchard, G., Sommer, F.T., Davies, M.: Efficient neuromorphic signal processing with resonator neurons. Journal of Signal Processing Systems 94(10), 917–927 (2022)
  • [21] Gaudart, L., Crebassa, J., Petrakian, J.P.: Wavelet transform in human visual channels. Appl. Opt. 32(22), 4119–4127 (Aug 1993). https://doi.org/10.1364/AO.32.004119, https://opg.optica.org/ao/abstract.cfm?URI=ao-32-22-4119
  • [22] Gu, P., Xiao, R., Pan, G., Tang, H.: STCA: Spatio-Temporal Credit Assignment with Delayed Feedback in Deep Spiking Neural Networks. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. pp. 1366–1372. International Joint Conferences on Artificial Intelligence Organization, Macao, China (Aug 2019). https://doi.org/10.24963/ijcai.2019/189
  • [23] Guo, Y., Zhang, L., Chen, Y., Tong, X., Liu, X., Wang, Y., Huang, X., Ma, Z.: Real spike: Learning real-valued spikes for spiking neural networks. In: European Conference on Computer Vision. pp. 52–68. Springer (2022)
  • [24] He, C., Li, K., Zhang, Y., Tang, L., Zhang, Y., Guo, Z., Li, X.: Camouflaged object detection with feature decomposition and edge reconstruction. In: CVPR. pp. 22046–22055 (2023)
  • [25] He, C., Li, K., Zhang, Y., Xu, G., Tang, L.: Weakly-supervised concealed object segmentation with sam-based pseudo labeling and multi-scale feature grouping. NeurIPS (2024)
  • [26] He, C., Shen, Y., Fang, C., Xiao, F., Tang, L., Zhang, Y., Zuo, W., Guo, Z., Li, X.: Diffusion models in low-level vision: A survey. arXiv preprint arXiv:2406.11138 (2024)
  • [27] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [28] He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14. pp. 630–645. Springer (2016)
  • [29] Hopkins, M., Pineda-Garcia, G., Bogdan, P.A., Furber, S.B.: Spiking neural networks for computer vision. Interface Focus 8(4), 20180007 (2018)
  • [30] Hu, Y., Tang, H., Pan, G.: Spiking deep residual networks. IEEE Transactions on Neural Networks and Learning Systems (2021)
  • [31] Hu, Y., Deng, L., Wu, Y., Yao, M., Li, G.: Advancing spiking neural networks towards deep residual learning. arXiv preprint arXiv:2112.08954 (2021)
  • [32] Hu, Y., Deng, L., Wu, Y., Yao, M., Li, G.: Advancing spiking neural networks toward deep residual learning. IEEE Transactions on Neural Networks and Learning Systems (2024)
  • [33] Ji, M., Wang, Z., Yan, R., Liu, Q., Xu, S., Tang, H.: Sctn: Event-based object tracking with energy-efficient deep convolutional spiking neural networks. Frontiers in Neuroscience 17, 1123698 (2023)
  • [34] Jiménez-Fernández, A., Cerezuela-Escudero, E., Miró-Amarante, L., Domínguez-Morales, M.J., de Asís Gómez-Rodríguez, F., Linares-Barranco, A., Jiménez-Moreno, G.: A binaural neuromorphic auditory sensor for fpga: a spike signal processing approach. IEEE transactions on neural networks and learning systems 28(4), 804–818 (2016)
  • [35] Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images (2009)
  • [36] Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (Nov 1998). https://doi.org/10.1109/5.726791
  • [37] Lee, D., Yin, R., Kim, Y., Moitra, A., Li, Y., Panda, P.: Tt-snn: Tensor train decomposition for efficient spiking neural network training. arXiv preprint arXiv:2401.08001 (2024)
  • [38] Lee, I., Kim, J., Kim, Y., Kim, S., Park, G., Park, K.T.: Wavelet transform image coding using human visual system. In: Proceedings of APCCAS’94-1994 Asia Pacific Conference on Circuits and Systems. pp. 619–623. IEEE (1994)
  • [39] Li, H., Liu, H., Ji, X., Li, G., Shi, L.: CIFAR10-DVS: An Event-Stream Dataset for Object Classification. Frontiers in Neuroscience 11 (2017)
  • [40] Li, Q., Shen, L., Guo, S., Lai, Z.: Wavelet integrated cnns for noise-robust image classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7245–7254 (2020)
  • [41] Li, Y., Kim, Y., Park, H., Geller, T., Panda, P.: Neuromorphic data augmentation for training spiking neural networks. In: European Conference on Computer Vision. pp. 631–649. Springer (2022)
  • [42] Li, Y., Kim, Y., Park, H., Geller, T., Panda, P.: Neuromorphic Data Augmentation for Training Spiking Neural Networks. arXiv preprint arXiv:2203.06145 (2022)
  • [43] Liu, Q., Xing, D., Tang, H., Ma, D., Pan, G.: Event-based Action Recognition Using Motion Information and Spiking Neural Networks. In: Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence. pp. 1743–1749. International Joint Conferences on Artificial Intelligence Organization, Montreal, Canada (Aug 2021). https://doi.org/10.24963/ijcai.2021/240
  • [44] López-Randulfe, J., Duswald, T., Bing, Z., Knoll, A.: Spiking neural network for fourier transform and object detection for automotive radar. Frontiers in Neurorobotics 15, 688344 (2021)
  • [45] López-Randulfe, J., Reeb, N., Karimi, N., Liu, C., Gonzalez, H.A., Dietrich, R., Vogginger, B., Mayr, C., Knoll, A.: Time-coded spiking fourier transform in neuromorphic hardware. IEEE Transactions on Computers 71(11), 2792–2802 (2022)
  • [46] Maass, W.: Networks of spiking neurons: the third generation of neural network models. Neural networks 10(9), 1659–1671 (1997)
  • [47] Maro, J.M., Ieng, S.H., Benosman, R.: Event-based gesture recognition with dynamic background suppression using smartphone computational capabilities. Frontiers in Neuroscience 14, 275 (2020)
  • [48] Meng, Q., Xiao, M., Yan, S., Wang, Y., Lin, Z., Luo, Z.Q.: Training High-Performance Low-Latency Spiking Neural Networks by Differentiation on Spike Representation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12444–12453 (2022)
  • [49] Meng, Q., Yan, S., Xiao, M., Wang, Y., Lin, Z., Luo, Z.Q.: Training much deeper spiking neural networks with a small number of time-steps. Neural Networks 153, 254–268 (2022)
  • [50] Merolla, P.A., Arthur, J.V., Alvarez-Icaza, R., Cassidy, A.S., Sawada, J., Akopyan, F., Jackson, B.L., Imam, N., Guo, C., Nakamura, Y.: A million spiking-neuron integrated circuit with a scalable communication network and interface. Science 345(6197), 668–673 (2014)
  • [51] Miao, S., Chen, G., Ning, X., Zi, Y., Ren, K., Bing, Z., Knoll, A.: Neuromorphic vision datasets for pedestrian detection, action recognition, and fall detection. Frontiers in Neurorobotics 13, 38 (2019)
  • [52] Orchard, G., Jayawant, A., Cohen, G.K., Thakor, N.: Converting Static Image Datasets to Spiking Neuromorphic Datasets Using Saccades. Frontiers in Neuroscience 9 (2015)
  • [53] Park, N., Kim, S.: How do vision transformers work? arXiv preprint arXiv:2202.06709 (2022)
  • [54] Pei, J., Deng, L., Song, S., Zhao, M., Zhang, Y., Wu, S., Wang, G., Zou, Z., Wu, Z., He, W., et al.: Towards artificial general intelligence with hybrid tianjic chip architecture. Nature 572(7767), 106–111 (2019)
  • [55] Rao, A., Plank, P., Wild, A., Maass, W.: A long short-term memory for ai applications in spike-based neuromorphic hardware. Nature Machine Intelligence 4(5), 467–479 (2022)
  • [56] Rathi, N., Chakraborty, I., Kosta, A., Sengupta, A., Ankit, A., Panda, P., Roy, K.: Exploring neuromorphic computing based on spiking neural networks: Algorithms to hardware. ACM Computing Surveys 55(12), 1–49 (2023)
  • [57] Rathi, N., Srinivasan, G., Panda, P., Roy, K.: Enabling deep spiking neural networks with hybrid conversion and spike timing dependent backpropagation. arXiv preprint arXiv:2005.01807 (2020)
  • [58] Roy, K., Jaiswal, A., Panda, P.: Towards spike-based machine intelligence with neuromorphic computing. Nature 575(7784), 607–617 (2019)
  • [59] Schuman, C.D., Kulkarni, S.R., Parsa, M., Mitchell, J.P., Kay, B., et al.: Opportunities for neuromorphic computing algorithms and applications. Nature Computational Science 2(1), 10–19 (2022)
  • [60] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision. pp. 618–626 (2017)
  • [61] Shen, S., Zhao, D., Shen, G., Zeng, Y.: Tim: An efficient temporal interaction module for spiking transformer. arXiv preprint arXiv:2401.11687 (2024)
  • [62] Si, C., Yu, W., Zhou, P., Zhou, Y., Wang, X., Yan, S.: Inception transformer. Advances in Neural Information Processing Systems 35, 23495–23509 (2022)
  • [63] Sironi, A., Brambilla, M., Bourdis, N., Lagorce, X., Benosman, R.: HATS: Histograms of Averaged Time Surfaces for Robust Event-Based Object Classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1731–1740 (2018)
  • [64] Stewart, K.M., Neftci, E.O.: Meta-learning spiking neural networks with surrogate gradient descent. Neuromorphic Computing and Engineering 2(4), 044002 (2022)
  • [65] Su, Q., Chou, Y., Hu, Y., Li, J., Mei, S., Zhang, Z., Li, G.: Deep directly-trained spiking neural networks for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6555–6565 (2023)
  • [66] Tripura, T., Chakraborty, S.: Wavelet neural operator for solving parametric partial differential equations in computational mechanics problems. Computer Methods in Applied Mechanics and Engineering 404, 115783 (2023)
  • [67] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
  • [68] Viale, A., Marchisio, A., Martina, M., Masera, G., Shafique, M.: Carsnn: An efficient spiking neural network for event-based autonomous cars on the loihi neuromorphic research processor. In: 2021 International Joint Conference on Neural Networks (IJCNN). pp. 1–10. IEEE (2021)
  • [69] Wang, Z., Fang, Y., Cao, J., Xu, R.: Bursting spikes: Efficient and high-performance snns for event-based vision. arXiv preprint arXiv:2311.14265 (2023)
  • [70] Wang, Z., Fang, Y., Cao, J., Zhang, Q., Wang, Z., Xu, R.: Masked spiking transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1761–1771 (2023)
  • [71] Wu, H., Yang, Y., Chen, H., Ren, J., Zhu, L.: Mask-guided progressive network for joint raindrop and rain streak removal in videos. In: Proceedings of the 31st ACM International Conference on Multimedia. pp. 7216–7225 (2023)
  • [72] Yang, Y., Wu, H., Aviles-Rivero, A.I., Zhang, Y., Qin, J., Zhu, L.: Genuine knowledge from practice: Diffusion test-time adaptation for video adverse weather removal. arXiv preprint arXiv:2403.07684 (2024)
  • [73] Yang, Z., Wu, Y., Wang, G., Yang, Y., Li, G., Deng, L., Zhu, J., Shi, L.: DashNet: A hybrid artificial and spiking neural network for high-speed object tracking. arXiv preprint arXiv:1909.12942 (2019)
  • [74] Yao, M., Gao, H., Zhao, G., Wang, D., Lin, Y., Yang, Z., Li, G.: Temporal-wise attention spiking neural networks for event streams classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10221–10230 (2021)
  • [75] Yao, M., Hu, J., Zhou, Z., Yuan, L., Tian, Y., Xu, B., Li, G.: Spike-driven transformer. arXiv preprint arXiv:2307.01694 (2023)
  • [76] Yao, M., Zhao, G., Zhang, H., Hu, Y., Deng, L., Tian, Y., Xu, B., Li, G.: Attention spiking neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
  • [77] Ye, C., Kornijcuk, V., Yoo, D., Kim, J., Jeong, D.S.: LaCERA: Layer-centric event-routing architecture. Neurocomputing 520, 46–59 (2023)
  • [78] Ye, T., Zhang, Y., Jiang, M., Chen, L., Liu, Y., Chen, S., Chen, E.: Perceiving and modeling density for image dehazing. In: European Conference on Computer Vision. pp. 130–145. Springer (2022)
  • [79] Zhang, J., Dong, B., Zhang, H., Ding, J., Heide, F., Yin, B., Yang, X.: Spiking Transformers for Event-Based Single Object Tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8801–8810 (2022)
  • [80] Zheng, H., Wu, Y., Deng, L., Hu, Y., Li, G.: Going deeper with directly-trained larger spiking neural networks. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 11062–11070 (2021)
  • [81] Zheng, H., Wu, Y., Deng, L., Hu, Y., Li, G.: Going deeper with directly-trained larger spiking neural networks. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 11062–11070 (2021)
  • [82] Zhou, C., Yu, L., Zhou, Z., Zhang, H., Ma, Z., Zhou, H., Tian, Y.: Spikingformer: Spike-driven residual learning for transformer-based spiking neural network. arXiv preprint arXiv:2304.11954 (2023)
  • [83] Zhou, Z., Zhu, Y., He, C., Wang, Y., Yan, S., Tian, Y., Yuan, L.: Spikformer: When spiking neural network meets transformer. arXiv preprint arXiv:2209.15425 (2022)
  • [84] Zhu, R.J., Wang, Z., Gilpin, L., Eshraghian, J.K.: Autonomous driving with spiking neural networks. arXiv preprint arXiv:2405.19687 (2024)