NeuroBench: A Framework for Benchmarking Neuromorphic Computing Algorithms and Systems

Jason Yik (Harvard University; correspondence to: jyik@g.harvard.edu); Korneel Van den Berghe (Harvard University, Delft University of Technology); Douwe den Blanken (Delft University of Technology); Younes Bouhadjar (Forschungszentrum Jülich); Maxime Fabre (University of Groningen); Paul Hueber (Delft University of Technology, IMEC Netherlands); Denis Kleyko (Research Institutes of Sweden, Örebro University); Noah Pacik-Nelson (Accenture Labs); Pao-Sheng Vincent Sun (City University of Hong Kong); Guangzhi Tang (IMEC Netherlands); Shenqi Wang (IMEC Netherlands, Eindhoven University of Technology); Biyan Zhou (City University of Hong Kong); Soikat Hasan Ahmed (Forschungszentrum Jülich); George Vathakkattil Joseph (Innatera Nanosystems B.V.); Benedetto Leto (Politecnico di Torino); Aurora Micheli (Delft University of Technology); Anurag Kumar Mishra (Forschungszentrum Jülich); Gregor Lenz (NeuroBus); Tao Sun (Centrum Wiskunde & Informatica); Zergham Ahmed (Harvard University); Mahmoud Akl (SpiNNcloud Systems GmbH); Brian Anderson (Intel); Andreas G. Andreou (Johns Hopkins University); Chiara Bartolozzi (Istituto Italiano di Tecnologia); Arindam Basu (City University of Hong Kong); Petrut Bogdan (Innatera Nanosystems B.V.); Sander Bohte (Centrum Wiskunde & Informatica); Sonia Buckley (National Institute of Standards and Technology); Gert Cauwenberghs (UCSD); Elisabetta Chicca (University of Groningen); Federico Corradi (Eindhoven University of Technology); Guido de Croon (Delft University of Technology); Andreea Danielescu (Accenture Labs); Anurag Daram (UTSA); Mike Davies (Intel); Yigit Demirag (University of Zurich, ETH Zurich); Jason Eshraghian (UCSC); Tobias Fischer (Queensland University of Technology); Jeremy Forest (Cornell University); Vittorio Fra (Politecnico di Torino); Steve Furber (University of Manchester); P. Michael Furlong (U Waterloo); William Gilpin (University of Texas at Austin); Aditya Gilra (Centrum Wiskunde & Informatica); Hector A. Gonzalez (SpiNNcloud Systems GmbH); Giacomo Indiveri (University of Zurich, ETH Zurich); Siddharth Joshi (University of Notre Dame); Vedant Karia (UTSA); Lyes Khacef (Sony Europe B.V.); James C. Knight (University of Sussex); Laura Kriener (University of Bern); Rajkumar Kubendran (University of Pittsburgh); Dhireesha Kudithipudi (UTSA); Yao-Hong Liu (IMEC Netherlands); Shih-Chii Liu (University of Zurich, ETH Zurich); Haoyuan Ma (CentraleSupélec, Université Paris-Saclay); Rajit Manohar (Yale University); Josep Maria Margarit-Taulé (Instituto de Microelectrónica de Barcelona); Christian Mayr (Technische Universität Dresden); Konstantinos Michmizos (Rutgers University); Dylan Muir (SynSense AI); Emre Neftci (Forschungszentrum Jülich, RWTH Aachen); Thomas Nowotny (University of Sussex); Fabrizio Ottati (Politecnico di Torino); Ayca Ozcelikkale (Uppsala University); Priyadarshini Panda (Yale University); Jongkil Park (Korea Institute of Science and Technology); Melika Payvand (University of Zurich, ETH Zurich); Christian Pehle (Heidelberg University); Mihai A. Petrovici (University of Bern); Alessandro Pierro (Intel); Christoph Posch (Prophesee); Alpha Renner (Forschungszentrum Jülich); Yulia Sandamirskaya (Intel, ZHAW); Clemens JS Schaefer (University of Notre Dame); André van Schaik (Western Sydney University); Johannes Schemmel (Heidelberg University); Samuel Schmidgall (Johns Hopkins University); Catherine Schuman (University of Tennessee); Jae-sun Seo (Cornell Tech); Sadique Sheik (SynSense AI); Sumit Bam Shrestha (Intel); Manolis Sifalakis (IMEC Netherlands); Amos Sironi (Prophesee); Matthew Stewart (Harvard University); Kenneth Stewart (UCI, Forschungszentrum Jülich); Terrence C. Stewart (National Research Council Canada); Philipp Stratmann (Intel); Jonathan Timcheck (Intel); Nergis Tömen (Delft University of Technology); Gianvito Urgese (Politecnico di Torino); Marian Verhelst (KU Leuven); Craig M. Vineyard (Sandia National Laboratories); Bernhard Vogginger (Technische Universität Dresden); Amirreza Yousefzadeh (IMEC Netherlands); Fatima Tuz Zohora (UTSA); Charlotte Frenkel (Delft University of Technology; joint supervision); Vijay Janapa Reddi (Harvard University; joint supervision)

Abstract

Neuromorphic computing shows promise for advancing computing efficiency and capabilities of AI applications using brain-inspired principles. However, the neuromorphic research field currently lacks standardized benchmarks, making it difficult to accurately measure technological advancements, compare performance with conventional methods, and identify promising future research directions. Prior neuromorphic computing benchmark efforts have not seen widespread adoption due to a lack of inclusive, actionable, and iterative benchmark design and guidelines. To address these shortcomings, we present NeuroBench: a benchmark framework for neuromorphic computing algorithms and systems. NeuroBench is a collaboratively-designed effort from an open community of nearly 100 co-authors across over 50 institutions in industry and academia, aiming to provide a representative structure for standardizing the evaluation of neuromorphic approaches. The NeuroBench framework introduces a common set of tools and systematic methodology for inclusive benchmark measurement, delivering an objective reference framework for quantifying neuromorphic approaches in both hardware-independent (algorithm track) and hardware-dependent (system track) settings. In this article, we present initial performance baselines across various model architectures on the algorithm track and outline the system track benchmark tasks and guidelines. NeuroBench is intended to continually expand its benchmarks and features to foster and track the progress made by the research community.

keywords:
benchmark, neuromorphic

Introduction

In recent years, the rapid growth of artificial intelligence (AI) and machine learning (ML) has resulted in increasingly complex and large models in pursuit of higher accuracy and range of use cases [1]. The substantial growth rate of model computation exceeds efficiency gains realized through Moore and Dennard technology scaling [2], indicating a looming limit to continued advancements with existing techniques. This issue is compounded by the open challenges of adapting such methods for resource-constrained edge devices (tinyML) in order to enable pervasive and decentralized intelligence through the Internet of Things (IoT) [3]. As such, the urgency for exploring new resource-efficient and scalable computing architectures has intensified.

Neuromorphic computing has emerged as a promising area in addressing these challenges, aiming to unlock key hallmarks of biological intelligence by porting primitives and computational strategies employed in the brain into engineered computing devices and algorithms [4, 5, 6]. Neuromorphic systems hold a critical position in the investigation of novel architectures, as the brain exemplifies an exceptional model for accomplishing scalable, energy-efficient, and real-time embodied computation.

Initially, the term “neuromorphic” referred specifically to approaches that aimed to emulate the biophysics of the brain by leveraging physical properties of silicon, as proposed by Mead in the 1980s [7]. However, the field of neuromorphic computing research has since grown to encompass a wide range of brain-inspired computing techniques at the algorithmic, hardware, and system levels [4]. While the range of approaches is diverse, neuromorphic computing research generally utilizes mechanisms that emulate or simulate biophysical properties more closely than conventional methods do, aiming to reproduce high-level performance and efficiency characteristics of biological neural systems.

Neuromorphic algorithms [8] encompass neuroscience-inspired methods which strive towards goals of expanded learning capabilities, such as predictive intelligence, data efficiency, and adaptation, and include approaches such as spiking neural networks (SNNs) and primitives of neuron dynamics, plastic synapses, and heterogeneous network architectures. Algorithm exploration often makes use of simulated execution on readily-available conventional hardware such as CPUs and GPUs, with the goal of driving design requirements for next-generation neuromorphic hardware.

Neuromorphic systems [9] are composed of algorithms deployed to hardware, which seek greater energy efficiency, real-time processing capabilities, and resilience compared to conventional systems. Neuromorphic hardware utilizes a variety of biologically-inspired hardware approaches, including analog neuron emulation, event-based computation, non-von-Neumann architectures, and in-memory processing. Neuromorphic systems target a wide range of applications, from neuroscientific exploration, to low-power edge intelligence and datacenter-scale acceleration.

Despite its promise, progress in neuromorphic research is impeded by the absence of fair and widely-adopted objective metrics and benchmarks [10, 8]. Without such benchmarks, the validity of neuromorphic solutions cannot be directly quantified, hindering the research community from measuring technological advancement. Standard and rigorous benchmarking is necessary for the neuromorphic community to objectively assess and compare the achievements of novel approaches, and to make evidence-based decisions on which directions show promise for achieving breakthrough efficiency, speed, and intelligence. This in turn helps to focus research and commercialization efforts on techniques that concretely improve on prior work and conventional computing. Neuromorphic benchmarks have been previously proposed for classical vision [11, 12] and audition tasks [13], open-loop [14] and closed-loop [15] tasks, and for SNN simulator performance assessment [16]. While prior works have made valuable contributions, there are opportunities to further advance the field by addressing three outstanding challenges:

  • Lack of a formal definition. The variety of approaches to exploring brain-inspired principles creates difficulties in defining a set of criteria for what should be benchmarked as a “neuromorphic” solution. Closed definitions can impose narrow assumptions and thus risk unfairly excluding promising methods. This challenge necessitates inclusive benchmarks that can be applied generally across the spectrum of potential approaches, allowing for flexible implementation while focusing on task capabilities and metrics of interest such as temporal processing and efficiency. Furthermore, the benchmarks should ideally allow for direct comparison of neuromorphic and conventional approaches.

  • Implementation diversity. A wide array of different frameworks targeting different goals, such as neuroscientific exploration [17] and automatic SNN training [18], are used in neuromorphic research. This diversity, which has been instrumental in exploring the landscape of bio-inspired techniques following different methodologies and abstraction levels, comes at the cost of portability and standardization, which in turn limits the ease of benchmark implementation. Benchmarks require common infrastructure that unites tooling to enable actionable implementation and comparison of new methods.

  • Rapid research evolution. Neuromorphic approaches are continually and rapidly evolving as part of an emerging field. As the research community continues to make technological progress, so too should benchmark suites and methodology expand to foster inclusion and capture salient performance metrics. An iterative benchmark framework with structured versioning will facilitate productive foundational and evolving performance evaluation.

Figure 1: The two NeuroBench tracks: algorithms and systems. Grey boxes designate what is defined by the benchmark, and orange boxes indicate what is unique to each solution. Connecting arrows between the two tracks denote the co-innovation between the tracks and the cross-stack innovation enabled by this approach. Between algorithm and system solutions, best-performing results from each track can motivate future solutions to the other. In addition, system metrics and results can inform hardware-independent algorithmic complexity metrics.

To tackle these challenges, this article presents NeuroBench, a dual-track, multi-task benchmark framework. NeuroBench addresses the existing neuromorphic benchmark challenges by advancing prior work in three distinct ways. Firstly, the benchmark framework reduces assumptions regarding the specific solution being assessed, encouraging inclusive participation of neuromorphic and non-neuromorphic approaches by utilizing general, task-level benchmarking and hierarchical metric definitions which capture key performance indicators of interest. Secondly, the NeuroBench benchmarks are associated with a common open-source benchmark harness tool which facilitates actionable benchmark implementation and offers structure for further expansion to neuromorphic algorithm frameworks and systems. Finally, NeuroBench establishes an iterative, community-driven initiative designed to evolve over time to ensure representation and relevance to neuromorphic research, analogous to the well-established MLPerf benchmark framework for machine learning [19, 20]. As a whole, NeuroBench intends to align the neuromorphic research community on standard benchmarking, providing a dynamically evolving platform to ensure ongoing relevance and facilitate advancements through workshops, competitions, and a centralized leaderboard.

As Figure 1 shows, the NeuroBench framework involves two tracks to enable agile algorithm and system development. As an emerging technology, neuromorphic hardware has not converged to a single commercially available platform, so a large fraction of neuromorphic research explores algorithmic advancement on conventional systems which may not be optimal for performance. For this reason, NeuroBench consists of an algorithm track for hardware-independent evaluation and a system track for fully deployed solutions. The algorithm track defines four novel benchmarks for neuromorphic methods across diverse domains, namely few-shot continual learning, computer vision, motor cortical decoding, and chaotic forecasting, and utilizes complexity metrics to analyze solution costs. Such hardware-independent benchmarking enables algorithmic exploration and prototyping, especially when simulating algorithm execution on non-neuromorphic platforms. Meanwhile, the system track defines standard protocols to measure the real-world speed and efficiency of neuromorphic hardware on benchmarks ranging from standard machine learning tasks to promising fields for neuromorphic systems, such as optimization. Both the algorithm and system tracks will be extended and co-developed as NeuroBench continues to expand.

The following Results section organizes descriptions of the algorithm track benchmark framework and its baseline results, as well as specifications of the system track benchmark framework and tasks. Further details regarding the benchmark metric formulations, task specifications, and baseline solutions can be found in the Methods section. Baseline results on NeuroBench benchmarks outline unexplored research opportunities in optimizing algorithmic architectures and training of sparse, stateful models to achieve greater performance and resource efficiency. As NeuroBench is intended to continually grow over time, the latest developments and opportunities to engage with the project are reported on the website (https://neurobench.ai).

Results

The complete NeuroBench framework is shown in Figure 1. It includes two tracks with defined datasets, metrics, and modular evaluation components to enable flexible development. The algorithm track focuses on hardware-independent algorithm prototyping to identify promising methods. These in turn inform system design by highlighting target algorithms for optimization and relevant system workloads for benchmarking. The system track enables optimization and evaluation of performant implementations, providing feedback to refine algorithmic complexity modeling and analysis. The interplay between the tracks creates a virtuous cycle: algorithm innovations guide system implementation, while system-level insights accelerate further algorithmic progress. This approach allows NeuroBench to advance neuromorphic algorithm-system co-design.

In the next few sections, we describe the algorithm track, including general complexity metric definitions, benchmark tasks, and common infrastructure tooling. We apply the framework to report baseline results for each algorithm benchmark. Then, we specify protocols and tasks established in the system track to assess deployed neuromorphic performance across promising application workloads. By outlining both tracks, we provide a roadmap towards standardizing benchmark procedures in both hardware-independent and hardware-dependent settings.

Algorithm Track Benchmark Framework

Figure 2: An overview of the NeuroBench algorithm track.

The algorithm benchmark track aims to evaluate algorithms in a system-independent manner, separating algorithm performance from specific implementation details. The implementation platform can thus be ill-matched to the particular algorithm benchmark that it executes (e.g., SNN execution via dense matrix multiplication on a GPU), and the algorithm complexity and expected performance can be examined in a theoretical manner, motivating agile prototyping and functional analysis. Furthermore, minimal assumptions are made about the solutions tested, promoting inclusion of diverse algorithmic approaches.

The framework, as illustrated in Figure 2, is composed of inclusively-defined benchmark metrics, datasets and data loaders, and common harness infrastructure, shown in red. The metrics focus on assessing algorithm correctness on specific tasks as well as capturing general metrics that reflect the architectural complexity, computational demands, and storage requirements of the models. The datasets and data loaders specify the details of the tasks used for evaluation and ensure consistency across benchmarks. Finally, the harness infrastructure automates runtime execution and result output for the algorithm benchmark specified by the input interface, which consists of the user’s model and customizable components for data processing and desired metrics, shown in green and orange.

Algorithm Track Metrics

The algorithm track establishes solution-agnostic primary metrics which are generally relevant to all types of solutions, including artificial and spiking neural networks (ANNs, SNNs). Firstly, there are correctness metrics, which measure the quality of the model predictions on the particular task, such as accuracy, mean average precision (mAP), and mean-squared error (MSE). The correctness metrics are specified per task for each benchmark. Next, there are complexity metrics, which measure the computational demands of the algorithm. In the first iteration of the NeuroBench algorithm track, we assume a digital, time-stepped execution of the algorithm and define the following complexity metrics:

  • Footprint – A measure of the memory footprint, in bytes, required to represent a model, which reflects quantization, parameters, and buffering requirements. The metric summarizes (and can be further broken down into) synaptic weight count, weight precision, trainable neuron parameters, data buffers, etc. Zero weights are included, as they are distinguished in the connection sparsity metric.

  • Model Execution Rate – Execution rate, in Hz, of the model computation based on forward inference passes per second, measured in the time-stepped simulation timescale. The time is correlated to real-world data time. For example, if a model is designed to process data from an event camera with 50 ms input stride, the model execution rate is 20 Hz. This metric provides intuition into the deployed real-time responsiveness of a model, as well as its computational requirements.

  • Connection Sparsity – For a given model, the connection sparsity is the number of zero weights divided by the total number of weights, accumulated over all layers. 0 refers to no sparsity (fully connected) and 1 refers to full sparsity (no connections). This metric accounts for deliberate pruning and sparse network architectures.

  • Activation Sparsity – During execution, the average sparsity of neuron activations over all neurons in all model layers, for all timesteps of all tested samples, where 0 refers to no sparsity (i.e., all neurons are always activated), and 1 refers to the case where all neurons have a zero output.

  • Synaptic Operations – Average number of synaptic operations per model execution, based on neuron activations and the associated fanout synapses. This metric is further subdivided into dense, effective multiply-accumulate, and effective accumulate synaptic operations (Dense, Eff_MACs, Eff_ACs). Dense accounts for all zero and nonzero neuron activations and synaptic connections, and reflects the number of operations necessary on hardware that does not support sparsity. Eff_MACs and Eff_ACs only count effective synaptic operations by disregarding zero activations (e.g., produced by the ReLU function in an ANN or no spike in an SNN) and zero connections, thus reflecting operation cost on sparsity-aware hardware. Synaptic operations with non-binary activation are considered multiply-accumulates (MACs), while those with binary activation are considered accumulates (ACs).
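
As a rough illustration of the static metrics and the execution rate defined above, the following sketch computes footprint, connection sparsity, and model execution rate for a toy model. The representation of a model as nested lists of weights, the function names, and the 4-byte weight precision are assumptions made for this example only; this is not the NeuroBench harness implementation.

```python
# Illustrative sketch only: a "model" is a list of weight matrices given as
# nested lists. Names and the 4-byte default precision are assumptions.

def footprint_bytes(layers, bytes_per_weight=4, buffer_bytes=0):
    """Memory needed to store all weights (zeros included) plus any buffers."""
    n_weights = sum(len(row) for w in layers for row in w)
    return n_weights * bytes_per_weight + buffer_bytes


def connection_sparsity(layers):
    """Number of zero weights divided by total weights, over all layers."""
    total = sum(len(row) for w in layers for row in w)
    zeros = sum(1 for w in layers for row in w for x in row if x == 0)
    return zeros / total if total else 0.0


def execution_rate_hz(input_stride_ms):
    """Forward passes per second of data time for a given input stride."""
    return 1000.0 / input_stride_ms


layers = [
    [[0.5, 0.0, -1.0], [0.0, 0.0, 2.0]],  # 2x3 layer with 3 zero weights
    [[1.0, 0.0]],                          # 1x2 layer with 1 zero weight
]
print(footprint_bytes(layers))      # 8 weights at 4 bytes each
print(connection_sparsity(layers))  # 4 zero weights out of 8
print(execution_rate_hz(50))        # 50 ms stride, as in the event camera example
```

Note that the zero weights still contribute to the footprint here, matching the statement that sparsity is accounted for separately by the connection sparsity metric.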

Footprint and connection sparsity are classified as static metrics, which can be analytically determined from the model only. Activation sparsity, synaptic operations, and correctness are classified as workload metrics, which are dependent on execution or simulation of the model based on the benchmark data. Model execution rate is an exception, as it is a feature of the algorithm which neither needs to be calculated nor extracted from the model or its outputs, and thus is reported directly by the solution designer in benchmark results.
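
The workload metrics can be sketched in the same spirit. The example below counts dense and effective synaptic operations for a single layer and timestep, and computes activation sparsity over a short trace. It is a simplified sketch under stated assumptions: the weight layout, the function names, and the per-activation rule that treats a value of exactly 1 as a binary spike are choices made for illustration, not the harness's actual procedure.

```python
# Illustrative sketch only; names and data layout are assumptions.

def synaptic_ops(weights, activations):
    """Return (Dense, Eff_MACs, Eff_ACs) for one layer and one timestep.

    weights[j][i] is the synapse from input neuron i to output neuron j.
    activations[i] is the output value of input neuron i.
    """
    n_out = len(weights)
    dense = n_out * len(activations)  # every synapse, zero or not
    eff_macs = eff_acs = 0
    for i, a in enumerate(activations):
        if a == 0:
            continue  # zero activation triggers no effective operations
        fanout = sum(1 for j in range(n_out) if weights[j][i] != 0)
        if a == 1:
            eff_acs += fanout   # assumed binary spike: accumulate only
        else:
            eff_macs += fanout  # graded activation: multiply-accumulate
    return dense, eff_macs, eff_acs


def activation_sparsity(activation_trace):
    """Fraction of zero outputs over all neurons and all timesteps."""
    values = [a for step in activation_trace for a in step]
    return sum(1 for a in values if a == 0) / len(values)


weights = [[0.5, 0.0, -1.0],
           [0.0, 0.0, 2.0]]
spikes = [1, 0, 1]          # binary activations, as in an SNN
graded = [0.3, 0.0, 0.0]    # non-binary activations, as after a ReLU

print(synaptic_ops(weights, spikes))
print(synaptic_ops(weights, graded))
print(activation_sparsity([spikes, graded]))
```

Averaging these per-timestep counts over all layers, timesteps, and test samples would give per-execution figures of the kind the metric definitions describe.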

The complexity metrics are measured independently of the underlying hardware and therefore do not explicitly correlate with post-deployment latency or energy consumption. However, they provide valuable insight into algorithm performance and resource requirements, enabling high-level comparison and facilitating prototyping. For instance, the execution rate and number of synaptic operations can be taken together to estimate the speed and dynamic power of a model deployed to certain hardware, and the footprint and connection sparsity can be used to proxy hardware resource utilization.
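
One way to picture such an estimate is a first-order model in which dynamic power equals the energy per model execution multiplied by the execution rate. The sketch below assumes this simple linear model; the per-operation energies are placeholder values chosen for illustration and do not describe any particular hardware.

```python
# Illustrative first-order estimate. The linear energy model and the
# per-operation energies below are assumptions, not measured values.

def dynamic_power_watts(eff_macs, eff_acs, rate_hz, e_mac_j, e_ac_j):
    """Estimate dynamic power from effective synaptic operations.

    eff_macs, eff_acs: average effective operations per model execution
    rate_hz:           model execution rate
    e_mac_j, e_ac_j:   assumed energy per MAC and per AC, in joules
    """
    energy_per_execution = eff_macs * e_mac_j + eff_acs * e_ac_j
    return energy_per_execution * rate_hz


# Hypothetical SNN: one million effective ACs per execution at 20 Hz,
# with a placeholder cost of 1 pJ per AC and 4 pJ per MAC.
power = dynamic_power_watts(eff_macs=0, eff_acs=1_000_000, rate_hz=20.0,
                            e_mac_j=4e-12, e_ac_j=1e-12)
print(f"{power * 1e6:.1f} uW")
```

Such an estimate ignores static power, memory access, and neuron update costs, so it is only useful for coarse comparison between algorithm candidates.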

Furthermore, the algorithm track can be extended with solution-specific secondary metrics, which can offer deeper insights by using information specific to particular types of solutions. For example, for algorithms geared towards analog hardware, noise robustness is an important solution-specific metric. In addition, approaches with complex neuron dynamics may warrant measuring the overall complexity of a neuron update (i.e., type and counts of operations necessary to simulate the update), which can be combined with the total number of neuron updates in a model pass to calculate the cost of state updates. Such solution-specific metrics are expected to be community-driven and will be included in future NeuroBench algorithm track releases.

Algorithm Track Benchmarks

The v1.0 iteration of the NeuroBench algorithm track includes four benchmarks for neuromorphic computing research. The benchmarks were chosen by the NeuroBench community to capture key ongoing challenges for neuromorphic algorithm design. The list of tasks highlights features which are relevant to neuromorphic research interests: few-shot continual learning, object detection utilizing the high dynamic range and temporal resolution of event cameras, sensorimotor decoding based on cortical signals, and low-dimensional predictive modeling useful for prototyping resource-constrained networks that are suitable for small mixed-signal systems. Benchmark tasks are listed below and summarized in Table 1. Detailed specifications of benchmark tasks are provided in the Methods section.

 
Task Dataset Correctness metric Task description
Keyword FSCIL MSWC [21] Accuracy Few-shot, continual learning of keyword classes.
Event Camera Object Detection Prophesee 1MP Automotive [22] COCO mAP Detecting automotive objects from event camera video.
NHP Motor Prediction Primate Reaching [23] R2 Predicting fingertip velocity from cortical recordings.
Chaotic Function Prediction Mackey-Glass time series [24] sMAPE Autoregressive modeling of chaotic functions.
 
Table 1: NeuroBench algorithm track v1.0 benchmarks.
  • Keyword Few-Shot Class-Incremental Learning (FSCIL) – Learning new tasks from a small number of experiences while retaining knowledge of prior tasks is a hallmark of biological intelligence and a long-standing goal of general AI [25]. Endowing edge devices with the ability to adapt to their environments and users is a particularly important instance of this challenge. This benchmark thus evaluates the capacity of a model to successively incorporate new keywords over multiple sessions (class-incremental), with only a handful of samples from the new classes to train with (few-shot). The FSCIL task is a recently established benchmark in the computer vision domain [26], but it has not yet been adapted to other data modalities. Aligning with a neuromorphic interest in temporal data modalities, this benchmark introduces a FSCIL task with streaming audio data using the large Multilingual Spoken Word Corpus (MSWC) [21] keyword classification dataset. The task is designed to be approached in two phases: pre-training and incremental learning. First, for pre-training, a set of 100 words spanning 5 base languages (English, German, Catalan, French, Kinyarwanda) with 500 training samples each is made available to train an initial model. Next, for incremental learning, the model undergoes 10 successive sessions to learn words from 10 new languages (Persian, Spanish, Russian, Welsh, Italian, Basque, Polish, Esperanto, Portuguese, Dutch) in a few-shot learning scenario. Each incremental session adds 10 words of the corresponding session language with only 5 training samples available per word. After each session, the model is tested for classification accuracy on all previously learned classes, including the 100 base pre-training classes and the few-shot-learned classes, therefore evaluating the FSCIL solution on its ability to learn new classes while retaining knowledge about the previously learned ones. Each session learns a new language, for a total knowledge base of 200 keywords by the end of the benchmark.

  • Event Camera Object Detection – Object detection is a widely-used computer vision task with applications in robotics, autonomous driving, and surveillance. Such scenarios at the edge may require high energy efficiency and real-time performance, which can be achieved via event-based vision sensors [27]. The event camera object detection benchmark uses the Prophesee 1 Megapixel automotive detection dataset [22], a large labeled object detection dataset with over 15 hours of event camera video from the front of a car driving in various scenarios. Predetermined training, validation, and testing splits include 11.2 h, 2.2 h, and 2.2 h of recording, respectively. Pedestrian, two-wheeler, and car object classes are used in evaluation, and correctness is measured using COCO mean average precision (mAP) [28].

  • Non-human Primate (NHP) Motor Prediction – Studying models which can accurately replicate features of biological computation presents opportunities in understanding sensorimotor behavior and developing closed-loop methods for future robotic agents. It also is foundational to the development of wearable or implantable neuro-prosthetic devices that can accurately generate motor activity from neural or muscle signals. This benchmark utilizes a dataset consisting of multi-channel recordings from the sensorimotor cortex of two non-human primates (NHP Indy and NHP Loco) during reaching movements, along with corresponding fingertip motion of the reach [23]. Six total sessions are included from the dataset, for a total of 8712 seconds of data. The task is to train a model to predict the two-dimensional components of finger velocity using recent neural data. The sessions are treated independently (i.e., models are trained separately for each session), and the data is split to allow the first 75% for training and validation and the last 25% for evaluation. Correctness of the predictions is evaluated by the coefficient of determination ($R^2$) score against the true finger velocity targets, averaged over all six sessions.

  • Chaotic Function Prediction – The real-world data benchmarks presented thus far are high-dimensional and can require large networks to achieve high accuracy, raising challenges for solution types with limited I/O support and network capacity, such as mixed-signal edge prototype solutions. To address this, we include a synthetic benchmark based on prediction of one-dimensional Mackey-Glass time series [24], which can be effectively tackled by smaller networks. Mackey-Glass has been widely adopted as a benchmark for evaluating temporal predictors, including neuromorphic models [29, 30, 31]. The task involves prediction of the next timestep value $f(t+\Delta t)$ given the current timestep value $f(t)$. The model is trained and validated using the first half of the time series, during which the ground-truth state $f(t)$ is supplied to the model to predict the next timestep value $f'(t+\Delta t)$. During evaluation, the model uses its prior prediction $f'(t)$ to generate each next value $f'(t+\Delta t)$, autoregressively forecasting the second half of the time series. Correctness is measured using symmetric mean absolute percentage error (sMAPE) of the generated time series against the target time series, a standard metric in forecasting [32]. The benchmark includes a set of 14 Mackey-Glass time series, which vary by the equation parameter $\tau$, the delay constant. Lyapunov time $L$, the expected predictability timescale for chaos [33], is used as the time unit for each time series. The total length of each series is 20 Lyapunov times, and 75 points are sampled per Lyapunov time ($\Delta t = L/75$).
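
The autoregressive evaluation protocol of the chaotic function prediction task can be sketched as follows. This is a minimal illustration under stated assumptions: the series used is a simple stand-in rather than a Mackey-Glass series, the "model" is a hypothetical persistence predictor, and the sMAPE function uses a common 0 to 200 percent form, which may differ in detail from the formulation given in the Methods section.

```python
# Illustrative sketch only. The stand-in series, the persistence "model",
# and the exact sMAPE form are assumptions made for this example.

def smape(targets, predictions):
    """Symmetric mean absolute percentage error, in the 0-200 percent form."""
    terms = []
    for y, p in zip(targets, predictions):
        denom = abs(y) + abs(p)
        terms.append(0.0 if denom == 0 else abs(y - p) / denom)
    return 200.0 * sum(terms) / len(terms)


def autoregressive_forecast(step_fn, seed_value, n_steps):
    """Roll a one-step predictor forward, feeding back its own outputs."""
    predictions = []
    x = seed_value
    for _ in range(n_steps):
        x = step_fn(x)          # f'(t + dt) computed from f'(t)
        predictions.append(x)
    return predictions


POINTS_PER_LYAPUNOV_TIME = 75
TOTAL_LYAPUNOV_TIMES = 20
n_points = POINTS_PER_LYAPUNOV_TIME * TOTAL_LYAPUNOV_TIMES
n_train = n_points // 2  # first half is used for training and validation

# Stand-in series (NOT Mackey-Glass), used only to exercise the protocol.
series = [1.0 + 0.001 * k for k in range(n_points)]
train, test = series[:n_train], series[n_train:]


def persistence(x):
    """Hypothetical trivial model: predict that the value does not change."""
    return x


# Evaluation: seed with the last ground-truth value, then forecast the
# second half using only the model's own previous predictions.
forecast = autoregressive_forecast(persistence, train[-1], len(test))
print(len(forecast), smape(test, forecast))
```

During training the model would instead receive the ground-truth value at every step; only the evaluation loop feeds predictions back in.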

Algorithm Track Benchmark Harness

The NeuroBench algorithm benchmarks are wrapped in a harness which standardizes the benchmark interfaces. The harness provides benchmark users with a consistent framework for loading data, processing data and model outputs, and calculating and reporting metrics, thereby ensuring fair and standard comparisons of the results. It is built with straightforward interfaces which are designed to be extended with new frameworks, algorithms, and tasks. The benchmark harness is open-source for use and development (https://github.com/NeuroBench/neurobench).

The components of the algorithm benchmark harness are summarized in Figure 2. Datasets are loaded in a common format and pass through Processors to be pre-processed. The Model generates predictions based on the processed data, and Accumulators post-process the predictions, for instance to accumulate spikes and transform to labels. Static metrics of algorithm footprint and connection sparsity are calculated via model analysis, while metrics of correctness, activation sparsity, and synaptic operations are calculated using predictions and model execution traces. For benchmark users, task evaluation simply involves utilizing the existing dataloaders, processors, and metrics within the harness and wrapping their own code to fit the standard interfaces.
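
The data flow described above can be pictured with a small stand-alone sketch. The class and function names below (`ToyHarness`, `scale`, `toy_model`, `spike_count_argmax`) are hypothetical and written for this illustration; they are not the interfaces of the open-source NeuroBench harness, whose actual API should be taken from its repository.

```python
# Illustrative sketch of the pipeline: data -> processors -> model ->
# accumulators -> metrics. All names are hypothetical, not NeuroBench APIs.

class ToyHarness:
    def __init__(self, model, dataloader, processors, accumulators, metrics):
        self.model = model
        self.dataloader = dataloader
        self.processors = processors
        self.accumulators = accumulators
        self.metrics = metrics  # dict: name -> fn(predictions, targets)

    def run(self):
        predictions, targets = [], []
        for data, target in self.dataloader:
            for process in self.processors:      # pre-processing
                data = process(data)
            output = self.model(data)            # model inference
            for accumulate in self.accumulators:  # post-processing
                output = accumulate(output)
            predictions.append(output)
            targets.append(target)
        return {name: fn(predictions, targets)
                for name, fn in self.metrics.items()}


def scale(sample):
    """Processor: normalize features by their maximum."""
    peak = max(sample) or 1.0
    return [x / peak for x in sample]


def toy_model(sample):
    """Stand-in spiking model: one spike per class and threshold crossed."""
    thresholds = [0.25, 0.5, 0.75]  # one threshold per timestep
    return [[1 if x > th else 0 for x in sample] for th in thresholds]


def spike_count_argmax(spike_train):
    """Accumulator: sum spikes per class over time and pick the largest."""
    counts = [sum(column) for column in zip(*spike_train)]
    return counts.index(max(counts))


def accuracy(predictions, targets):
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


dataloader = [([2.0, 8.0], 1), ([9.0, 3.0], 0), ([1.0, 1.0], 1)]
harness = ToyHarness(toy_model, dataloader, [scale],
                     [spike_count_argmax], {"accuracy": accuracy})
results = harness.run()
print(results)
```

The point of the structure is that a user only swaps in their own model and, if needed, their own processors and accumulators, while data loading and metric computation stay fixed across submissions.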

Currently, the harness and all baseline models are built using PyTorch [34] or frameworks based on it, such as snnTorch [18] and SpikingJelly [35]. Due to its modular structure and simple interfaces, the harness can grow to be compatible with further neuromorphic tools such as Lava [36] and Fugu [37]. Furthermore, it also supports the extension of data and metric pipelines in order to implement additional benchmark tasks. Benchmarks outside of the NeuroBench v1.0 suite can make use of the harness infrastructure for open reproducibility, and also to garner interest in the community towards inclusion in a future NeuroBench version, through which the task will have long-term support and appear in affiliated leaderboards and workshop events.

Algorithm Track Limitations and Further Extensions

Before diving into the baseline results, it is worth discussing several possible improvements to the NeuroBench algorithm track framework in its current form. Specifically, the initial iteration of metrics is restricted to the assumption of digital, time-stepped algorithm execution. While complexity analysis of such prototypes can serve as an intermediate step for solutions intended for analog or continuous time deployment, the metric measurements are not yet defined for those execution settings. Informed by further benchmark implementations, future versions of NeuroBench will extend inclusiveness by expanding measurement protocols to include such algorithms.

Furthermore, the synaptic operations metric, intended to capture model computation cost, currently does not account for neuron updates. The dynamics of neuron models, including mechanisms like leakage and reset, can vary heavily in complexity. However, counting the number and type of operations from neuron updates, as well as estimating their overall costs, depends on the specific arithmetic or circuit implementation. Thus, they are not accounted for in the broader algorithmic complexity metrics. Solution-specific metrics that assume a particular implementation platform, as have been defined previously [38], can be used to estimate neuron update costs. These estimates can then be combined with the total number of neuron updates per model computation to measure overall neuron operation complexity during evaluation.

Data pre- and post-processing can also amount to significant costs not yet captured in the NeuroBench algorithm track metrics. Such costs are, however, captured in the deployed metrics of the system track, which accounts for data processing hardware as part of the overall system during performance and efficiency measurements. Data processing metrics will be added as a separate complexity category for the algorithm track benchmark in the future.

The v1.0 algorithm track benchmark suite is also intended to expand in the future. This could include covering further data modalities such as inertial measurement unit (IMU) sensing [39] and extending to closed-loop sensorimotor tasks to demonstrate embodied intelligence. As with the initial benchmarks, further tasks will undergo approval and development by the open NeuroBench community before being included in a future versioned benchmark suite.

Algorithm Track Baseline Results

In our first iteration of the algorithmic track, we report baseline algorithm performance on each benchmark using various model architectures, including artificial neural networks commonly used in deep learning, spiking neural networks, and reservoir networks. We evaluate each benchmark with two substantially different algorithm baselines. From these evaluations, we extract baseline comparisons, identify trends, and uncover motivations for future research. Except for the event camera object detection task, each benchmark utilizes a novel data split, and all tasks use novel metric measurement. The presented baselines are a snapshot of the solution search space and will be starting points for leaderboards, thereby calling for further research to push the state of the art for each task. Detailed specifications of each of the baselines can be found in the Methods section.

Keyword FSCIL

| Baseline | Accuracy (Base / Session Avg) | Footprint (bytes) | Model Exec. Rate (Hz) | Connection Sparsity | Activation Sparsity | Dense SynOps (per model exec.) | Eff_MACs (per model exec.) | Eff_ACs (per model exec.) |
|---|---|---|---|---|---|---|---|---|
| M5 ANN | 97.09% / 89.27% | $6.03\times 10^{6}$ | 1 | 0.0 | 0.783 | $2.59\times 10^{7}$ | $7.85\times 10^{6}$ | 0 |
| SNN | 93.48% / 75.27% | $1.36\times 10^{7}$ | 200 | 0.0 | 0.916 | $3.39\times 10^{6}$ | 0 | $3.65\times 10^{5}$ |

Table 2: Baseline results for the keyword few-shot class-incremental learning task. Base accuracy refers to accuracy on the 100 base classes after pre-training, while session average accuracy is the average accuracy over all sessions for the corresponding prototypical baseline. The detailed accuracy per session for the different baselines is shown in Figure 3.

The keyword FSCIL task has an ANN and SNN baseline, using different model architectures:

  • M5 ANN – The ANN baseline uses a tuned version of the M5 deep convolutional network architecture [40], with samples pre-processed into Mel-frequency cepstral coefficients (MFCC). The network contains four successive convolution-normalization-pooling layers, followed by a readout fully-connected layer. Each model execution (forward pass) uses the data from the full pre-processed sample, and convolution kernels are applied over the temporal dimension of the samples. This is reported as a 1 Hz model execution rate.

  • SNN – The SNN baseline uses a recurrent SNN with adaptive leaky integrate-and-fire (LIF) neurons and heterogeneous time constants [41]. The SNN consists of two recurrent adaptive LIF layers and one linear output layer. Audio samples are pre-processed to binary spike trains using Speech2Spikes [42], which relies on a Mel Spectrogram with the same parameters as the MFCC of the ANN baseline. Each input timestep to the model represents 5 ms of audio data, thus the model has a 200 Hz model execution rate. Output neuron activations are summed over time to produce the word class prediction.

After pre-training using standard batched training, the ANN and SNN baseline networks reach high accuracies on the base classes of 97.09% and 93.48%, respectively. As reported by the model execution rate metric, the SNN baseline computes each sample over 200 passes, using an order of magnitude fewer effective AC synaptic operations compared to the ANN baseline’s effective MACs per model execution. Considering both the model execution rate and synaptic operation metrics, the number of aggregated ACs over the length of the sample ($200 \times 3.65\times 10^{5} = 7.30\times 10^{7}$) exceeds the Dense and effective MAC operations necessary for the ANN baseline, which spatially flattens the sample and processes it in one model execution. However, outside of the static-length keyword classification scenario, the low-cost per-execution temporal processing of SNNs can enable efficient, always-on, high-frequency prediction capabilities in deployed continuous audio recognition scenarios.
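The comparison above can be checked with a short calculation using only the values reported in Table 2. The variable names are illustrative, and the 200 executions per sample are taken from the text.

```python
# Values copied from Table 2 (synaptic operations per model execution).
snn_executions_per_sample = 200   # SNN passes per sample, as stated in the text
snn_eff_acs = 3.65e5              # effective ACs per SNN execution
ann_dense = 2.59e7                # dense operations per ANN execution
ann_eff_macs = 7.85e6             # effective MACs per ANN execution

# Per execution, the SNN needs roughly 20x fewer effective operations.
per_execution_ratio = ann_eff_macs / snn_eff_acs

# Aggregated over one sample, the SNN total exceeds both ANN counts.
snn_acs_per_sample = snn_executions_per_sample * snn_eff_acs

print(f"per-execution ratio (ANN MACs / SNN ACs): {per_execution_ratio:.1f}")
print(f"SNN ACs aggregated per sample: {snn_acs_per_sample:.2e}")
```

Note that this tally counts an AC and a MAC as one operation each; the two have different hardware costs, so the aggregate is a count of operations rather than an energy estimate.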

Figure 3: Test accuracy per session on the keyword FSCIL task for prototypical and frozen baselines, with the accuracy on both base classes and incrementally-learned classes (left), and accuracy on all incrementally-learned classes only (right). Incremental session 0 refers to the accuracy on base classes after pre-training only. Shaded area represents 5th and 95th percentile on 100 runs. Frozen baselines with no adaptation do not learn incremental classes and thus have a fixed 0% accuracy for New Classes Performance.

We present two approaches for the incremental stage for both the ANN and SNN baselines. The frozen models are locked after pre-training on base classes and have 0% accuracy on all new incremental classes, providing a reference for models with no learning or catastrophic forgetting of prior classes. The prototypical models employ a prototypical network [43] for incremental learning, which is a feature-based clustering approach that can be implemented with a simple linear readout layer on top of the pre-trained network backbone. Prototypical weights and biases of prior and incremental classes are directly defined based on the average features of the corresponding class and directly substitute pre-trained readout layer parameters. The complexity results in Table 2 thus empirically apply to both the frozen and prototypical models.
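As a minimal sketch of how class-average features can define a linear readout, the example below uses one standard construction: weights $w_c = 2\mu_c$ and bias $b_c = -\lVert\mu_c\rVert^2$, which makes the linear score equivalent to nearest-prototype classification under squared Euclidean distance. This particular mapping is an assumption for illustration; the exact definition used by the baselines is given in the Methods section. All function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def class_prototypes(
    features: Sequence[Sequence[float]], labels: Sequence[int]
) -> Dict[int, List[float]]:
    """Average the backbone feature vectors of each class."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for x, c in zip(features, labels):
        if c not in sums:
            sums[c] = [0.0] * len(x)
            counts[c] = 0
        sums[c] = [s + v for s, v in zip(sums[c], x)]
        counts[c] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def prototype_readout(
    prototypes: Dict[int, List[float]]
) -> Dict[int, Tuple[List[float], float]]:
    """Assumed mapping: w = 2 * mu and b = -||mu||^2 for each class."""
    return {
        c: ([2.0 * m for m in mu], -sum(m * m for m in mu))
        for c, mu in prototypes.items()
    }


def classify(x: Sequence[float], readout: Dict[int, Tuple[List[float], float]]) -> int:
    """Linear readout: pick the class with the highest score w.x + b."""
    scores = {
        c: sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for c, (w, b) in readout.items()
    }
    return max(scores, key=scores.get)


# Toy features for two classes; a new class only requires a new prototype.
toy_features = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
toy_labels = [0, 0, 1, 1]
toy_prototypes = class_prototypes(toy_features, toy_labels)
toy_readout = prototype_readout(toy_prototypes)
```

Because adding a class only appends one weight row and one bias, the backbone is untouched during incremental sessions, which is consistent with the complexity results applying to both frozen and prototypical models.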

The test accuracy for the baseline models over all sessions, as well as the test accuracy on only the new incrementally-learned classes, is shown in Figure 3. Using prototypical networks, the ANN model reaches 89.27% accuracy on average over all sessions, outperforming the frozen model by 21.41 accuracy points. The accuracy on new classes, averaged over all incremental sessions, is 79.61%. The SNN prototypical baseline, on the other hand, reaches 75.27% accuracy on average over all sessions, surpassing the frozen SNN performance by 9.97 accuracy points, with an average accuracy on new classes over all sessions of 57.23%.

The accuracy loss over the incremental sessions is similar between the ANN and SNN prototypical baselines. However, the lower overall accuracy of the SNN is largely due to the conversion from the original backpropagation-trained readout classifier, which is used in the frozen baseline, to the prototype readout classifier. On the base classes (session 0 in Figure 3), the ANN sees a drop of 2.37% between the frozen and prototypical baselines, while the SNN has a larger drop of 9.17%. The larger drop indicates that our particular SNN baseline has a less general feature extraction than the ANN. This may be due to the challenges of backpropagation through time for online temporal inference to learn to extract long-term temporal keyword features with the chosen spiking recurrent model. Additionally, the Speech2Spikes [42] pre-processing algorithm converting audio to spikes may also cause information loss. Overall, the keyword FSCIL benchmark presents opportunities for further research in learning methods, preprocessing, and model architectures for continual learning of temporal data.

Event Camera Object Detection

| Baseline | mAP | Footprint (bytes) | Model Exec. Rate (Hz) | Connection Sparsity | Activation Sparsity | Dense SynOps (per model exec.) | Eff_MACs (per model exec.) | Eff_ACs (per model exec.) |
|---|---|---|---|---|---|---|---|---|
| RED ANN | 0.429 | $9.13\times 10^{7}$ | 20 | 0.0 | 0.634 | $2.84\times 10^{11}$ | $2.48\times 10^{11}$ | 0 |
| Hybrid | 0.271 | $1.21\times 10^{7}$ | 20 | 0.0 | 0.613 | $9.85\times 10^{10}$ | $3.76\times 10^{10}$ | $5.60\times 10^{8}$ |

Table 3: Baseline results for the event camera object detection task.

The event camera object detection task reports a prior baseline, the RED ANN, and a novel conversion of the architecture to a hybrid ANN-SNN model:

  • RED ANN – The RED architecture [22] consists of blocks of feed-forward squeeze-and-excite [44] convolutional layers followed by blocks of recurrent convolution-LSTM (ConvLSTM [45]) layers. A single-shot detection (SSD [46]) head is used to predict the location and class of the bounding box based on multi-scale outputs from the recurrent layers. Raw event data is binned into 50 ms and pre-processed into time surfaces.

  • Hybrid – The hybrid ANN-SNN architecture adopts feedforward LIF spiking neural layers to replace the ConvLSTM layers in RED, and shares the same feed-forward convolutional blocks as the RED. It uses the same input encoding method and SSD head as the RED model.

Results for the two networks can be found in Table 3. The RED ANN represents the current state-of-the-art correctness on the benchmark, at 0.429 mAP. The Hybrid network is a smaller network, reflected by the footprint and synaptic operations metrics measuring an order of magnitude smaller than for the RED ANN. The smaller size comes at the expense of lower correctness of 0.271 mAP.

For the RED ANN, the activation sparsity metric (0.634) represents zero activations by the ReLU function for each neuron. From this, one may expect that the number of effective operations (operations with a nonzero activation and nonzero weight) would be around 35% of dense operations; however, the actual ratio is 87%. This is due to the presence of normalization layers applied to activations before synaptic weight multiplication. Furthermore, neurons with lower activation frequency in the network tend to have a smaller fanout than neurons with high activation frequency. Thus, while activation sparsity alone can provide a proxy for the cost of the network, architectural characteristics may impede actual computation reduction, and the synaptic operations must be considered in tandem.

The Hybrid network demonstrates a significant reduction in total effective operations against dense operations, outlining significant gains if deployed on specialized sparsity-aware hardware. However, for the particular network, the number of effective ACs, generated by the spiking neuron components, is two orders of magnitude smaller than the number of effective MACs within the ANN components. Such a hybrid network may not warrant specialized accumulation units, and the baseline motivates further research in hybrid networks with a larger proportion of spiking neuron activity compared to artificial neuron activity.
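The ratios discussed in the two paragraphs above can be recomputed directly from Table 3. The sketch below uses only the tabulated values; the dictionary layout and function names are illustrative.

```python
# Rows copied from Table 3 (synaptic operations per model execution).
red_ann = {
    "activation_sparsity": 0.634,
    "dense": 2.84e11,
    "eff_macs": 2.48e11,
    "eff_acs": 0.0,
}
hybrid = {
    "activation_sparsity": 0.613,
    "dense": 9.85e10,
    "eff_macs": 3.76e10,
    "eff_acs": 5.60e8,
}


def naive_effective_ratio(row: dict) -> float:
    """Ratio one might expect if every zero activation saved its operations."""
    return 1.0 - row["activation_sparsity"]


def measured_effective_ratio(row: dict) -> float:
    """Measured effective operations (MACs + ACs) relative to dense operations."""
    return (row["eff_macs"] + row["eff_acs"]) / row["dense"]


for name, row in (("RED ANN", red_ann), ("Hybrid", hybrid)):
    print(
        f"{name}: naive {naive_effective_ratio(row):.3f}, "
        f"measured {measured_effective_ratio(row):.3f}"
    )

# Share of spiking (AC) work inside the Hybrid network.
hybrid_mac_to_ac = hybrid["eff_macs"] / hybrid["eff_acs"]
```

For the RED ANN the measured ratio is far above the naive estimate, whereas for the Hybrid network the two are close, which matches the observation that activation sparsity alone is an unreliable proxy for computation cost.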

NHP Motor Prediction

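| Baseline | Data | $R^{2}$ | Footprint (bytes) | Model Exec. Rate (Hz) | Connection Sparsity | Activation Sparsity | Dense SynOps (per model exec.) | Eff_MACs (per model exec.) | Eff_ACs (per model exec.) |
|---|---|---|---|---|---|---|---|---|---|
| ANN | NHP Indy | 0.593 | 20824 | 250 | 0.0 | 0.683 | 4704 | 3836 | 0 |
| ANN | NHP Loco | 0.558 | 33496 | 250 | 0.0 | 0.668 | 7776 | 6103 | 0 |
| SNN | NHP Indy | 0.593 | 19648 | 250 | 0.0 | 0.997 | 4900 | 0 | 276 |
| SNN | NHP Loco | 0.568 | 38848 | 250 | 0.0 | 0.999 | 9700 | 0 | 551 |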

Table 4: Baseline results for the NHP motor prediction task, for NHP Indy (96-channel data, top), and NHP Loco (192-channel data, bottom).

Small fully-connected, feedforward networks were developed for the NHP motor prediction baselines:

  • ANN – In the ANN baseline, the cortical activity from the 50 most recent data samples is buffered to be used as network input. The network has two hidden layers and 2 final outputs predicting $X$ and $Y$ velocities, with a fully-connected topology of $N_{ch}$-32-48-2, where $N_{ch}$ refers to the channels of cortical data (96 for NHP Indy, and 192 for NHP Loco). Batch normalization is applied after each hidden layer.

  • SNN – The SNN uses the data samples directly as input to the network, without buffering. It has a hidden layer of 50 LIF neurons, for a fully connected topology of $N_{ch}$-50-2 LIF neurons. The output neurons do not have a reset mechanism, and the membrane potential is directly read to produce the output velocities.

Table 4 shows the results for the ANN and SNN baselines, averaged between sessions from each NHP (Indy and Loco). The ANN and SNN are similar in footprint size and number of dense operations per model forward pass, and also reach comparable prediction quality based on $R^{2}$ score. Each model is small in footprint and operation count, demonstrating that this task can be solved by shallow edge networks, validating prior studies [47].
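For these fully-connected baselines, the dense synaptic operation counts in Table 4 can be reproduced by counting one operation per weight, that is, by summing the products of consecutive layer sizes in the stated topologies. The helper below is an illustrative sketch (the name `dense_synops` is hypothetical), and it is an observation about the tabulated numbers rather than a description of how the harness measures them.

```python
from typing import Sequence


def dense_synops(layer_sizes: Sequence[int]) -> int:
    """One synaptic operation per weight of a fully-connected network."""
    return sum(fan_in * fan_out for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))


# Topologies from the text: N_ch-32-48-2 (ANN) and N_ch-50-2 (SNN).
CHANNELS = {"NHP Indy": 96, "NHP Loco": 192}

for name, n_ch in CHANNELS.items():
    ann_ops = dense_synops([n_ch, 32, 48, 2])
    snn_ops = dense_synops([n_ch, 50, 2])
    print(f"{name}: ANN dense = {ann_ops}, SNN dense = {snn_ops}")
```

The resulting counts agree with the Dense column of Table 4 for both datasets, which illustrates why the two baselines are described as similar in dense operations per forward pass.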

Between the baselines, the SNN realizes similar correctness at significantly reduced complexity compared to the ANN. Extremely high activation sparsity in the SNN (0.998) directly translates to low effective accumulate operations, demonstrating the adequacy of stateful, binary-activation neuron models for sparse regression tasks. Meanwhile, similarly to the RED ANN in the event camera object detection task, activation sparsity in the ANN baseline does not translate to effective operation efficiency, as batch normalization is applied to activations before multiplication with synaptic weights.

Figure 4: Footprint and effective synaptic operations vs $R^{2}$, for four task baselines. Each model has two points: the solid marker represents NHP Indy, and the hollow marker represents NHP Loco.

We conduct further exploration for increasing task accuracies with more complex ANN and SNN models: ANN_Flat and SNN_Flat. For these networks, 50 data samples of buffered input are split into $n_{p}=7$ accumulated bins. For ANN_Flat, the 7 bins are spatially flattened as input to the network, so its topology is $(7\times N_{ch})$-32-48-2. SNN_Flat uses the $N_{ch}$-32-48-2 topology, and the 7 bins are temporally flattened as input, presented to the network as separate input timesteps. Each prediction still uses the membrane potential of the output neurons after input timesteps, and the network is reset for each prediction. Layer normalization is also applied on the SNN_Flat inputs.

Figure 4 shows plots of complexity and predictive quality of all four baseline networks. Both flattened networks demonstrate significantly greater $R^{2}$ performance than the other two networks. However, the larger input dimension of the ANN_Flat network is reflected in its greater footprint, and the increased model timesteps and layer normalization sharply increase the effective operations of SNN_Flat by two orders of magnitude compared to the simpler SNN. Thus, while input flattening and normalization increase the quality of model predictions for ANNs and SNNs, each comes with a significant complexity trade-off.

Chaotic Function Prediction

| Baseline | sMAPE | Footprint (bytes) | Model Exec. Rate (Hz) | Connection Sparsity | Activation Sparsity | Dense SynOps (per model exec.) | Eff_MACs (per model exec.) | Eff_ACs (per model exec.) |
|---|---|---|---|---|---|---|---|---|
| ESN | 14.79 | $2.81\times 10^{5}$ | - | 0.876 | 0.0 | $3.52\times 10^{4}$ | $4.37\times 10^{3}$ | 0 |
| LSTM | 13.37 | $4.90\times 10^{5}$ | - | 0.0 | 0.530 | $6.03\times 10^{4}$ | $6.03\times 10^{4}$ | 0 |

Table 5: Baseline results for the chaotic function prediction task. Execution rate is not reported as the data is a synthetic time series, with no real-time correlation.

The chaotic function prediction task has two recurrent ANN baselines, which feature distinct network architectures:

  • Long short-term memory (LSTM) – LSTMs are a class of recurrent ANN architectures [48], utilizing multiple gates for selective retention or omission of past information. The LSTM baseline consists of a single LSTM with a hidden state of 100 neurons, followed by a feed-forward layer to produce single-dimension output predictions. In addition, the LSTM baseline utilizes explicit memory by buffering 50 previous datapoints, spatially flattening them into 50 input channels.

  • Echo state network (ESN) – ESNs are randomized recurrent ANNs that belong to a class of algorithms known collectively as reservoir computing [49], featuring more biologically-inspired principles than LSTMs despite not being spiking networks. Standard ESNs have only one hidden layer (the reservoir), where synaptic connections projecting input data to the hidden layer and recurrent synaptic connections within the hidden layer are chosen randomly and stay fixed during the training. The model architecture for the ESN baseline has two neurons in the input layer, which projects the Mackey-Glass function input and additional constant bias input into a hidden layer of 186 neurons. Within the hidden layer, the probability of recurrent connections is set to 0.11.

The LSTM and ESN models were evaluated on a Mackey-Glass time series with $\tau=17$. The model is evaluated over 30 instantiations of the system; in each instance the start point is shifted forward by half of the Lyapunov time. The model is re-initialized and re-trained on each instance, and the results are averaged over all 30 instances.

The ESN model is architecturally unique compared to the other ANN and SNN baselines. The connection sparsity metric ($0.876$) reflects the high number of zero-weight connections across its reservoir hidden layer. Due to this sparsity, hardware with support for sparse synaptic representation by ignoring zero weights would require less memory to represent the network, thus decreasing the deployed footprint of the model. The high connection sparsity of the ESN leads to a significant reduction in synaptic operations: the ESN uses an order of magnitude fewer effective operations ($4.37\times 10^{3}$) than the LSTM ($6.03\times 10^{4}$), while achieving comparable sMAPE. The activation sparsity of the ESN is 0 due to neurons using $\tanh(\cdot)$, rather than ReLU, activations.
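The link between the stated ESN architecture and its connection sparsity can be approximated with a back-of-the-envelope calculation. The sketch below assumes dense input connections (2 inputs to 186 reservoir neurons), a recurrent matrix whose expected fraction of nonzero weights equals the connection probability 0.11, and a dense readout from the reservoir to a single output. The readout layout is an assumption not stated in this section, and the expected count ignores the randomness of any particular reservoir instance.

```python
# Architecture values taken from the text.
n_inputs = 2            # function input plus constant bias input
n_reservoir = 186
p_recurrent = 0.11      # probability of a recurrent connection

# Assumption: one output neuron densely connected to the reservoir.
n_outputs = 1

input_weights = n_inputs * n_reservoir
recurrent_weights = n_reservoir * n_reservoir
readout_weights = n_reservoir * n_outputs

# All possible connections (dense count) versus expected nonzero connections.
dense_connections = input_weights + recurrent_weights + readout_weights
expected_nonzero = input_weights + p_recurrent * recurrent_weights + readout_weights

connection_sparsity = 1.0 - expected_nonzero / dense_connections

print(f"dense connections:    {dense_connections}")
print(f"expected nonzero:     {expected_nonzero:.0f}")
print(f"connection sparsity:  {connection_sparsity:.3f}")
```

Under these assumptions the estimate lands close to the Dense, Eff_MACs, and Connection Sparsity values in Table 5. Since the ESN has zero activation sparsity, every nonzero weight contributes an effective MAC, so the effective operation count is governed by connection sparsity alone.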

Figure 5: ESN and LSTM models evaluated on varying Mackey-Glass time series using a constant set of hyperparameters.

Furthermore, we show the generalization and robustness capabilities of the particular ESN and LSTM models by applying them, with fixed hyperparameter sets, to other Mackey-Glass time series. Figure 5 shows the sMAPE score of the models over varied time series with the $\tau$ Mackey-Glass parameter varying between 17 and 30. The models were trained independently for each time series. As the Mackey-Glass $\tau$ parameter characterizes the time-delay of the system, its increase roughly corresponds to prediction difficulty, shown by the increasing sMAPE trend through the plot. Notably, the LSTM maintains an error that is relatively lower than that of the ESN for all $\tau>18$. However, the LSTM uses explicit memory via input buffering, so it is conjectured that the historical data allows for greater robustness to the varying time series characteristics. The ESN uses only one previous timestep, so its memory is only implicitly retained within its hidden layer. While the ESN tunes well to the $\tau=17$ case and demonstrates greatly reduced effective operations compared to the LSTM, the same set of hyperparameters does not generalize as well to other time series. Further research is motivated in explicit memory buffers versus implicit memory within the network state for trade-offs in single-series forecasting performance, complexity, and generalization capability.

Discussion and Opportunities for Further Research

Baseline results for the four v1.0 algorithm track tasks compare the correctness and complexities of various solution types. Compared to ANNs, SNNs and ESNs demonstrate complexity advantages such as smaller footprints, high sparsity, and accumulate rather than multiply-and-accumulate operations. Especially on the motor prediction and chaotic function prediction regression tasks, the SNN and ESN baselines already achieve competitive correctness at lower complexity than the ANN and LSTM counterparts. Further research opportunities in model architectures, data pre-processing and buffering, and training paradigms to achieve greater performance are enabled by the standard framework and tooling provided by NeuroBench.

System Track Benchmark Framework

While the algorithm track aims to benchmark solutions in a system-independent manner via complexity analysis, the NeuroBench system track aims to evaluate deployed latency, throughput, and efficiency of systems comprised of an algorithm deployed to a hardware platform. In order for the hallmarks of neuromorphic hardware to be aptly judged against conventional systems and foster the expansion of neuromorphic solutions, fair comparisons must be made between sufficiently mature neuromorphic systems and conventional systems solving the same tasks.

Figure 6: Types of neuromorphic systems at various integration scales.

A key challenge for benchmarking neuromorphic hardware is that systems are implemented and deployed at vastly different scales to serve diverse applications, from cloud services (e.g. multi-chip platforms like Loihi [50] and SpiNNaker [51]) to embedded sensing intelligence (e.g. Speck [52] and SNP [53, 54]). This range is visualized in Figure 6. Existing benchmarks for conventional machine learning systems separate submissions between datacenter-level computing [19] and embedded processing [55]. Thus, rather than pursuing a one-size-fits-all suite of tasks for neuromorphic systems, the goal of the NeuroBench system track is to develop benchmarks at various scales and use cases, united under a common set of guidelines and measurement methodology, thereby embracing the diversity of neuromorphic approaches.

In this section, we present the system track guidelines outlining the metrics, tasks, and harness components, representing collective design between multiple owners and vendors of neuromorphic hardware. As the system track benchmark tooling is currently under development, insight from the algorithm track’s early results will be exploited towards a future release of the detailed harness documentation, benchmark procedure, and baselines for the system track by the end of 2024.

System Track Metrics

In order to be representative of the properties of a deployed system, the system benchmarks, like the algorithm benchmarks, are assessed at the task level for the overall system, as opposed to operation or kernel level assessment of individual components. Task-level benchmarks enable straightforward comparison between systems of any type with regard to their abilities to solve problems, and the overall system-level measurement describes the realistic capability and efficiency of a whole solution. For ease of comparison and benchmark result analysis, key features of the NeuroBench system track are consistent with features from the widely-adopted machine learning system benchmark MLPerf Inference [19].

Task Scenarios.

Where applicable, the NeuroBench system track will utilize benchmark task scenarios from MLPerf. MLPerf describes task scenarios under which system benchmarks are presented to the system under test (SUT). For large-scale batched processing, MLPerf defines the Offline and Server scenarios, which are latency-unconstrained and latency-constrained, respectively, with performance measured in terms of prediction throughput. For batch-size-1 processing, MLPerf defines Single-stream scenarios, for which performance is measured in terms of prediction latency.

However, the MLPerf task scenarios all consider data as discrete samples, and therefore do not fit with some key neuromorphic system applications. The NeuroBench system track thus defines two additional task scenarios: Real-time and Optimization. Real-time uses continuous data streams (e.g., from an event camera), which must be processed online by the SUT, in contrast with the Single-stream scenario, where samples are sent only once the SUT has completed processing. Optimization applications use heuristic methods for otherwise intractable problems, and thus do not have notions of sample throughput or sample latency. The Optimization scenario benchmarks will report latency for the SUT to reach multiple correctness thresholds, as many approaches find successively improved solutions over time.

For each task and scenario, the following metric categories are reported:

  • Correctness – Due to the tight coupling between an algorithm and its system implementation in many existing neuromorphic hardware solutions, the particular model used to solve the benchmark task is unconstrained. Therefore, correctness must be measured to verify the validity of the solution. No correctness thresholds are imposed on submissions, but the benchmark leaderboard will impose tiers of solution correctness on submissions to evaluate accuracy-efficiency trade-offs of system approaches.

  • Performance – Depending on the task scenario, the performance of the system is measured differently. The Offline and Server scenarios report throughput, while the Single-stream scenario reports latency. The Real-time scenario similarly reports latency, which for a large portion of execution (e.g., 90%, defined per-benchmark) must not exceed a real-time threshold, or else the submission is not considered valid. The Optimization scenario reports the time to solution at various solution quality thresholds, defined per-benchmark.

  • Efficiency – Conventional system benchmarks such as TOP500 [56] for HPC and MLPerf Inference [19] for deep learning do not require power measurement submission in the main benchmark, instead allowing for separate submissions to an adjacent power track (Green500 [57] and MLPerf Power [58], respectively). Not only has efficiency been usually considered as a second-order metric for conventional systems, it is also notoriously difficult to precisely measure. However, as energy efficiency is a key hallmark of biology and thus is a focus of neuromorphic research, power and energy consumption must be first-order metrics in the NeuroBench system track.

    The Server, Offline, and Real-time scenarios report average power as their batched and online processing is continuous, the Single-stream scenario reports the energy per-sample, and the Optimization scenario reports total energy consumed at each solution quality threshold.
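The per-scenario reporting rules listed above can be summarized as a small lookup table, together with a sketch of the Real-time validity condition. The dictionary and function names are hypothetical, and interpreting "a large portion of execution" as the fraction of individual predictions that meet the threshold is an assumption; the actual rule is defined per benchmark.

```python
from typing import Dict, Sequence

# Reported metrics per task scenario, as described in the text.
SCENARIO_METRICS: Dict[str, Dict[str, str]] = {
    "Offline": {"performance": "throughput", "efficiency": "average power"},
    "Server": {"performance": "throughput", "efficiency": "average power"},
    "Single-stream": {"performance": "latency", "efficiency": "energy per sample"},
    "Real-time": {"performance": "latency", "efficiency": "average power"},
    "Optimization": {
        "performance": "time to solution at each quality threshold",
        "efficiency": "total energy at each quality threshold",
    },
}


def realtime_submission_is_valid(
    latencies_s: Sequence[float],
    threshold_s: float,
    required_fraction: float = 0.9,  # example value from the text (90%)
) -> bool:
    """Check that enough predictions meet the real-time latency threshold.

    Assumption: the portion of execution is measured per prediction.
    """
    if not latencies_s:
        raise ValueError("at least one latency measurement is required")
    within = sum(1 for latency in latencies_s if latency <= threshold_s)
    return within / len(latencies_s) >= required_fraction


# Nine of ten predictions meet a 20 ms threshold in the first toy trace.
mostly_on_time = [0.01] * 9 + [0.05]
too_many_late = [0.01] * 8 + [0.05, 0.05]
```

A submission that fails this check would not be considered valid regardless of its power measurement, which is why latency and efficiency are reported together for every run.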

Benchmark submissions may perform separate runs to report performance and power in order to demonstrate system flexibility (e.g., a ‘performance-mode’ run optimal for latency and an ‘efficiency-mode’ run optimal for energy); however, both metrics must be reported in all runs.

Importantly for the NeuroBench system track, in measuring latency and efficiency, data pre- and post-processing must be taken into account. Neuromorphic methods will often consume and produce non-standard (e.g., event-based) data modalities, the processing of which may consume a significant amount of the overall execution latency and may not be computed on the neuromorphic hardware itself. As many instances of neuromorphic hardware cannot be deployed without such associated processing, it is essential that latency timing and efficiency measurement captures the cost of data processing, which stands in contrast with conventional system benchmarks that measure starting from pre-processed data [59].

System Track Benchmarks

Benchmarks for the system track will include tasks of interest to neuromorphic systems, from embedded to datacenter scales. The tasks are key application areas for existing systems, and they differ from the tasks in the algorithm track, which are more research-oriented. Towards future iterations of the NeuroBench suite, the algorithm and system tracks are intended to coalesce as both algorithms and systems mature. As the benchmark results identify properties of highly effective algorithms and systems, the algorithm and system tracks will converge to the same selection of tasks that are seen as the most impactful for future progress in the field. Two benchmark specifications for the system track are defined in this article.

  • Acoustic Scene Classification – The acoustic scene classification benchmark challenges systems to classify audio into predefined categories based on the environmental audio context. Such capabilities are key for hearable devices, which can utilize them to automatically adjust sound equalisation profiles, appropriately target microphone denoising, and support active noise cancellation. The application further challenges systems to fulfill technical requirements, such as always-on and real-time operation, and time series processing. Acoustic scenes provide a rich repertoire of features that are necessary for prediction, thus this task is a complement to keyword classification, which mainly focuses on shorter-term features (e.g., phonemes) with a relatively smaller feature repertoire.

    The benchmark evaluates the classification capabilities of both neuromorphic systems and conventional computing platforms using datasets such as those from the DCASE challenge [60] (if permissible, subject to license). These datasets consist of a myriad of audio recordings from diverse environments, including airports, public parks, and buses, thus providing a comprehensive foundation for testing both application- and system-level performance.

    The task will be presented under the Real-time or Single-stream task scenarios, providing a continuous audio stream or sliced samples to the SUT, during which the acoustic scene periodically changes. Classification probability will be sampled to determine the correctness of the prediction. The average power measured during inference is a key indicator of the efficiency and always-on capability of the SUT, and inference latency will be measured in terms of the prediction time relative to the onset of each new acoustic scene.

  • QUBO – As an Optimization scenario task, NeuroBench incorporates quadratic unconstrained binary optimization (QUBO). QUBO is a particularly beneficial first optimization task for NeuroBench for two reasons. First, the binary variables are inclusive to neuromorphic systems with purely binary spike communication. Second, real-world QUBO applications typically feature sparse cost matrices [61] which benefit from the sparse synaptic connectivity / matrix multiplication that neuromorphic systems are often optimized for [62]. The initial set of QUBO workloads in NeuroBench searches for the maximum independent (i.e. unconnected) set of nodes in graphs, a task that has wide applications across industry and academia, such as resource allocation in wireless networks, portfolio optimization, and task scheduling.

    NeuroBench will provide a QUBO generator that can uniquely specify each workload by three parameters provided by the benchmark: the number of graph nodes, the sparsity of graph connections, and a random seed. The generator provides a large dataset for reliable statistics and allows scaling from modest workloads for small-scale and prototype systems to large workloads for larger-scale systems. An independent dataset is provided to tune the hyperparameters of the solver. The SUT is evaluated based on three metrics. The first is the maximum supported workload size. The second and third are the time and energy, respectively, required to obtain pre-defined levels of optimality. The solution optimality is measured by the size of the independent set of nodes found by the SUT.
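As an illustration of how a maximum independent set workload can be written as a QUBO, the following is a minimal sketch, not the NeuroBench generator. It assumes the common penalty formulation (cost = -Σ x_i + P Σ_{(i,j)∈E} x_i x_j with P > 1). The names `make_graph`, `mis_qubo`, `edge_prob`, and `penalty` are hypothetical, and the brute-force solver is only practical for very small graphs.

```python
import itertools
import random

def make_graph(num_nodes, edge_prob, seed):
    """Hypothetical generator: a random graph fixed by (size, density, seed)."""
    rng = random.Random(seed)
    return [(i, j) for i, j in itertools.combinations(range(num_nodes), 2)
            if rng.random() < edge_prob]

def mis_qubo(num_nodes, edges, penalty=2.0):
    """Upper-triangular Q such that cost(x) = sum_{i<=j} Q[i][j] * x_i * x_j."""
    Q = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        Q[i][i] = -1.0        # reward every selected node
    for i, j in edges:
        Q[i][j] = penalty     # penalize selecting both ends of an edge
    return Q

def qubo_cost(Q, x):
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))

def is_independent(edges, x):
    return all(not (x[i] and x[j]) for i, j in edges)

def brute_force(Q):
    """Exhaustive search over all binary vectors (tiny workloads only)."""
    n = len(Q)
    return min(itertools.product((0, 1), repeat=n),
               key=lambda x: qubo_cost(Q, x))

edges = make_graph(num_nodes=8, edge_prob=0.3, seed=0)
best = brute_force(mis_qubo(8, edges))
print(sum(best), is_independent(edges, best))
```

With a penalty larger than 1, removing one endpoint of any violated edge always lowers the cost, so the minimizer is an independent set and its size equals the negated cost.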

System Track Harness

Figure 7: An overview of the NeuroBench harness supporting the system track.

A diagram of the NeuroBench harness with extended support for the system track is shown in Figure 7. Blue boxes in the figure denote hardware-specific infrastructure components which will connect to the general top-level harness interfaces. Common interfaces within the runtime enforce that the harness can be modularly extended to various system backends.

Within the system runtime, intermediate representations (IR) are used to compile and map models. In ML development, the ONNX IR [63] has been an enabler of cross-platform portability, thus IR infrastructure within the harness is an important component of common tooling. A common high-level IR such as the Neuromorphic Intermediate Representation (NIR) [64] or Lava [36] can unify spatio-temporal graphs describing the neuromorphic models, and may be transformed into a more optimized low-level IR that is specific to the target hardware (e.g., sPyNNaker machine graph [65]). The inclusion of shared IRs within the harness allows for benchmarking equivalent models across multiple systems, which will highlight system optimizations for cross-compatibility. The IRs may also be hardware-specific, tightly integrated within the system to maximize performance.

From a top-level API, users will be able to run algorithm benchmarks or system benchmarks through the corresponding runtimes. By providing the runtimes side-by-side, the harness presents a single tool for both algorithm and system benchmarking. As the algorithm and system tracks mature to support the same tasks, the harness will accelerate benchmarking by facilitating quick prototype complexity analysis followed by deployed performance measurement with minimal implementation overhead.

Discussion

Benchmarking neuromorphic computing has faced challenges stemming from the diversity of neuromorphic approaches, the range of implementation and deployment tools, and rapid research evolution. NeuroBench addresses these challenges as a framework for the inclusive, actionable, and iterative benchmarking of neuromorphic solutions, by including novel tasks and metrics, open-source and extendable harness tools, and facilitating systematic growth via community collaboration. NeuroBench is supported and developed by a broad community of neuromorphic researchers to be a standard, agreed-upon benchmarking framework for neuromorphic technology.

Initial NeuroBench benchmarks span applications across domains of continual learning, computer vision, sensorimotor prediction, and time-series forecasting, as well as system implementations for audio and optimization settings. Baselines for each benchmark of the complete v1.0 algorithm track demonstrate the utility and validity of the metric framework, and offer a starting point to further algorithmic research in model architecture and training for greater performance and lowered complexity.

The initial NeuroBench algorithm and system tracks achieve the first steps of designing benchmarks for algorithms executed in a digital time-stepped fashion and systems in mature deployment stages. In order to expand inclusive and fair benchmarking to further approaches, the next milestones of the NeuroBench project will be to extend metrics and standard protocol designs to cover continuous-time execution and a wider range of hardware platforms including FPGAs, custom integrated circuits, as well as more exploratory platforms in simulation stage, such as memristive hardware.

Another important direction for NeuroBench is towards closed-loop benchmarks [15, 66]. Biological systems excel in interacting with dynamic environments, demonstrating high energy efficiency, real-time reaction, and versatility. As such, embodied intelligence with adaptive sensory and action capabilities is of interest to neuromorphic research. In closed-loop scenarios, the objective is to sense and act within an environment to complete a task, rather than to statically process a frozen dataset, thus the benchmark harness infrastructure and measurement protocols will be extended to facilitate such benchmarks.

All future NeuroBench expansion will be informed by collected results and continue to be driven by the interests and development of the broader community.

Methods

This section outlines details and specifications of the benchmark metrics, tasks, and baselines.

Specifications of the Algorithm Track Metrics

NeuroBench includes correctness and complexity metrics, the latter of which are divided into static and workload metrics. Static metrics do not depend on the model inference and input data, while the workload metrics do. Note that the defined metrics reflect only the model and model execution. Data pre-processors and post-processors are not taken into account in the v1.0 algorithm track results.

Footprint

The footprint metric reflects the memory footprint of a model. It is distinct from execution memory, which may incur further usage, e.g. to store activations. It is computed for a model by accumulating the sizes of the model’s parameters and buffers, in bytes. Parameters store the model synaptic weights, and buffers include other inference memory requirements, such as the internal states of recurrent or spiking layers and buffers of recent input data, if the model must record data for input binning. Considering $n$ parameters, each requiring $p_i$ bytes, and $b$ buffers of size $q_j$, the total model footprint is $\sum_{i=0}^{n} p_i + \sum_{j=0}^{b} q_j$.

Model Execution Rate

Execution rate is a metric which is not directly computed by the harness, but should be reported by the user. The metric reflects the real-time rate at which the model computes input data. If the model processes input with a temporal stride of $t$ seconds, then the rate should be reported as $t^{-1}$ Hz. Note the distinction between stride and bin window: input can be binned in overlapping windows, but execution rate depends on the temporal stride of window processing. As an example, a model may use 50 ms windows of input and compute every 10 ms, which would give an execution rate of 100 Hz.

This metric is currently not well-defined for models operating under event-based or continuous-time contexts. These limitations will be addressed in future benchmark versions.

Connection sparsity

The parameter matrices of each layer $l$ in a model, representing synaptic weights, are collected, and the number of zero weights $m_l$ and total weights $n_l$ are aggregated, with the connection sparsity defined as $\frac{\sum_l m_l}{\sum_l n_l}$.

Activation sparsity

Activation sparsity is computed after the inference phase. The sparsity is calculated by accumulating the number of zero activations ($z$) over all neuron layers ($l$), timesteps ($t$), and input samples ($i$), and dividing by the total number of neurons ($N$): $\frac{\sum_l \sum_t \sum_i z_{l,t}^{i}}{\sum_t \sum_l \sum_i N_{l,t}^{i}}$. The outputs of ReLU functions and spikes from spiking neurons are considered activations.
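The two sparsity definitions can be sketched in a few lines. This is an illustrative sketch only, not the harness implementation. It assumes weights are given as nested lists per layer, and that activations have already been recorded as one flat vector per (layer, timestep, sample) combination. The function names are hypothetical.

```python
def connection_sparsity(layers):
    """layers: one weight matrix (list of rows) per layer."""
    zeros = sum(1 for W in layers for row in W for w in row if w == 0)
    total = sum(len(row) for W in layers for row in W)
    return zeros / total if total else 0.0

def activation_sparsity(recorded):
    """recorded: one activation vector per (layer, timestep, sample)."""
    zeros = sum(1 for vec in recorded for a in vec if a == 0)
    total = sum(len(vec) for vec in recorded)
    return zeros / total if total else 0.0

weights = [[[0, 1], [2, 0]], [[0, 0, 3]]]   # two layers, 7 weights, 4 zeros
spikes = [[0, 0, 1], [1, 0]]                # 5 recorded activations, 3 zeros
print(connection_sparsity(weights), activation_sparsity(spikes))
```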

Synaptic operations

Synaptic operations are the multiplication of weights by activation or input data, and are calculated using the inputs and weights of connection layers (e.g., torch.nn.Linear and torch.nn.Conv2d). Effective synaptic operations are operations where a non-zero weight is multiplied by a non-zero activation. Effective operations are further divided into multiply-accumulates (MACs), and accumulates (ACs), where accumulates correlate with activations or input data only containing values of [-1, 0, 1], and multiply-accumulates cover all other cases. The reported number of synaptic operations is the average number of synaptic operations required per model execution, the rate of which is defined by the model execution rate metric.

The number of effective synaptic operations is computed by performing the forward pass of a layer and counting the number of operations in which there is no zero multiplication. Practically, this is implemented in the harness by setting all non-zero weights in the layer and all the non-zero activations to 1, then performing the forward pass and summing the output to give the number of synaptic operations.

The number of dense synaptic operations is computed in a similar fashion, by setting all weights and activations to 1 and accumulating the output of the forward pass. Biases are not taken into account in the calculation of the synaptic operations, as they are added after weight multiplications and accumulation.
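The counting procedure described above can be illustrated for a single fully-connected layer. This is a minimal pure-Python sketch under stated assumptions, not the harness code: the layer is a plain weight matrix without bias, and `synaptic_ops`, `binarize`, and `linear_forward` are hypothetical helper names.

```python
def binarize(values):
    """Map every non-zero entry to 1 and every zero entry to 0."""
    return [1 if v != 0 else 0 for v in values]

def linear_forward(W, x):
    """Bias-free forward pass of a fully-connected layer."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def synaptic_ops(W, x):
    # Effective ops: non-zero weight multiplied by non-zero activation.
    W_bin = [binarize(row) for row in W]
    effective = sum(linear_forward(W_bin, binarize(x)))
    # Dense ops: every weight and activation counted, zero or not.
    W_ones = [[1] * len(row) for row in W]
    dense = sum(linear_forward(W_ones, [1] * len(x)))
    # Inputs restricted to {-1, 0, 1} need accumulates only.
    kind = "AC" if all(a in (-1, 0, 1) for a in x) else "MAC"
    return {"effective": effective, "dense": dense, "kind": kind}

W = [[0.5, 0, -1], [0, 0, 2]]
print(synaptic_ops(W, [1, 0, 1]))     # spike-like input
print(synaptic_ops(W, [0.3, 0, 1]))   # real-valued input
```

Summing the output of the binarized forward pass counts exactly the weight-activation pairs in which neither factor is zero, which is why the same forward pass can be reused for counting.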

Note that processing of activations before the connection layer, for instance using batch normalization, can transform sparse activations into dense input at the connection layer, which will lead to high effective synaptic operations despite high activation sparsity. Furthermore, such processing can transform binary activations to non-binary data, causing effective operations to be MACs rather than ACs. When deployed to neuromorphic hardware, such algorithms that normalize activations before multiplication with synaptic weights may lose the benefits of sparse operation, e.g., an SNN with normalization following each spiking layer would require dense MAC weight calculation, no matter how few spikes were generated.

In some cases, algorithm execution may have distinct temporal sections of higher and lower synaptic operations, such as during initial caching versus continuous inference. For such algorithms, benchmark users may choose to distinguish synaptic operations and other complexity measurements between execution sections.

Algorithm Track Benchmark Tasks

Keyword FSCIL

Few-shot Class-Incremental Learning, FSCIL, is an established benchmark task setting in the computer vision domain [26]. It can be defined as follows: a base session with fixed classes, each with abundant training data, is used to train an initial model. Then, successive incremental training sessions introduce new classes in a few-shot learning scenario. In each session, only the current session classes are available to the model for training. After each incremental training session, the model is evaluated on all previously seen classes, including the base classes. Therefore, the model has to learn new classes while retaining knowledge about the previously learned ones.

Formally, for $M$-step FSCIL, where $M$ is the total number of incremental sessions, each training session uses a support dataset $D^{(t)}$, $t \in [0, M]$, to train new classes on. $L^{(t)}$ is the set of classes of the $t$-th session, where $\forall i, j$ with $i \neq j$, $L^{(i)} \cap L^{(j)} = \varnothing$, meaning each training session uses a unique set of classes. $D^{(0)}$ and $L^{(0)}$ are the base class training data and set of base classes, respectively, $D^{(1)}$ and $L^{(1)}$ represent the first incremental session set, and so on. At session $t$, only $D^{(t)}$ is available for training, and for $t > 0$, $D^{(t)}$ contains a fixed number of classes ($N$) with few samples per class ($K$). This form of FSCIL is therefore named $N$-way $K$-shot FSCIL. At the end of each session $t$, model accuracy is reported on the test samples of all previously seen classes $\{L^{(0)} \cup L^{(1)} \cup \dots \cup L^{(t)}\}$.

For the Keyword FSCIL task, classes in the base set ($L^{(0)}$) have 700 samples each, with a fixed train/validation/test sample split of 500/100/100. All classes within incremental sessions have 200 samples per word, with a fixed train/test split of 100/100. Of the 100 training samples, 5 are randomly selected for few-shot learning (each session is 10-way, 5-shot). The inclusion of 200 samples allows for increasing learning up to 100 samples.

NeuroBench proposes an audio keyword classification version of the FSCIL task, which to the best of our knowledge is the first of its kind. This novel task is established by selecting a subset of the words and languages from the Multilingual Spoken Word Corpus (MSWC) [21] dataset. The FSCIL task consists of a multilingual set of 100 base classes and 10 incremental sessions of 10 classes each, for a final total of 200 learned classes. Fifteen languages are represented: the base classes are composed of a set of five base languages with 20 words each, and each of the ten incremental sessions contains 10 words from a distinct language. The languages were chosen based on data availability within the MSWC dataset. The top five languages with the greatest number of potential words (words with enough data samples) are used as the base class languages, while the next ten languages with the greatest numbers are the incremental classes. The base languages are English, German, Catalan, French and Kinyarwanda. Incremental languages are Persian, Spanish, Russian, Welsh, Italian, Basque, Polish, Esperanto, Portuguese and Dutch. The order of languages presented in the incremental sessions is randomized, but each incremental session will represent exactly one new language.

For each language, the longest length words (that had the appropriate number of samples) were selected to allow for rich and robust temporal features to be learned. Next to the richness of longer words, there are practical considerations for this choice. The MSWC dataset normalizes all samples to a duration of 1 second, centered around the 0.5 seconds mark. For shorter words, this means that the data needs to be zero-padded on both sides to fill the entire duration. Longest-length words are likely to fill the complete sample and reduce zero-padding, which is also useful in scenarios in which algorithms seek to classify words before the sample has completed [67]. Furthermore, common keyword spotting solutions, such as Ok Google, Alexa, and Hey Siri, use multi-syllable wake-phrases to assist in accurate word classification.

Within each language, words showing great similarity in phonics and meaning are not included (e.g. l’amendement and amendements in French). Across different, but related languages, words with similar pronunciation and meaning were not included as well (e.g., university, universität and universitat in English, German and Catalan).

The subset of MSWC used for this FSCIL task is significantly smaller in size (630MB) compared to the full MSWC dataset (124GB), and subset download details can be found in the harness.

Event Camera Object Detection

The task of object detection using event camera data involves identifying bounding boxes of objects belonging to multiple predetermined classes in an event stream. The dataset is the Prophesee 1 Megapixel automotive detection dataset [22], which is one of the largest and highest-resolution event-camera detection datasets currently available. The performance of the task is defined by the COCO mean average precision (mAP) metric [28], a metric that is commonly used for the evaluation of object detection algorithms. Only three out of the seven available object classes within the dataset are used due to limited sample availability in the dataset, which matches prior work [22].

COCO mAP is calculated using the intersection over union (IoU, Equation 1) of the bounding boxes produced by the model against ground-truth boxes. Here, $A$ and $B$ refer to bounding boxes, and the intersection and union consider the overlapping area and the total area covered by the two boxes, respectively. The IoU is compared against 10 thresholds between 0.50 and 0.95, with a step size of 0.05. For each threshold, precision is calculated (Equation 2) with True Positives (TP) and False Positives (FP) determined by whether the IoU meets the threshold or not, respectively. The mAP is calculated as the averaged precision over all thresholds for each class, which is further averaged over all classes to produce the final result.

$$IoU(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{1}$$
$$Precision(TP, FP) = \frac{\sum TP}{\sum TP + \sum FP} \tag{2}$$
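Equations 1 and 2 can be illustrated with a simplified sketch. It is not the Prophesee Metavision implementation used by the benchmark. It assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max) and that each predicted box has already been paired with a ground-truth box of a single class, whereas the full COCO protocol also involves confidence-ranked matching. All function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Ten thresholds: 0.50, 0.55, ..., 0.95
THRESHOLDS = [(50 + 5 * k) / 100 for k in range(10)]

def precision_at(ious, threshold):
    """Equation 2, with TP = IoU meets threshold and FP = it does not."""
    if not ious:
        return 0.0
    tp = sum(1 for v in ious if v >= threshold)
    return tp / len(ious)

def mean_precision(ious):
    """Precision averaged over all IoU thresholds for one class."""
    return sum(precision_at(ious, t) for t in THRESHOLDS) / len(THRESHOLDS)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(mean_precision([1.0, 0.62, 0.2]))
```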

Note that in the dataset, labels are generated from images from an RGB camera. Due to the nature of event cameras, objects which are still at the start of a recording sequence have no generated events and cannot be detected. Therefore, labels within the first 0.5 seconds of each sequence are not taken into account. Furthermore, as the RGB camera used for labeling has a higher resolution than the event camera, not all objects which appear in the RGB image are recognizable from the generated events. Thus, objects with a diagonal of less than 60 pixels are also not considered. The dataset and metric measurement are implemented using the Prophesee Metavision software [68].

Non-human Primate Motor Prediction

The non-human primate motor prediction task involves predictive modeling of two-dimensional fingertip velocity, given neural motor cortex data. The six sessions used for the benchmark comprise three recording sessions each from two non-human primates (NHP Indy and NHP Loco) such that the chosen sessions approximately span the entire duration of the experiment [23] (several months). The specific sessions used are indy_20170131_02, indy_20160630_01, indy_20160622_01, loco_20170301_05, loco_20170215_02, and loco_20170210_03. Each of these sessions consists of one day of experiments, during which multiple reaches are recorded. During each reach, a target position is displayed, which the NHP needs to localize and touch with its finger. Once the NHP touches the correct target for the current reach, the next reach is instantiated, showing a new target position. The data contains sensorimotor cortex recordings from 96 channels for the recordings of the first NHP (Indy), while 192 channels were used for the second NHP (Loco), and was gathered and labeled at a frequency of 250 Hz. Two-dimensional position data of the NHP fingertip during its reaches is provided in the dataset, and these are translated into $X$ and $Y$ velocity ground-truth labels using discrete derivatives [69].

Each session is segmented into individual reaches based on the target position for the NHP to touch. The data in each session is split such that the initial 75% of reaches are used for training and validation, and the remaining 25% of reaches are test data. The user can choose how to utilize the training and validation split for their particular method.

During evaluation, the coefficient of determination ($R^2$, Equation 3) for the $X$ and $Y$ velocities are averaged to report the correctness score for each session, where $n$ is the number of labeled points in the test split of the session, $y_i$ is the ground-truth velocity, $\hat{y_i}$ is the predicted velocity, and $\bar{y}$ is the mean of the ground-truth velocities. The $R^2$ from sessions for each NHP are averaged, producing two final correctness scores.

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y_i})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{3}$$
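The per-session score can be sketched directly from Equation 3. This is a minimal illustration rather than the harness code. The names `r2_score` and `session_score` are hypothetical, and the ground-truth series are assumed to be non-constant so that the denominator is non-zero.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination for one velocity component (Equation 3)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

def session_score(vx_true, vx_pred, vy_true, vy_pred):
    """Average of the X and Y coefficients of determination for a session."""
    return 0.5 * (r2_score(vx_true, vx_pred) + r2_score(vy_true, vy_pred))

# A perfect X prediction and a constant (mean) Y prediction
print(session_score([1, 2, 3], [1, 2, 3], [1, 2, 3], [2, 2, 2]))
```

A model that always predicts the mean ground-truth velocity scores 0, and predictions worse than that yield negative values.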

Chaotic Function Prediction

The chaotic function prediction is another sequence-to-sequence problem. Given an input sequence generated from a one-dimensional Mackey-Glass function, the task is to predict the future values of the same function. The dataset used for this task is synthetically generated, following the Mackey-Glass differential equation [24] (Equation 4), which is integrated and discretized with a timestep of $\Delta t$. The time series generated by this differential equation is a function of the Mackey-Glass parameters $n$, $\beta$, $\gamma$ and $\tau$. Adhering to standard parameters [29], the values used for $n$, $\beta$, and $\gamma$ are 10, 0.2 and 0.1 respectively. $\tau$ is varied between 17 (a standard value) and 30, leading to 14 time series which vary greatly in dynamics and can be used to analyze the generalization of predictive models.

Each value of $\tau$ is associated with a Lyapunov time, the expected predictability timescale for chaos [33], which is used as the time unit for each series. To calculate the overall Lyapunov time for each value of $\tau$, we average the Lyapunov times of 30,000 generated time series of 2,000 timesteps, with $\Delta t = 1.0$, each with a randomly chosen initial condition. All time series and Lyapunov times were generated and estimated using the JiTCDDE library [70]. For each final time series used for benchmarking, initial conditions are a point randomly chosen along the series. The Lyapunov time and initial condition $x_0$ for each of the 14 final time series are provided in Table 6.

$$\frac{dx}{dt} = \frac{\beta x(t - \tau)}{1 + x(t - \tau)^{n}} - \gamma x(t) \tag{4}$$
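To make the role of the delay $\tau$ in Equation 4 concrete, the following sketch integrates the equation with a forward Euler step. It is only an illustration: the benchmark data are generated with JiTCDDE, and this simple scheme will not reproduce those series. The constant initial history, the default value of `x0`, and the function name `mackey_glass` are assumptions made for the example.

```python
def mackey_glass(num_steps, tau=17, beta=0.2, gamma=0.1, n=10, dt=1.0, x0=1.2):
    """Forward-Euler integration of the Mackey-Glass delay equation."""
    delay = int(round(tau / dt))
    # Assumption: x(t) = x0 for all t in [-tau, 0]
    history = [x0] * (delay + 1)
    series = []
    for _ in range(num_steps):
        x_now = history[-1]
        x_delayed = history[-(delay + 1)]      # x(t - tau)
        dx = beta * x_delayed / (1.0 + x_delayed ** n) - gamma * x_now
        x_next = x_now + dt * dx
        history.append(x_next)
        series.append(x_next)
    return series

series = mackey_glass(500)
print(series[:3])
```

Because the derivative depends on the state $\tau$ time units in the past, the integrator has to keep a buffer of past values rather than only the current state.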

As the integration of the differential equation can depend on underlying floating-point arithmetic and thus produce varying time series on different machines, the datasets are precomputed and loaded for training and evaluation. In the benchmark results, 30 instantiations of the Mackey Glass system are used, each with a length of 20 Lyapunov times and successively shifted forwards by half a Lyapunov time. The dataset time series are generated for 50 total Lyapunov times to allow for varied offset starting points. The generated time series are available to be downloaded under the NeuroBench harness.

$\tau$ Lyapunov Time $x_0$
17 197 0.7206597
18 138 0.7744313
19 315 0.7783468
20 131 0.9225991
21 191 0.9479431
22 119 0.5455960
23 106 0.8622247
24 97 0.3259660
25 98 0.8297825
26 104 1.0033490
27 112 0.6491406
28 119 1.0957495
29 131 0.9256179
30 139 0.2713639
Table 6: Mackey-Glass parameters used for the 14 time series.

Symmetric mean absolute percentage error (sMAPE, Equation 5), a standard metric in forecasting [32], is used to measure the correctness of the model predictions $\hat{y_i}$ against the ground-truth $y_i$, over $n$ data points in the test split of the time series. The sMAPE metric has a bounded range of $[0, 200]$, thus diverging predictions (infinity or NaN) due to floating-point arithmetic have bounded error which can be used to average correctness over multiple time series instantiations.

$$sMAPE = 200 \times \frac{1}{n} \left( \sum_{i=1}^{n} \frac{|y_i - \hat{y_i}|}{(|y_i| + |\hat{y_i}|)} \right) \tag{5}$$
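Equation 5 can be sketched as follows. How diverged predictions are mapped onto the bound is not specified above, so this example assumes that a non-finite prediction contributes the maximum per-point error, and that a point where both values are zero contributes no error. These choices and the function name `smape` are assumptions, not the harness behavior.

```python
import math

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, bounded in [0, 200]."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        if not math.isfinite(p):
            total += 1.0          # assumption: diverged point = worst case
            continue
        denom = abs(y) + abs(p)
        # assumption: both values zero counts as a perfect prediction
        total += abs(y - p) / denom if denom > 0 else 0.0
    return 200.0 * total / len(y_true)

print(smape([1.0, 2.0], [1.0, 2.0]))
print(smape([1.0, 1.0], [float("inf"), 1.0]))
```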

Algorithm Track Baselines

All baselines are implemented using PyTorch nn.Module objects in order to interface with the harness.

Keyword FSCIL

The ANN baseline employs Mel-frequency cepstral coefficients (MFCC) pre-processing along with a modified version of the M5 deep convolutional network architecture [40].

The MFCC pre-processing converts the 48 kHz, 1 second audio samples from MSWC into 20 channels of 200 timesteps (5 ms stride, 10 ms time bins), focusing on frequencies within the human voice range between 20 Hz and 4 kHz. The network contains four successive blocks, each consisting of 1D convolution, batch-normalization, ReLU activation, and max-pooling layers, followed by a single readout fully-connected layer. Convolutional layers apply their kernels over the temporal dimension of the samples, thus extracting longer temporal features through the depth of the network. We also incorporate dropout after the ReLU activations to avoid over-fitting and let the network be more general for incremental learning. The network is trained with stochastic gradient descent using cross-entropy loss and the Adam optimizer.

For the SNN baseline, we employ the Speech2Spikes [42] (S2S) preprocessing algorithm to convert audio samples to spikes. For S2S we use the default parameters from the original implementation, except for the hop length, which is updated to match the 48 kHz audio frequency of the MSWC samples; the original implementation was applied to 16 kHz audio. S2S applies a Mel Spectrogram and a log operation to raw audio samples, converting them to positive and negative trains of spikes using delta-encoding.

Spike trains from S2S are used as input for the recurrent SNN (RSNN), which consists of 2 recurrent adaptive leaky integrate-and-fire (RadLIF) layers of 1024 neurons and one linear output layer. The model architecture is adapted from Bittar’s work [41]. The RadLIF neurons in these layers are LIF neurons that produce a binary spike $\mathbf{s}(t)$ and reset via subtraction when their membrane potential $\mathbf{u}(t)$ crosses a certain threshold value $\theta$, combined with an extra adaptation variable $\mathbf{w}(t)$ to enable more complex temporal dynamics and firing patterns. Equation 6 is the input current to neurons, with $x$ the input spikes from the previous layer, $W_f$ the forward weight matrix, $BNTT$ batch-normalization through time, and $W_r$ the recurrent weight matrix. $\mathbf{u}(t)$ and $\mathbf{w}(t)$ are shown in Equation 7, where $\alpha$, $\beta$, $a$ and $b$ are heterogeneously trainable parameters of the neuron. Finally, spikes $\mathbf{s}(t)$ are generated according to Equation 8.

$$\mathbf{I}(t) = BNTT(W_f[x(t)]) + W_r[s(t-1)] \tag{6}$$
$$\begin{aligned} \mathbf{u}(t) &= \alpha\left[\mathbf{u}(t-1)\right] + (1-\alpha)\left[\mathbf{I}(t) - \mathbf{w}(t-1)\right] - \theta[\mathbf{s}(t-1)] \\ \mathbf{w}(t) &= \beta[\mathbf{w}(t-1)] + a(1-\beta)[\mathbf{u}(t-1)] + b[\mathbf{s}(t-1)] \end{aligned} \tag{7}$$
$$\mathbf{s}(t) = \begin{cases} 0 & \text{if } \mathbf{u}(t) < \theta \\ 1 & \text{if } \mathbf{u}(t) \geq \theta \end{cases} \tag{8}$$
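The neuron update in Equations 7 and 8 can be sketched for a single neuron. This is an illustrative scalar version, not the baseline implementation. The input current $\mathbf{I}(t)$ of Equation 6 is taken as given, and the parameter values are arbitrary choices for the example rather than trained values.

```python
def radlif_step(I, u_prev, w_prev, s_prev, alpha, beta, a, b, theta=1.0):
    """One RadLIF update for a single neuron (Equations 7 and 8)."""
    # Membrane potential: leak, adapted input, reset by subtraction
    u = alpha * u_prev + (1 - alpha) * (I - w_prev) - theta * s_prev
    # Adaptation variable: decay, coupling to potential, spike-triggered jump
    w = beta * w_prev + a * (1 - beta) * u_prev + b * s_prev
    s = 1 if u >= theta else 0
    return u, w, s

def simulate(currents, alpha=0.5, beta=0.5, a=0.5, b=1.0, theta=1.0):
    """Run one neuron from rest; parameter values are illustrative only."""
    u, w, s = 0.0, 0.0, 0
    spikes = []
    for I in currents:
        u, w, s = radlif_step(I, u, w, s, alpha, beta, a, b, theta)
        spikes.append(s)
    return spikes

print(simulate([4.0] * 4))
```

Under a constant input, the adaptation variable grows with each spike and eventually suppresses firing, which is the behaviour the extra variable is meant to provide.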

The last layer of the network is a readout linear classifier, and the class corresponding to the maximum of the summation of output activities over all timesteps is chosen as the network prediction. The RSNN network is trained with backpropagation through time using a boxed pseudo-gradient and cross-entropy loss.

Algorithm 1 Few-Shot Class-Incremental Learning with Prototypes

Requires: Pre-trained network $g \circ f$ consisting of feature extractor $f$ and classifier $g: x \mapsto Wx + b$
Define: $(x)_l$, $w_l$ and $b_l$ respectively the set of input samples, classifier weights and biases associated with a class $l$

1: for each base class $k$ do
2:     Compute prototype embedding $c_k = \text{Mean}[f((x)_k)]$ (also summed over time for SNN baseline)
3:     Compute corresponding classifier weights $w_k = 2c_k$ and biases $b_k = -c_k c_k^T$
4: end for
5: Replace classifier layer $g$ with prototype weights: $W \leftarrow W_B = (w_k)_{k \in B}$ and biases $b \leftarrow b_B = (b_k)_{k \in B}$
6: for each session $i$ in sessions do
7:     Get session support $S^i$
8:     Repeat lines 1 to 4 for all new classes of $S^i$ to get prototype weights $W_{S^i}$ and biases $b_{S^i}$
9:     Extend the classifier layer weights $W \leftarrow [W, W_{S^i}]$ and $b \leftarrow [b, b_{S^i}]$
10: end for

We implement baseline solutions for the FSCIL task with both ANN and SNN models. The frozen baselines do not learn any new classes while the prototypical baselines follow the prototypical networks approach [43] to classify new classes. For both baselines, the ANN and SNN models are pre-trained on the 100 base classes $B$, which employs the abundant number of samples to develop a robust feature extractor $f$, which generates embeddings from hidden layers that are passed to a readout classifier.

For the frozen baselines, the model parameters are frozen after pre-training for inference during all incremental sessions, thus setting a ‘worst-case’ reference with no incremental learning but also no risk of catastrophic forgetting.

For the prototypical baselines, the pre-trained models learn 100 extra classes within the 10 incremental sessions in a 5-shot learning scenario. The prototypical networks protocol is applied in each incremental session as shown in Algorithm 1. Prototypical networks provide a clustering algorithm for classification that is equivalent to a readout affine operation on feature embeddings, resulting in a linear layer of weights and biases. Each class $k$ is represented by a prototype vector $c_k = \text{Mean}[f((x)_k)]$ defined as the average feature embedding produced by $f$ over all corresponding training samples $(x)_k$. The readout classifier layer is defined based on this prototype such that the weights $w_k$ and biases $b_k$ associated with class $k$ follow $w_k = 2c_k$ and $b_k = -c_k c_k^T$, which associates embeddings with the closest prototype with respect to the squared Euclidean distance [43].

For the SNN baselines, as the features also have a temporal dimension, we accumulate embeddings over all timesteps $t$ to define the prototype vector $c_{k}=\text{Mean}[\sum_{t}(f((x)_{k})_{t})]$. Also, as we maintain the summation over timesteps after the final prototype layer to keep the online nature of the SNN baseline, the biases will be applied at each timestep. Thus, to maintain the balance between weighted inputs and biases, for the SNN baseline we also normalize the biases by the total number of timesteps $T$: $b_{k}=-c_{k}c_{k}^{T}/T$.
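The effect of dividing the bias by $T$ can be illustrated with a small NumPy check. This is a sketch under assumed toy sizes (8 timesteps, 4 features, 3 classes) with random stand-ins for the prototypes and per-timestep embeddings; it only shows that applying the readout with bias $b_{k}/T$ at each timestep and summing gives the same result as applying the full bias once to the time-summed embedding.

```python
import numpy as np

rng = np.random.default_rng(1)
T, dim, num_classes = 8, 4, 3          # toy sizes (assumption)

c = rng.random((num_classes, dim))     # stand-in prototypes of time-summed embeddings
W = 2.0 * c                            # w_k = 2 c_k
b = -(c * c).sum(axis=1)               # b_k = -c_k c_k^T (before normalization)

f_t = rng.random((T, dim))             # per-timestep embeddings of one sample

# Online: readout applied at every timestep with the normalized bias b / T.
online = sum(W @ f_t[t] + b / T for t in range(T))

# Offline: embeddings summed over time first, full bias applied once.
offline = W @ f_t.sum(axis=0) + b
```

Both accumulated outputs match up to floating-point error, which is why the normalized bias preserves the nearest-prototype decision while keeping the per-timestep readout.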

We fit the prototypical networks approach to the FSCIL task by first discarding the original output layer and replacing it with the prototype weights $W_{B}$ and biases $b_{B}$ of the base classes, computed as described above based on the averaged feature embeddings over all 500 training samples per base class. This causes an initial accuracy drop, as the trained output layer weights are replaced by clustered weights for the prototypical learning approach. Then, for each incremental session, the prototype of each of the 10 new classes is defined based on the 5 corresponding support samples. The prototype weights and biases are computed in the same manner and concatenated to the existing classifier layer to accommodate the new classes.
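The session-by-session extension of the classifier can be sketched as follows. This is an illustrative NumPy example rather than the NeuroBench harness code: `prototype_layer` and `fake_embeddings` are hypothetical helpers, the embeddings are random stand-ins for the output of $f$, and the class counts are scaled down (4 base classes, 3 new classes per session) to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8                                 # embedding size (assumption)

def prototype_layer(embeddings, labels, class_ids):
    """Return prototype weights (2 c_k) and biases (-c_k c_k^T) per class."""
    protos = np.stack([embeddings[labels == k].mean(axis=0) for k in class_ids])
    return 2.0 * protos, -(protos * protos).sum(axis=1)

def fake_embeddings(class_ids, n_per_class):
    """Stand-in for f(x): random features shifted per class (illustration only)."""
    labels = np.repeat(class_ids, n_per_class)
    feats = rng.standard_normal((labels.size, dim)) + labels[:, None]
    return feats, labels

# Base session: the trained output layer is replaced by base-class prototypes.
base_ids = np.arange(4)
feats, labels = fake_embeddings(base_ids, n_per_class=50)
W, b = prototype_layer(feats, labels, base_ids)
W_base = W.copy()

# Incremental sessions: new classes are learned from 5 support samples each.
for session in range(2):
    new_ids = np.arange(W.shape[0], W.shape[0] + 3)
    feats, labels = fake_embeddings(new_ids, n_per_class=5)
    W_new, b_new = prototype_layer(feats, labels, new_ids)
    W = np.concatenate([W, W_new], axis=0)   # W <- [W, W_S]
    b = np.concatenate([b, b_new], axis=0)   # b <- [b, b_S]
```

Because new rows are only appended, the weights of previously learned classes are left untouched between sessions.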

Event Camera Object Detection

For both the RED ANN and Hybrid ANN-SNN baselines, the event data from the event camera are converted into frame-based representations using multi-channel time surfaces. Non-overlapping 50 ms time bins (with 50 ms stride) are further subdivided into three sub-bins. Each sub-bin, starting at timestamp $t_{0}$, generates two time surfaces $TS$ (Equation 9), based on each event $(x,y,p,t)$ in the sub-bin, where $x,y$ are event coordinates, $p$ is positive or negative polarity, and $t$ is the event time.

$$TS(p,y,x)=t-t_{0}\text{ for each event }(x,y,p,t)\text{ in the sub-bin.} \tag{9}$$
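A minimal sketch of this encoding for a single 50 ms bin is shown below. It is not the preprocessing code used for the baselines: the function name `time_surfaces`, the microsecond timestamps, the polarity encoding $p\in\{0,1\}$, the channel ordering, and the choice to let later events overwrite earlier ones at the same pixel are all assumptions made for illustration.

```python
import numpy as np

def time_surfaces(events, t_start, height, width, bin_len=50_000.0, n_sub=3):
    """Build a (2 * n_sub, height, width) frame from one time bin.

    events: iterable of (x, y, p, t) with p in {0, 1} and t in microseconds
            (both encodings are assumptions), sorted by increasing t.
    """
    sub_len = bin_len / n_sub
    frame = np.zeros((2 * n_sub, height, width), dtype=np.float64)
    for x, y, p, t in events:
        if not (t_start <= t < t_start + bin_len):
            continue                                   # event outside this bin
        s = min(int((t - t_start) // sub_len), n_sub - 1)  # sub-bin index
        t0 = t_start + s * sub_len                     # sub-bin start timestamp
        frame[2 * s + p, y, x] = t - t0                # Eq. 9; later events overwrite
    return frame

events = [(1, 2, 1, 1_000), (1, 2, 1, 5_000), (3, 0, 0, 20_000), (0, 0, 1, 49_999)]
frame = time_surfaces(events, t_start=0, height=4, width=5)
```

With three sub-bins and two polarities, each 50 ms bin yields a six-channel frame that can be passed to the convolutional layers.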

The RED ANN [22] is a deep convolutional neural network model using three feed-forward squeeze-and-excite [44] convolution layers followed by five recurrent convolution-LSTM [45] (ConvLSTM) layers. The squeeze-and-excite layers provide effective feature extraction while the ConvLSTM layers provide effective temporal learning. The single-shot detection (SSD [46]) head is used to predict the location and class of the bounding box based on multi-scale outputs from the recurrent layers.

The Hybrid ANN-SNN architecture adopts five LIF spiking neural layers to replace the ConvLSTM layers in RED, and shares the same feed-forward convolutional blocks as the RED. The LIF neuron layers are connected with feed-forward convolution, and have far fewer weights than the ConvLSTM layers. The Hybrid model uses the same input encoding method, object detection head, and training loss functions as the RED model. The LIF units are built using the SpikingJelly library [35], and the neuron dynamics of the LIF membrane potential are given in Equations 10, 11, and 12. $\mathbf{h}(t)$ is the charged potential before spiking during a timestep, dependent on activation input $X(t)$ and membrane time constant $\tau$, and $\mathbf{u}(t)$ is the final potential of the timestep, which resets to the reset value $V_{reset}$ if $\mathbf{h}(t)$ reaches the threshold voltage $V_{th}$. The same threshold determines $\mathbf{s}(t)$, which indicates whether a spike is produced. In the experiments, $\tau$ is set to 2.0, $V_{th}$ is 1.0, and $V_{reset}$ is 0.0.

$$\mathbf{h}(t)=\mathbf{u}(t-1)+\frac{1}{\tau}(X(t)-\mathbf{u}(t-1)) \tag{10}$$

$$\mathbf{u}(t)=\begin{cases}\mathbf{h}(t)&\text{if }\mathbf{h}(t)<V_{th}\\ V_{reset}&\text{if }\mathbf{h}(t)\geq V_{th}\end{cases} \tag{11}$$

$$\mathbf{s}(t)=\begin{cases}0&\text{if }\mathbf{h}(t)<V_{th}\\ 1&\text{if }\mathbf{h}(t)\geq V_{th}\end{cases} \tag{12}$$
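Equations 10 to 12 can be stepped through directly. The sketch below uses plain NumPy rather than the SpikingJelly API, with a hypothetical `lif_step` function and a constant input of 1.5 chosen only for illustration; the defaults match the values stated in the text ($\tau=2.0$, $V_{th}=1.0$, $V_{reset}=0.0$).

```python
import numpy as np

def lif_step(u_prev, x, tau=2.0, v_th=1.0, v_reset=0.0):
    """One LIF timestep with a reset in the same timestep as the spike."""
    h = u_prev + (x - u_prev) / tau           # Eq. 10: charge
    s = (h >= v_th).astype(h.dtype)           # Eq. 12: spike if threshold reached
    u = np.where(h >= v_th, v_reset, h)       # Eq. 11: reset or keep potential
    return u, s

u = np.zeros(1)
spikes = []
for _ in range(6):
    u, s = lif_step(u, np.array([1.5]))       # constant input (assumption)
    spikes.append(int(s[0]))
# spikes == [0, 1, 0, 1, 0, 1]
```

With this input the potential charges to 0.75, then to 1.125, which crosses the threshold and is reset within the same timestep, so the neuron fires on every second step.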

The losses used to train the RED ANN and Hybrid baselines match previous work [22], using a combination of regression and classification loss functions. Regression loss $L_{r}$ (Equation 13) for all predicted boxes $B$ and ground-truth boxes $T$ is given by smooth $l_{1}$ loss $L_{s}$ [46] (Equation 14), averaged over $N$ predicted bounding boxes $B_{i}$ and their corresponding ground-truth boxes $T_{i}$. Smooth $l_{1}$ loss is a piecewise loss function with threshold $\beta$, which is set to 0.11. For the classification loss $L_{c}$ (Equation 15), softmax focal loss [71] is used, with correct-class probability $p_{l}$ for all default boxes in the regression head and constant $\gamma$, which is set to 2.

$$L_{r}(B,T)=\frac{1}{N}\sum_{i}L_{s}\left(B_{i},T_{i}\right) \tag{13}$$

$$L_{s}\left(B_{i},T_{i}\right)=\begin{cases}\left|B_{i}-T_{i}\right|-\frac{\beta}{2}&\text{ if }\left|B_{i}-T_{i}\right|\geq\beta\\ \frac{1}{2\beta}\left(B_{i}-T_{i}\right)^{2}&\text{ otherwise }\end{cases} \tag{14}$$

$$L_{c}\left(p_{l}\right)=-\left(1-p_{l}\right)^{\gamma}\log\left(p_{l}\right) \tag{15}$$
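The three loss terms can be written compactly in NumPy. This is a sketch for illustration, not the training code: the function names are hypothetical, the example boxes and probabilities are made up, and summing the smooth $l_{1}$ terms over the four coordinates of each box before averaging over boxes is an assumption, since the text does not state how coordinates are reduced.

```python
import numpy as np

def smooth_l1(pred, target, beta=0.11):
    """Elementwise smooth l1 loss (Eq. 14)."""
    diff = np.abs(pred - target)
    return np.where(diff >= beta, diff - beta / 2.0, diff ** 2 / (2.0 * beta))

def regression_loss(pred_boxes, true_boxes, beta=0.11):
    """Eq. 13: mean over N boxes; coordinates are summed per box (assumption)."""
    per_box = smooth_l1(pred_boxes, true_boxes, beta).sum(axis=1)
    return per_box.mean()

def focal_loss(p_correct, gamma=2.0):
    """Eq. 15: focal loss on the correct-class probability."""
    return -((1.0 - p_correct) ** gamma) * np.log(p_correct)

# Made-up boxes and probabilities, for illustration only.
pred_boxes = np.array([[0.0, 0.0, 1.0, 1.0], [0.5, 0.5, 2.0, 2.0]])
true_boxes = np.array([[0.0, 0.05, 1.0, 2.0], [0.5, 0.5, 2.0, 2.0]])
l_r = regression_loss(pred_boxes, true_boxes)
l_c = focal_loss(np.array([0.9, 0.5]))
```

The two branches of the smooth $l_{1}$ loss meet at $|B_{i}-T_{i}|=\beta$, where both evaluate to $\beta/2$, and the focal term down-weights confidently classified boxes relative to uncertain ones.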

Non-human Primate Motor Prediction

All baseline models have linear feed-forward layer architectures, where ANN, ANN_Flat, and SNN_Flat have topologies $N_{ch}-32-48-2$, and SNN uses $N_{ch}-50-2$. The varying topologies between SNN and SNN_Flat attempt to optimize for complexity in the former and correctness in the latter.

The LIF neurons used in the SNN networks are developed using snnTorch [18], and have potential dynamics shown in Equations 16 and 17. Note that unlike the SpikingJelly neurons (Equations 10, 11, and 12), the potential $\mathbf{u}(t)$ is reset in the timestep following a spike, rather than during the same timestep. As before, $V_{reset}$ is 0.0 and $V_{th}$ is 1.0, while $\beta$ is 0.96 for the SNN baseline and 0.50 for the SNN_Flat baseline. The potential of the readout neurons in both baselines is directly read to produce velocity predictions, thus there is no spiking or reset mechanism and the neurons function as leaky accumulators.

$$\mathbf{u}(t)=\begin{cases}\beta\mathbf{u}(t-1)+X(t)&\text{if }\mathbf{s}(t-1)=0\\ V_{reset}&\text{if }\mathbf{s}(t-1)=1\end{cases} \tag{16}$$

$$\mathbf{s}(t)=\begin{cases}0&\text{if }\mathbf{u}(t)\leq V_{th}\\ 1&\text{if }\mathbf{u}(t)>V_{th}\end{cases} \tag{17}$$
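The delayed reset in Equations 16 and 17 can be seen by stepping the dynamics by hand. The sketch below uses plain NumPy rather than the snnTorch API; the function name `lif_step_delayed_reset` and the constant input of 0.6 are illustrative assumptions, and $\beta=0.5$ corresponds to the SNN_Flat setting.

```python
import numpy as np

def lif_step_delayed_reset(u_prev, s_prev, x, beta=0.5, v_th=1.0, v_reset=0.0):
    """One LIF timestep where the reset happens the step after a spike."""
    u = np.where(s_prev == 1, v_reset, beta * u_prev + x)  # Eq. 16
    s = (u > v_th).astype(u.dtype)                         # Eq. 17
    return u, s

u, s = np.zeros(1), np.zeros(1)
spikes, potentials = [], []
for _ in range(7):
    u, s = lif_step_delayed_reset(u, s, np.array([0.6]))   # constant input (assumption)
    spikes.append(int(s[0]))
    potentials.append(float(u[0]))
# spikes == [0, 0, 1, 0, 0, 0, 1]
```

The potential stays above threshold during the spiking step and is only set to $V_{reset}$ in the following step, which is the difference from the same-timestep reset of Equations 10 to 12.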

ANN, ANN_Flat, and SNN_Flat are trained using mean-squared error (MSE) loss over 50 epochs. The SNN baseline calculates the loss over a sliding window of 50 consecutive data points (single-point stride), representing 200 ms of data, which provides more information for backpropagation and helps avoid dead neurons and vanishing gradients. The MSE loss is linearly weighted from 0 to 1 across the 50 points within the window. The SNN is trained with 10-fold cross-validation, using an early-stopping regime with a patience (the number of epochs without improvement on the validation set) of 10 epochs.
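One way to realize the linearly weighted window loss is sketched below in NumPy. The function name `weighted_window_mse` is hypothetical, and normalizing by the sum of the weights is an assumption; the text specifies only that the weights rise linearly from 0 to 1 across the 50 points.

```python
import numpy as np

def weighted_window_mse(pred, target):
    """MSE over a window, weighted linearly from 0 (oldest) to 1 (newest).

    pred, target: (window, 2) arrays of predicted and true velocities.
    Normalizing by the weight sum is an assumption, not stated in the text.
    """
    weights = np.linspace(0.0, 1.0, pred.shape[0])
    per_step = ((pred - target) ** 2).mean(axis=1)   # MSE at each timestep
    return (weights * per_step).sum() / weights.sum()

window = 50
pred = np.ones((window, 2))
target = np.zeros((window, 2))
loss = weighted_window_mse(pred, target)
```

Under this weighting, errors at the end of the window, where the network has integrated the most input, contribute the most to the gradient.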

Chaotic Function Prediction

The LSTM baseline uses one LSTM layer followed by a ReLU activation and linear readout layer. As input, the LSTM uses an explicit memory buffer of the last $M=50$ points. During training, input $\mathbf{x}(t)$ to the LSTM uses the Mackey-Glass data $\mathbf{f}(t)$ (Equation 18), whereas, during autoregressive evaluation, the input uses prior predictions $\mathbf{y}(t)$ (Equation 19). Values $\mathbf{u}(t<0)$ and $\mathbf{v}(t<0)$ are zero.

$$\mathbf{x}(t)=(\mathbf{f}(t-M),\mathbf{f}(t-M+1),\ldots,\mathbf{f}(t)) \tag{18}$$

$$\mathbf{x}(t)=(\mathbf{y}(t-M-1),\mathbf{y}(t-M),\ldots,\mathbf{y}(t-1)) \tag{19}$$
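The difference between the two input modes can be sketched with a fixed-length buffer. The example below is illustrative only: `autoregressive_rollout`, the `warmup` argument, and the toy one-step predictor are hypothetical stand-ins, and the exact buffer indexing of the baseline is not reproduced here. The buffer starts at zero, is first filled from ground-truth data, and is then fed with the model's own predictions.

```python
from collections import deque

def autoregressive_rollout(predict, warmup, steps, M=50):
    """Roll a one-step predictor forward using a buffer of its own outputs.

    predict: callable mapping a list of M past values to the next value
             (stand-in for the trained model, an assumption).
    warmup:  ground-truth values used to fill the buffer before the rollout.
    """
    buffer = deque([0.0] * M, maxlen=M)   # values before t = 0 are zero
    for value in warmup:
        buffer.append(value)              # data-driven input, as in training
    predictions = []
    for _ in range(steps):
        y = predict(list(buffer))
        predictions.append(y)
        buffer.append(y)                  # feed the prediction back as input
    return predictions

# Toy predictor (assumption): next value is the last buffered value plus one.
preds = autoregressive_rollout(lambda buf: buf[-1] + 1.0, [1.0, 2.0, 3.0], steps=3, M=5)
# preds == [4.0, 5.0, 6.0]
```

Because each prediction re-enters the buffer, errors compound over the rollout, which is what makes autoregressive evaluation on a chaotic series demanding.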

The LSTM is trained using MSE loss for backpropagation for 200 epochs. The hyperparameter sweep used the evaluation setup of 30 instantiations of $\tau=17$ Mackey-Glass data, with each instance shifted forward by half of the Lyapunov time. The corresponding sets with the lowest sMAPE scores were used to report the results.

For the ESN, the standard architecture with one hidden layer (i.e., reservoir) with recurrent connections was used, where the states of the reservoir $\mathbf{r}(t)\in\mathbb{R}^{D}$ at timesteps $t$ evolve according to the dynamics shown in Equation 20. The random matrix $\mathbf{W}^{\mathrm{in}}\in\mathbb{R}^{D\times(d+1)}$, with components drawn from the uniform distribution, projects the $d$-dimensional input $\mathbf{f}(t)$ ($d=1$ for the Mackey-Glass system), augmented with a constant bias, into the $D$ neurons of the reservoir. The recurrent connectivity is defined by the second (potentially sparse) random matrix $\mathbf{W}\in\mathbb{R}^{D\times D}$ with nonzero components drawn from the normal distribution; $\alpha$, $\gamma$, and $\beta$ are hyperparameters controlling the behavior of the ESN.

$$\mathbf{r}(t)=(1-\alpha)\mathbf{r}(t-1)+\alpha\tanh\left(\gamma\mathbf{W}\mathbf{r}(t-1)+\beta\mathbf{W}^{\mathrm{in}}[1;\mathbf{f}(t)]\right) \tag{20}$$
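The reservoir update of Equation 20 can be sketched in a few lines of NumPy. This is not the baseline implementation: the function name `esn_step`, the reservoir size, the hyperparameter values, the uniform range $[-1, 1]$, and the sine-wave input are all assumptions chosen only to make the example runnable.

```python
import numpy as np

def esn_step(r_prev, f_t, W, W_in, alpha, gamma, beta):
    """One reservoir update following Eq. 20."""
    u = np.concatenate(([1.0], np.atleast_1d(f_t)))       # [1; f(t)]
    pre = gamma * (W @ r_prev) + beta * (W_in @ u)
    return (1.0 - alpha) * r_prev + alpha * np.tanh(pre)

rng = np.random.default_rng(0)
D, d = 20, 1                                    # toy reservoir size (assumption)
W_in = rng.uniform(-1.0, 1.0, size=(D, d + 1))  # uniform input weights
W = rng.standard_normal((D, D))                 # normally distributed recurrent weights

r = np.zeros(D)
for f_t in np.sin(0.1 * np.arange(100)):        # stand-in input series
    r = esn_step(r, f_t, W, W_in, alpha=0.5, gamma=0.1, beta=1.0)
```

Since each update is a convex combination of the previous state and a $\tanh$ output, the state stays within $[-1, 1]$ whenever $0\leq\alpha\leq 1$ and the initial state does.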

To make a prediction $\mathbf{y}(t)$, the ESN uses the readout matrix $\mathbf{W}^{\mathrm{out}}\in\mathbb{R}^{d\times(D+d+1)}$ that computes the activation of the output layer based on the current states of the input and hidden layers: $\mathbf{y}(t)=\mathbf{W}^{\mathrm{out}}[\mathbf{f}(t);\mathbf{r}(t)]$. The output layer has $d$ neurons, and $\mathbf{y}(t)$ predicts the values of the system at the next timestep, $\mathbf{f}(t+1)$.

The training of $\mathbf{W}^{\mathrm{out}}$ is formulated as a linear regression problem so that it can be computed with the regularized least squares estimator (Equation 21), where $\mathbf{H}\in\mathbb{R}^{M\times(D+d+1)}$ is an activation matrix that stores the readout for $M$ timesteps in the training data, $\mathbf{Y}\in\mathbb{R}^{M\times d}$ is another matrix that stores the corresponding ground-truth values for the same timesteps, and $\lambda$ is the regularization parameter of the estimator.

$$\mathbf{W}^{\mathrm{out}}=\mathbf{Y}^{\top}\mathbf{H}\left(\mathbf{H}^{\top}\mathbf{H}+\lambda\mathbf{I}\right)^{-1} \tag{21}$$
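Equation 21 is the standard ridge regression solution and can be computed without forming an explicit inverse. The sketch below is illustrative: the function name `ridge_readout`, the random activation matrix, and the noise-free targets are assumptions. Because $\mathbf{H}^{\top}\mathbf{H}+\lambda\mathbf{I}$ is symmetric, solving $(\mathbf{H}^{\top}\mathbf{H}+\lambda\mathbf{I})\mathbf{X}=\mathbf{H}^{\top}\mathbf{Y}$ and transposing gives the same $\mathbf{W}^{\mathrm{out}}$.

```python
import numpy as np

def ridge_readout(H, Y, lam):
    """Regularized least squares readout, equivalent to Eq. 21.

    H: (M, n_features) activation matrix, Y: (M, d) targets.
    Returns W_out with shape (d, n_features).
    """
    A = H.T @ H + lam * np.eye(H.shape[1])
    return np.linalg.solve(A, H.T @ Y).T

# Synthetic check (assumption): noise-free targets from known weights.
rng = np.random.default_rng(0)
H = rng.standard_normal((200, 6))
W_true = rng.standard_normal((1, 6))
Y = H @ W_true.T

lam = 1e-8
W_out = ridge_readout(H, Y, lam)

# Literal form of Eq. 21, for comparison only.
W_explicit = Y.T @ H @ np.linalg.inv(H.T @ H + lam * np.eye(6))
```

Using `np.linalg.solve` is generally better conditioned than multiplying by an explicit inverse, and for small $\lambda$ and noise-free data the estimate recovers the generating weights.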

As for the LSTM, optimal hyperparameters are chosen based on the lowest average sMAPE score over 30 time series. For each series, the ESN weight matrices $\mathbf{W}^{\mathrm{in}}$ and $\mathbf{W}$ were randomly initialized. The corresponding sets with the lowest sMAPE scores were used to report the results.

Acknowledgements

This work has been supported in part by Semiconductor Research Corporation (SRC), the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 101001448), a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 11200922], ARC Laureate Fellowship FL210100156, and the EU H2020 project BeFerroSynaptic (871737). The authors would like to acknowledge the financial support of the CogniGron research center and the Ubbo Emmius Funds (Univ. of Groningen).

Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC (NTESS), a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration (DOE/NNSA) under contract DE-NA0003525. This written work is authored by an employee of NTESS. The employee, not NTESS, owns the right, title and interest in and to the written work and is responsible for its contents. Any subjective views or opinions that might be expressed in the written work do not necessarily represent the views of the U.S. Government. The publisher acknowledges that the U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this written work or allow others to do so, for U.S. Government purposes. The DOE will provide public access to results of federally sponsored research in accordance with the DOE Public Access Plan. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.

Author contributions statement

Authors are grouped based on contributions, and ordered alphabetically within groups. JY led project discussions and management. JY and KVdB implemented harness metric infrastructure, conducted experiments, and prepared the manuscript. The following authors primarily developed the main results: DdB and MF on the keyword few-shot class-incremental learning task; GT and SW on the event camera object detection task; PH, PVS and BZ on the non-human primate motor prediction task; YB and DK on the chaotic function prediction task; and NP led harness infrastructure development. SHA, GVJ, BL, AM, AKM, GL, and TS developed components of the harness infrastructure. ZA, MA, BA, AGA, CB, AB, PB, SB, SB, GC, EC, FC, GdC, AD, AD, MD, YD, JE, TF, JF, VF, SF, PMF, WG, AG, HAG, GI, SJ, VK, LK, JCK, LK, RK, DK, YL, SL, HM, RM, FMM, CM, KM, DM, EN, TN, FO, AO, PP, JP, MP, CP, MAP, AP, CP, AR, YS, CJSS, AvS, JS, SS, CS, JS, SS, SBS, MS, AS, MS, KS, TCS, PS, JT, NT, GU, MV, CMV, BV, AY, and FTZ participated in discussions during meetings, prepared sections for the present manuscript and/or its preprint, and reviewed the manuscript. CF and VJR jointly supervised the project, reviewed the manuscript, and analyzed results.

References

  • [1] Sevilla, J. et al. Compute trends across three eras of machine learning. In 2022 International Joint Conference on Neural Networks (IJCNN), 1–8, DOI: https://doi.org/10.1109/IJCNN55064.2022.9891914 (2022).
  • [2] Shankar, S. & Reuther, A. Trends in energy estimates for computing in ai/machine learning accelerators, supercomputers, and compute-intensive applications. In 2022 IEEE High Performance Extreme Computing Conference (HPEC), 1–8, DOI: https://doi.org/10.1109/HPEC55821.2022.9926296 (2022).
  • [3] Ray, P. P. A review on tinyml: State-of-the-art and prospects. Journal of King Saud University - Computer and Information Sciences 34, 1595–1623, DOI: https://doi.org/10.1016/j.jksuci.2021.11.019 (2022).
  • [4] Schuman, C. D. et al. A survey of neuromorphic computing and neural networks in hardware (2017). https://doi.org/10.48550/arXiv.1705.06963.
  • [5] James, C. D. et al. A historical survey of algorithms and hardware architectures for neural-inspired and neuromorphic computing applications. Biologically Inspired Cognitive Architectures 19, DOI: https://doi.org/10.1016/j.bica.2016.11.002 (2017).
  • [6] Thakur, C. S. et al. Large-scale neuromorphic spiking array processors: A quest to mimic the brain. Frontiers in Neuroscience 12, DOI: https://doi.org/10.3389/fnins.2018.00891 (2018).
  • [7] Mead, C. A. Neuromorphic electronic systems. Proceedings of the IEEE 78, 1629–1636, DOI: https://doi.org/10.1109/5.58356 (1990).
  • [8] Schuman, C. et al. Opportunities for neuromorphic computing algorithms and applications. Nature Computational Science 2, 10–19, DOI: https://doi.org/10.1038/s43588-021-00184-y (2022).
  • [9] Frenkel, C., Bol, D. & Indiveri, G. Bottom-up and top-down approaches for the design of neuromorphic processing systems: Tradeoffs and synergies between natural and artificial intelligence. Proceedings of the IEEE 111, 623–652, DOI: https://doi.org/10.1109/JPROC.2023.3273520 (2023).
  • [10] Davies, M. Benchmarks for progress in neuromorphic computing. Nature Machine Intelligence 1, 386–388 (2019).
  • [11] Orchard, G., Jayawant, A., Cohen, G. K. & Thakor, N. Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in Neuroscience 9, DOI: https://doi.org/10.3389/fnins.2015.00437 (2015).
  • [12] Amir, A. et al. A low power, fully event-based gesture recognition system. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7243–7252, DOI: https://doi.org/10.1109/CVPR.2017.781 (2017).
  • [13] Cramer, B., Stradmann, Y., Schemmel, J. & Zenke, F. The heidelberg spiking data sets for the systematic evaluation of spiking neural networks. IEEE Transactions on Neural Networks and Learning Systems 33, 2744–2757, DOI: https://doi.org/10.1109/tnnls.2020.3044364 (2022).
  • [14] Ostrau, C., Klarhorst, C., Thies, M. & Rückert, U. Benchmarking neuromorphic hardware and its energy expenditure. Frontiers in Neuroscience 16, DOI: https://doi.org/10.3389/fnins.2022.873935 (2022).
  • [15] Milde, M. B. et al. Neuromorphic engineering needs closed-loop benchmarks. Frontiers in Neuroscience 16, DOI: https://doi.org/10.3389/fnins.2022.813555 (2022).
  • [16] Kulkarni, S. R., Parsa, M., Mitchell, J. P. & Schuman, C. D. Benchmarking the performance of neuromorphic and spiking neural network simulators. Neurocomputing 447, 145–160, DOI: https://doi.org/10.1016/j.neucom.2021.03.028 (2021).
  • [17] Gewaltig, M.-O. & Diesmann, M. Nest (neural simulation tool). Scholarpedia 2, 1430 (2007).
  • [18] Eshraghian, J. K. et al. Training spiking neural networks using lessons from deep learning. Proceedings of the IEEE 111, 1016–1054, DOI: https://doi.org/10.1109/JPROC.2023.3308088 (2023).
  • [19] Reddi, V. J. et al. Mlperf inference benchmark. In Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture, ISCA ’20, 446–459, DOI: https://doi.org/10.1109/ISCA45697.2020.00045 (IEEE Press, 2020).
  • [20] Mattson, P. et al. Mlperf training benchmark. Proceedings of Machine Learning and Systems 2, 336–349 (2020).
  • [21] Mazumder, M. et al. Multilingual spoken words corpus. In Vanschoren, J. & Yeung, S. (eds.) Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, vol. 1 (Curran, 2021).
  • [22] Perot, E., de Tournemire, P., Nitti, D., Masci, J. & Sironi, A. Learning to detect objects with a 1 megapixel event camera. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20 (2020).
  • [23] O’Doherty, J. E., Cardoso, M. M. B., Makin, J. G. & Sabes, P. N. Nonhuman primate reaching with multichannel sensorimotor cortex electrophysiology, DOI: https://doi.org/10.5281/zenodo.788569 (2017).
  • [24] Mackey, M. C. & Glass, L. Oscillation and chaos in physiological control systems. Science 197, 287–289 (1977).
  • [25] Kudithipudi, D. et al. Biological underpinnings for lifelong learning machines. Nature Machine Intelligence 4, 196–210, DOI: https://doi.org/10.1038/s42256-022-00452-0 (2022).
  • [26] Tao, X. et al. Few-shot class-incremental learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020).
  • [27] Gallego, G. et al. Event-based vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 154–180, DOI: https://doi.org/10.1109/TPAMI.2020.3008413 (2022).
  • [28] Lin, T.-Y. et al. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014, 740–755 (2014).
  • [29] Jaeger, H. & Haas, H. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science 304, 78–80 (2004).
  • [30] Mukhopadhyay, S. & Banerjee, S. Learning dynamical systems in noise using convolutional neural networks. Chaos: An Interdisciplinary Journal of Nonlinear Science 30, 103125 (2020).
  • [31] Chilkuri, N. R. & Eliasmith, C. Parallelizing legendre memory unit training. In International Conference on Machine Learning, 1898–1907 (PMLR, 2021).
  • [32] Makridakis, S., Spiliotis, E. & Assimakopoulos, V. The m4 competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting 36, 54–74, DOI: https://doi.org/10.1016/j.ijforecast.2019.04.014 (2020). M4 Competition.
  • [33] Gilpin, W. Model scale versus domain knowledge in statistical forecasting of chaotic systems (2023). https://doi.org/10.48550/arXiv.2303.08011.
  • [34] Paszke, A. et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, 8024–8035 (Curran Associates, Inc., 2019).
  • [35] Fang, W. et al. Spikingjelly. https://github.com/fangwei123456/spikingjelly (2020).
  • [36] Intel. Lava software framework. https://github.com/lava-nc/lava (2021).
  • [37] Aimone, J. B., Severa, W. & Vineyard, C. M. Composing neural algorithms with fugu. In Proceedings of the International Conference on Neuromorphic Systems, 1–8 (2019).
  • [38] Lemaire, E. et al. An analytical estimation of spiking neural networks energy efficiency. In Neural Information Processing, 574–587, DOI: https://doi.org/10.1007/978-3-031-30105-6_48 (Springer International Publishing, 2023).
  • [39] Fra, V. et al. Human activity recognition: suitability of a neuromorphic approach for on-edge aiot applications. Neuromorphic Computing and Engineering 2, DOI: https://doi.org/10.1088/2634-4386/ac4c38 (2022).
  • [40] Dai, W., Dai, C., Qu, S., Li, J. & Das, S. Very deep convolutional neural networks for raw waveforms (2016). https://doi.org/10.48550/arXiv.1610.00087.
  • [41] Bittar, A. & Garner, P. N. A surrogate gradient spiking baseline for speech command recognition. Frontiers in Neuroscience 16, DOI: https://doi.org/10.3389/fnins.2022.865897 (2022).
  • [42] Stewart, K. M., Shea, T., Pacik-Nelson, N., Gallo, E. & Danielescu, A. Speech2spikes: Efficient audio encoding pipeline for real-time neuromorphic systems. In Proceedings of the 2023 Annual Neuro-Inspired Computational Elements Conference, NICE ’23, 71–78, DOI: https://doi.org/10.1145/3584954.3584995 (Association for Computing Machinery, New York, NY, USA, 2023).
  • [43] Snell, J., Swersky, K. & Zemel, R. Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems 30 (2017).
  • [44] Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7132–7141, DOI: https://doi.org/10.1109/CVPR.2018.00745 (2018).
  • [45] Shi, X. et al. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, 802–810 (MIT Press, Cambridge, MA, USA, 2015).
  • [46] Liu, W. et al. SSD: Single shot MultiBox detector. In Computer Vision – ECCV 2016, 21–37, DOI: https://doi.org/10.1007/978-3-319-46448-0_2 (Springer International Publishing, 2016).
  • [47] Willsey, M. et al. Real-time brain-machine interface in non-human primates achieves high-velocity prosthetic finger movements using a shallow feedforward neural network decoder. Nature Communications 13, 6899, DOI: https://doi.org/10.1038/s41467-022-34452-w (2022).
  • [48] Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Computation 9, 1735–1780, DOI: https://doi.org/10.1162/neco.1997.9.8.1735 (1997).
  • [49] Scardapane, S. & Wang, D. Randomness in neural networks: an overview. Data Mining and Knowledge Discovery 7, 1–18, DOI: https://doi.org/10.1002/widm.1200 (2017).
  • [50] Davies, M. et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 38, 82–99, DOI: https://doi.org/10.1109/MM.2018.112130359 (2018).
  • [51] Mayr, C., Hoeppner, S. & Furber, S. Spinnaker 2: A 10 million core processor system for brain simulation and machine learning (2019). https://doi.org/10.48550/arXiv.1911.02385.
  • [52] Speck. https://www.synsense.ai/products/speck/. Accessed: 2023-04-03.
  • [53] Innatera’s Spiking Neural Processor (SNP). www.innatera.com/snp.pdf. Accessed: 2023-11-24.
  • [54] Levy, M. Innatera’s Spiking Neural Processor - brain-like architecture targets ultra-low power ai. https://www.innatera.com/innatera-mpr-2021.pdf (2021). Accessed: 2023-12-18.
  • [55] Banbury, C. et al. Mlperf tiny benchmark. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (2021).
  • [56] TOP500. Top500. https://www.top500.org/ (2023).
  • [57] Green500. Green500. https://www.top500.org/lists/green500/ (2023).
  • [58] MLCommons. Mlcommons power working group. https://mlcommons.org/en/groups/best-practices-power/ (2023).
  • [59] MLCommons. Mlperf inference policies. https://github.com/mlcommons/inference_policies/ (2023).
  • [60] Heittola, T., Mesaros, A. & Virtanen, T. Acoustic scene classification in dcase 2020 challenge: generalization across devices and low complexity solutions. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020), 56–60 (2020).
  • [61] Koch, T., Berthold, T., Pedersen, J. & Vanaret, C. Progress in mathematical programming solvers from 2001 to 2020. EURO Journal on Computational Optimization 10, 100031, DOI: https://doi.org/10.1016/j.ejco.2022.100031 (2022).
  • [62] Aimone, J. B. et al. A review of non-cognitive applications for neuromorphic computing. Neuromorphic Computing and Engineering 2, 032003, DOI: https://doi.org/10.1088/2634-4386/ac889c (2022).
  • [63] Bai, J., Lu, F., Zhang, K. et al. Onnx: Open neural network exchange. https://github.com/onnx/onnx (2019).
  • [64] Pedersen, J. E. et al. Neuromorphic intermediate representation: A unified instruction set for interoperable brain-inspired computing (2023). https://doi.org/10.48550/arXiv.2311.14641.
  • [65] Rhodes, O. et al. spynnaker: A software package for running pynn simulations on spinnaker. Frontiers in Neuroscience 12, DOI: https://doi.org/10.3389/fnins.2018.00816 (2018).
  • [66] Stewart, T. C., DeWolf, T., Kleinhans, A. & Eliasmith, C. Closed-loop neuromorphic benchmarks. Frontiers in Neuroscience 9, DOI: https://doi.org/10.3389/fnins.2015.00464 (2015).
  • [67] Jeffares, A., Guo, Q., Stenetorp, P. & Moraitis, T. Spike-inspired rank coding for fast and accurate recurrent neural networks. In International Conference on Learning Representations (2022).
  • [68] Prophesee. Event-based vision software - metavision intelligence. https://www.prophesee.ai/metavision-intelligence/ (2023).
  • [69] Makin, J. G., O’Doherty, J. E., Cardoso, M. M. B. & Sabes, P. Superior arm-movement decoding from cortex with a new, unsupervised-learning algorithm. J. Neural Eng. 15, DOI: https://doi.org/10.1088/1741-2552/aa9e95 (2018).
  • [70] Ansmann, G. Efficiently and easily integrating differential equations with JiTCODE, JiTCDDE, and JiTCSDE. Chaos 28, 043116, DOI: https://doi.org/10.1063/1.5019320 (2018).
  • [71] Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 318–327, DOI: https://doi.org/10.1109/TPAMI.2018.2858826 (2020).

Additional information

Competing interests statement

The NeuroBench benchmark framework was developed collaboratively and specifically to allow for as objective and applicable a comparison as possible. The selection of initial benchmark tasks reflects the authors’ research interests, which include commercial interests for companies. These do not affect our results in any way, nor the value of our contributions. The benchmark harness and framework are open-source and intended to be further extended by the community over time.