NeuroBench: A Framework for Benchmarking Neuromorphic Computing Algorithms and Systems

Jason Yik Harvard University Correspondence to: jyik@g.harvard.edu Korneel Van den Berghe Harvard University Delft University of Technology Douwe den Blanken Delft University of Technology Younes Bouhadjar Forschungszentrum Jülich Maxime Fabre University of Groningen Paul Hueber Delft University of Technology IMEC Netherlands Denis Kleyko Research Institutes of Sweden Örebro University Noah Pacik-Nelson Accenture Labs Pao-Sheng Vincent Sun City University of Hong Kong Guangzhi Tang IMEC Netherlands Shenqi Wang IMEC Netherlands Eindhoven University of Technology Biyan Zhou City University of Hong Kong Soikat Hasan Ahmed Forschungszentrum Jülich George Vathakkattil Joseph Innatera Nanosystems B.V. Benedetto Leto Politecnico di Torino Aurora Micheli Delft University of Technology Anurag Kumar Mishra Forschungszentrum Jülich Gregor Lenz NeuroBus Tao Sun Centrum Wiskunde & Informatica Zergham Ahmed Harvard University Mahmoud Akl SpiNNcloud Systems GmbH Brian Anderson Intel Andreas G. Andreou Johns Hopkins University Chiara Bartolozzi Istituto Italiano di Tecnologia Arindam Basu City University of Hong Kong Petrut Bogdan Innatera Nanosystems B.V. Sander Bohte Centrum Wiskunde & Informatica Sonia Buckley National Institute of Standards and Technology Gert Cauwenberghs UCSD Elisabetta Chicca University of Groningen Federico Corradi Eindhoven University of Technology Guido de Croon Delft University of Technology Andreea Danielescu Accenture Labs Anurag Daram UTSA Mike Davies Intel Yigit Demirag University of Zurich ETH Zurich Jason Eshraghian UCSC Tobias Fischer Queensland University of Technology Jeremy Forest Cornell University Vittorio Fra Politecnico di Torino Steve Furber University of Manchester P. Michael Furlong U Waterloo William Gilpin University of Texas at Austin Aditya Gilra Centrum Wiskunde & Informatica Hector A. 
Gonzalez SpiNNcloud Systems GmbH Giacomo Indiveri University of Zurich ETH Zurich Siddharth Joshi University of Notre Dame Vedant Karia UTSA Lyes Khacef Sony Europe B.V. James C. Knight University of Sussex Laura Kriener University of Bern Rajkumar Kubendran University of Pittsburgh Dhireesha Kudithipudi UTSA Yao-Hong Liu IMEC Netherlands Shih-Chii Liu University of Zurich ETH Zurich Haoyuan Ma CentraleSupélec, Université Paris-Saclay Rajit Manohar Yale University Josep Maria Margarit-Taulé Instituto de Microelectrónica de Barcelona Christian Mayr Technische Universität Dresden Konstantinos Michmizos Rutgers University Dylan Muir SynSense AI Emre Neftci Forschungszentrum Jülich RWTH Aachen Thomas Nowotny University of Sussex Fabrizio Ottati Politecnico di Torino Ayca Ozcelikkale Uppsala University Priyadarshini Panda Yale University Jongkil Park Korea Institute of Science and Technology Melika Payvand University of Zurich ETH Zurich Christian Pehle Heidelberg University Mihai A. Petrovici University of Bern Alessandro Pierro Intel Christoph Posch Prophesee Alpha Renner Forschungszentrum Jülich Yulia Sandamirskaya Intel ZHAW Clemens JS Schaefer University of Notre Dame André van Schaik Western Sydney University Johannes Schemmel Heidelberg University Samuel Schmidgall Johns Hopkins University Catherine Schuman University of Tennessee Jae-sun Seo Cornell Tech Sadique Sheik SynSense AI Sumit Bam Shrestha Intel Manolis Sifalakis IMEC Netherlands Amos Sironi Prophesee Matthew Stewart Harvard University Kenneth Stewart UCI Forschungszentrum Jülich Terrence C. Stewart National Research Council Canada Philipp Stratmann Intel Jonathan Timcheck Intel Nergis Tömen Delft University of Technology Gianvito Urgese Politecnico di Torino Marian Verhelst KU Leuven Craig M. 
Vineyard Sandia National Laboratories Bernhard Vogginger Technische Universität Dresden Amirreza Yousefzadeh IMEC Netherlands Fatima Tuz Zohora UTSA Charlotte Frenkel Delft University of Technology Joint supervision Vijay Janapa Reddi Harvard University Joint supervision

Abstract

Neuromorphic computing shows promise for advancing computing efficiency and capabilities of AI applications using brain-inspired principles. However, the neuromorphic research field currently lacks standardized benchmarks, making it difficult to accurately measure technological advancements, compare performance with conventional methods, and identify promising future research directions. Prior neuromorphic computing benchmark efforts have not seen widespread adoption due to a lack of inclusive, actionable, and iterative benchmark design and guidelines. To address these shortcomings, we present NeuroBench: a benchmark framework for neuromorphic computing algorithms and systems. NeuroBench is a collaboratively-designed effort from an open community of nearly 100 co-authors across over 50 institutions in industry and academia, aiming to provide a representative structure for standardizing the evaluation of neuromorphic approaches. The NeuroBench framework introduces a common set of tools and systematic methodology for inclusive benchmark measurement, delivering an objective reference framework for quantifying neuromorphic approaches in both hardware-independent (algorithm track) and hardware-dependent (system track) settings. In this article, we present initial performance baselines across various model architectures on the algorithm track and outline the system track benchmark tasks and guidelines. NeuroBench is intended to continually expand its benchmarks and features to foster and track the progress made by the research community.

Keywords: benchmark, neuromorphic


Introduction

In recent years, the rapid growth of artificial intelligence (AI) and machine learning (ML) has resulted in increasingly complex and large models in pursuit of higher accuracy and a wider range of use cases [1]. The growth rate of model computation exceeds the efficiency gains realized through Moore and Dennard technology scaling [2], indicating a looming limit to continued advancements with existing techniques. This issue is compounded by the open challenges of adapting such methods for resource-constrained edge devices (tinyML) in order to enable pervasive and decentralized intelligence through the Internet of Things (IoT) [3]. As such, the urgency for exploring new resource-efficient and scalable computing architectures has intensified.

Neuromorphic computing has emerged as a promising area in addressing these challenges, aiming to unlock key hallmarks of biological intelligence by porting primitives and computational strategies employed in the brain into engineered computing devices and algorithms [4, 5, 6]. Neuromorphic systems hold a critical position in the investigation of novel architectures, as the brain exemplifies an exceptional model for accomplishing scalable, energy-efficient, and real-time embodied computation.

Initially, the term “neuromorphic” referred specifically to approaches that aimed to emulate the biophysics of the brain by leveraging physical properties of silicon, as proposed by Mead in the 1980s [7]. However, the field of neuromorphic computing research has since grown to encompass a wide range of brain-inspired computing techniques at the algorithmic, hardware, and system levels [4]. While the range of approaches is diverse, neuromorphic computing research generally utilizes mechanisms emulating or simulating biophysical properties more closely than conventional methods, aiming to reproduce high-level performance and efficiency characteristics of biological neural systems.

Neuromorphic algorithms [8] encompass neuroscience-inspired methods which strive towards goals of expanded learning capabilities, such as predictive intelligence, data efficiency, and adaptation, and include approaches such as spiking neural networks (SNNs) and primitives of neuron dynamics, plastic synapses, and heterogeneous network architectures. Algorithm exploration often makes use of simulated execution on readily-available conventional hardware such as CPUs and GPUs, with the goal of driving design requirements for next-generation neuromorphic hardware.

Neuromorphic systems [9] are composed of algorithms deployed to hardware, which seek greater energy efficiency, real-time processing capabilities, and resilience compared to conventional systems. Neuromorphic hardware utilizes a variety of biologically-inspired hardware approaches, including analog neuron emulation, event-based computation, non-von-Neumann architectures, and in-memory processing. Neuromorphic systems target a wide range of applications, from neuroscientific exploration, to low-power edge intelligence and datacenter-scale acceleration.

Despite its promise, progress in neuromorphic research is impeded by the absence of fair and widely adopted objective metrics and benchmarks [10, 8]. Without such benchmarks, the validity of neuromorphic solutions cannot be directly quantified, hindering the research community from measuring technological advancement. Standard and rigorous benchmarking is necessary for the neuromorphic community to objectively assess and compare the achievements of novel approaches, and to make evidence-based decisions on which directions show promise for achieving breakthrough efficiency, speed, and intelligence. Such benchmarking helps to focus research and commercialization efforts on techniques that concretely improve on prior work and conventional computing. Neuromorphic benchmarks have been previously proposed for classical vision [11, 12] and audition tasks [13], open-loop [14] and closed-loop [15] tasks, and for SNN simulator performance assessment [16]. While prior works have made valuable contributions, there are opportunities to further advance the field by addressing three outstanding challenges:

• Lack of a formal definition. The variety of approaches to exploring brain-inspired principles creates difficulties in defining a set of criteria for what should be benchmarked as a “neuromorphic” solution. Closed definitions can impose narrow assumptions and thus risk unfairly excluding promising methods. This challenge necessitates inclusive benchmarks that can be applied generally across the spectrum of potential approaches, allowing for flexible implementation while focusing on task capabilities and metrics of interest such as temporal processing and efficiency. Furthermore, the benchmarks should ideally allow for direct comparison of neuromorphic and conventional approaches.

• Implementation diversity. A wide array of different frameworks targeting different goals, such as neuroscientific exploration [17] and automatic SNN training [18], are used in neuromorphic research. This diversity, which has been instrumental in exploring the landscape of bio-inspired techniques following different methodologies and abstraction levels, comes at the cost of portability and standardization, which in turn limits the ease of benchmark implementation. Benchmarks require common infrastructure that unites tooling to enable actionable implementation and comparison of new methods.

• Rapid research evolution. Neuromorphic approaches are continually and rapidly evolving as part of an emerging field. As the research community continues to make technological progress, so too should benchmark suites and methodology expand to foster inclusion and capture salient performance metrics. An iterative benchmark framework with structured versioning will facilitate productive foundational and evolving performance evaluation.

Figure 1: The two NeuroBench tracks: algorithms and systems. Grey boxes designate what is defined by the benchmark, and orange boxes indicate what is unique to each solution. Connecting arrows between the two tracks denote the co-innovation between the tracks and the cross-stack innovation enabled by this approach. Between algorithm and system solutions, best-performing results from each track can motivate future solutions to the other. In addition, system metrics and results can inform hardware-independent algorithmic complexity metrics.

To tackle these challenges, this article presents NeuroBench, a dual-track, multi-task benchmark framework. NeuroBench addresses the existing neuromorphic benchmark challenges by advancing prior work in three distinct ways. Firstly, the benchmark framework reduces assumptions regarding the specific solution being assessed, encouraging inclusive participation of neuromorphic and non-neuromorphic approaches by utilizing general, task-level benchmarking and hierarchical metric definitions which capture key performance indicators of interest. Secondly, the NeuroBench benchmarks are associated with a common open-source benchmark harness tool which facilitates actionable benchmark implementation and offers structure for further expansion to neuromorphic algorithm frameworks and systems. Finally, NeuroBench establishes an iterative, community-driven initiative designed to evolve over time to ensure representation and relevance to neuromorphic research, analogous to the well-established MLPerf benchmark framework for machine learning [19, 20]. As a whole, NeuroBench intends to align the neuromorphic research community on standard benchmarking, providing a dynamically evolving platform to ensure ongoing relevance and facilitate advancements through workshops, competitions, and a centralized leaderboard.

As Figure 1 shows, the NeuroBench framework involves two tracks to enable agile algorithm and system development. As an emerging technology, neuromorphic hardware has not converged to a single commercially available platform, so a large fraction of neuromorphic research explores algorithmic advancement on conventional systems which may not be optimal for performance. NeuroBench therefore consists of an algorithm track for hardware-independent evaluation and a system track for fully deployed solutions. The algorithm track defines four novel benchmarks for neuromorphic methods across diverse domains, namely few-shot continual learning, computer vision, motor cortical decoding, and chaotic forecasting, and utilizes complexity metrics to analyze solution costs. Such hardware-independent benchmarking enables algorithmic exploration and prototyping, especially when simulating algorithm execution on non-neuromorphic platforms. Meanwhile, the system track defines standard protocols to measure the real-world speed and efficiency of neuromorphic hardware on benchmarks ranging from standard machine learning tasks to promising fields for neuromorphic systems, such as optimization. Both the algorithm and system tracks will be extended and co-developed as NeuroBench continues to expand.

The following Results section organizes descriptions of the algorithm track benchmark framework and its baseline results, as well as specifications of the system track benchmark framework and tasks. Further details regarding the benchmark metric formulations, task specifications, and baseline solutions can be found in the Methods section. Baseline results on NeuroBench benchmarks outline unexplored research opportunities in optimizing algorithmic architectures and training of sparse, stateful models to achieve greater performance and resource efficiency. As NeuroBench is intended to continually grow over time, the latest developments and opportunities to engage with the project are reported on the website (https://neurobench.ai).

Results

The complete NeuroBench framework is shown in Figure 1. It includes two tracks with defined datasets, metrics, and modular evaluation components to enable flexible development. The algorithm track focuses on hardware-independent algorithm prototyping to identify promising methods. These in turn inform system design by highlighting target algorithms for optimization and relevant system workloads for benchmarking. The system track enables optimization and evaluation of performant implementations, providing feedback to refine algorithmic complexity modeling and analysis. The interplay between the tracks creates a virtuous cycle: algorithm innovations guide system implementation, while system-level insights accelerate further algorithmic progress. This approach allows NeuroBench to advance neuromorphic algorithm-system co-design.

In the next few sections, we describe the algorithm track, including general complexity metric definitions, benchmark tasks, and common infrastructure tooling. We apply the framework to report baseline results for each algorithm benchmark. Then, we specify protocols and tasks established in the system track to assess deployed neuromorphic performance across promising application workloads. By outlining both tracks, we provide a roadmap towards standardizing benchmark procedures in both hardware-independent and hardware-dependent settings.

Algorithm Track Benchmark Framework

The algorithm benchmark track aims to evaluate algorithms in a system-independent manner, separating algorithm performance from specific implementation details. The implementation platform can thus be ill-matched to the particular algorithm benchmark that it executes (e.g., SNN execution via dense matrix multiplication on a GPU), and the algorithm complexity and expected performance can be examined in a theoretical manner, motivating agile prototyping and functional analysis. Furthermore, minimal assumptions are made about the solutions tested, promoting inclusion of diverse algorithmic approaches.

The framework, as illustrated in Figure 2, is composed of inclusively-defined benchmark metrics, datasets and data loaders, and common harness infrastructure, shown in red. The metrics focus on assessing algorithm correctness on specific tasks as well as capturing general metrics that reflect the architectural complexity, computational demands, and storage requirements of the models. The datasets and data loaders specify the details of the tasks used for evaluation and ensure consistency across benchmarks. Finally, the harness infrastructure automates runtime execution and result output for the algorithm benchmark specified by the input interface, which consists of the user’s model and customizable components for data processing and desired metrics, shown in green and orange.

Algorithm Track Metrics

The algorithm track establishes solution-agnostic primary metrics which are generally relevant to all types of solutions, including artificial and spiking neural networks (ANNs, SNNs). Firstly, there are correctness metrics, which measure the quality of the model predictions on the particular task, such as accuracy, mean average precision (mAP), and mean-squared error (MSE). The correctness metrics are specified per task for each benchmark. Next, there are complexity metrics, which measure the computational demands of the algorithm. In the first iteration of the NeuroBench algorithm track, we assume a digital, time-stepped execution of the algorithm and define the following complexity metrics:

• Footprint – A measure of the memory footprint, in bytes, required to represent a model, which reflects quantization, parameters, and buffering requirements. The metric summarizes (and can be further broken down into) synaptic weight count, weight precision, trainable neuron parameters, data buffers, etc. Zero weights are included, as they are distinguished in the connection sparsity metric.

• Model Execution Rate – Execution rate, in Hz, of the model computation based on forward inference passes per second, measured in the time-stepped simulation timescale. The time is correlated to real-world data time. For example, if a model is designed to process data from an event camera with 50 ms input stride, the model execution rate is 20 Hz. This metric provides intuition into the deployed real-time responsiveness of a model, as well as its computational requirements.

• Connection Sparsity – For a given model, the connection sparsity is the number of zero weights divided by the total number of weights, accumulated over all layers. 0 refers to no sparsity (fully connected) and 1 refers to full sparsity (no connections). This metric accounts for deliberate pruning and sparse network architectures.

• Activation Sparsity – During execution, the average sparsity of neuron activations over all neurons in all model layers, for all timesteps of all tested samples, where 0 refers to no sparsity (i.e., all neurons are always activated), and 1 refers to the case where all neurons have a zero output.

• Synaptic Operations – Average number of synaptic operations per model execution, based on neuron activations and the associated fanout synapses. This metric is further subdivided into dense, effective multiply-accumulate, and effective accumulate synaptic operations (Dense, Eff_MACs, Eff_ACs). Dense accounts for all zero and nonzero neuron activations and synaptic connections, and reflects the number of operations necessary on hardware that does not support sparsity. Eff_MACs and Eff_ACs only count effective synaptic operations by disregarding zero activations (e.g., produced by the ReLU function in an ANN or no spike in an SNN) and zero connections, thus reflecting operation cost on sparsity-aware hardware. Synaptic operations with non-binary activation are considered multiply-accumulates (MACs), while those with binary activation are considered accumulates (ACs).
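As a rough illustration of the two metrics above that depend only on the model, the following minimal sketch computes connection sparsity and a weights-only footprint for a toy model stored as nested Python lists. The function names, the toy weights, and the 4-byte weight precision are assumptions made for this example and are not taken from the NeuroBench harness; a full footprint would also account for neuron parameters and buffers.

```python
from typing import List

Matrix = List[List[float]]  # one layer's weights, indexed as [output][input]


def connection_sparsity(layers: List[Matrix]) -> float:
    """Zero weights divided by total weights, accumulated over all layers."""
    total = sum(len(row) for weights in layers for row in weights)
    zeros = sum(1 for weights in layers for row in weights for w in row if w == 0)
    return zeros / total if total else 0.0


def footprint_bytes(layers: List[Matrix], bytes_per_weight: int = 4,
                    buffer_bytes: int = 0) -> int:
    """Weights-only memory footprint; zero weights are counted as well."""
    n_weights = sum(len(row) for weights in layers for row in weights)
    return n_weights * bytes_per_weight + buffer_bytes


# Toy two-layer model (hypothetical values, 4-byte weights assumed).
layers = [
    [[0.5, 0.0, -1.0], [0.0, 0.0, 2.0]],  # 2 x 3 layer with three zero weights
    [[1.0, 0.0]],                         # 1 x 2 layer with one zero weight
]

print(connection_sparsity(layers))  # 4 zeros out of 8 weights -> 0.5
print(footprint_bytes(layers))      # 8 weights * 4 bytes -> 32
```

Counting zero weights in the footprint while reporting them separately as sparsity keeps the two quantities independent, so a reader can judge how much a sparsity-aware storage format could save.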

Footprint and connection sparsity are classified as static metrics, which can be analytically determined from the model only. Activation sparsity, synaptic operations, and correctness are classified as workload metrics, which are dependent on execution or simulation of the model based on the benchmark data. Model execution rate is an exception, as it is a feature of the algorithm which neither needs to be calculated nor extracted from the model or its outputs, and thus is reported directly by the solution designer in benchmark results.
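The workload metrics can be illustrated in a similar way. The sketch below assumes that, for one model execution, each neuron layer's output activations have been recorded together with the weight matrix of its fanout synapses. It classifies effective operations per layer (ACs if all activations of that layer are binary, MACs otherwise), which is a simplifying assumption of this example; the names and data are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # fanout weights, indexed as [postsynaptic][presynaptic]


def activation_sparsity(activations: List[Vector]) -> float:
    """Fraction of zero outputs over all recorded neuron activations."""
    total = sum(len(a) for a in activations)
    zeros = sum(1 for a in activations for x in a if x == 0)
    return zeros / total if total else 0.0


def synaptic_ops(traces: List[Tuple[Vector, Matrix]]) -> Dict[str, int]:
    """Count Dense, Eff_MACs and Eff_ACs for one model execution."""
    dense = eff_macs = eff_acs = 0
    for pre, weights in traces:
        # Dense: every activation times every fanout synapse, zeros included.
        dense += len(pre) * len(weights)
        # Effective: only nonzero activation meeting a nonzero weight.
        effective = sum(
            1 for row in weights for x, w in zip(pre, row) if x != 0 and w != 0
        )
        if all(x in (0, 1) for x in pre):
            eff_acs += effective   # binary (spiking) activations
        else:
            eff_macs += effective  # graded (e.g., ReLU) activations
    return {"Dense": dense, "Eff_MACs": eff_macs, "Eff_ACs": eff_acs}


# Hypothetical trace: a ReLU layer followed by a spiking layer.
relu_out = [0.5, 0.0, 2.0]
spike_out = [1, 0]
traces = [
    (relu_out, [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]),  # 3 inputs -> 2 outputs
    (spike_out, [[1.0, 1.0]]),                       # 2 inputs -> 1 output
]

print(activation_sparsity([relu_out, spike_out]))  # 2 zeros of 5 -> 0.4
print(synaptic_ops(traces))  # Dense 8, Eff_MACs 2, Eff_ACs 1
```

In a benchmark run these counts would be averaged over all model executions of the test set, which is why they cannot be derived from the model alone.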

The complexity metrics are measured independently of the underlying hardware and therefore do not explicitly correlate with post-deployment latency or energy consumption. However, they provide valuable insight into algorithm performance and resource requirements, enabling high-level comparison and facilitating prototyping. For instance, the execution rate and number of synaptic operations can be taken together to estimate the speed and dynamic power of a model deployed to certain hardware, and the footprint and connection sparsity can be used to proxy hardware resource utilization.
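A back-of-the-envelope version of such an estimate is sketched below. It derives the execution rate from the input stride (as in the 50 ms example above) and multiplies it by the effective operation counts and a per-operation energy. The per-operation energies are placeholders chosen only for illustration, not measurements of any hardware, and the linear model ignores memory access, neuron updates, and static power.

```python
def execution_rate_hz(input_stride_ms: float) -> float:
    """Model executions per second of real-world data time."""
    return 1000.0 / input_stride_ms


def ops_per_second(rate_hz: float, ops_per_execution: float) -> float:
    """Sustained operation throughput needed for real-time processing."""
    return rate_hz * ops_per_execution


def dynamic_power_w(rate_hz: float, eff_macs: float, eff_acs: float,
                    energy_per_mac_j: float, energy_per_ac_j: float) -> float:
    """First-order dynamic power: executions/s times energy per execution."""
    energy_per_execution = eff_macs * energy_per_mac_j + eff_acs * energy_per_ac_j
    return rate_hz * energy_per_execution


rate = execution_rate_hz(50.0)  # 50 ms stride -> 20 Hz
throughput = ops_per_second(rate, 1_000_000)

# Placeholder energies (assumptions for illustration only).
power = dynamic_power_w(rate, eff_macs=1_000_000, eff_acs=0,
                        energy_per_mac_j=1e-12, energy_per_ac_j=1e-13)
print(rate, throughput, power)
```

Such an estimate is only a proxy for comparing candidate algorithms early on; deployed speed and energy are the subject of the system track.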

Furthermore, the algorithm track can be extended with solution-specific secondary metrics, which can offer deeper insights by using information specific to particular types of solutions. For example, for algorithms geared towards analog hardware, noise robustness is an important solution-specific metric. In addition, approaches with complex neuron dynamics may warrant measuring the overall complexity of a neuron update (i.e., type and counts of operations necessary to simulate the update), which can be combined with the total number of neuron updates in a model pass to calculate the cost of state updates. Such solution-specific metrics are expected to be community-driven and will be included in future NeuroBench algorithm track releases.

Algorithm Track Benchmarks

The v1.0 iteration of the NeuroBench algorithm track includes four benchmarks for neuromorphic computing research. The benchmarks were chosen by the NeuroBench community to capture key ongoing challenges for neuromorphic algorithm design. The list of tasks highlights features which are relevant to neuromorphic research interests: few-shot continual learning, object detection utilizing the high dynamic range and temporal resolution of event cameras, sensorimotor decoding based on cortical signals, and low-dimensional predictive modeling useful for prototyping resource-constrained networks that are suitable for small mixed-signal systems. Benchmark tasks are listed below and summarized in Table 1. Detailed specifications of benchmark tasks are provided in the Methods section.


Task	Dataset	Correctness metric	Task description
Keyword FSCIL	MSWC [21]	Accuracy	Few-shot, continual learning of keyword classes.
Event Camera Object Detection	Prophesee 1MP Automotive [22]	COCO mAP	Detecting automotive objects from event camera video.
NHP Motor Prediction	Primate Reaching [23]	R²	Predicting fingertip velocity from cortical recordings.
Chaotic Function Prediction	Mackey-Glass time series [24]	sMAPE	Autoregressive modeling of chaotic functions.

Table 1: NeuroBench algorithm track v1.0 benchmarks.

• Keyword Few-Shot Class-Incremental Learning (FSCIL) – Learning new tasks from a small number of experiences while retaining knowledge of prior tasks is a hallmark of biological intelligence and a long-standing goal of general AI [25]. A key challenge in particular is to endow edge devices with the ability to adapt to their environments and users. This benchmark thus evaluates the capacity of a model to successively incorporate new keywords over multiple sessions (class-incremental), with only a handful of samples from the new classes to train with (few-shot). The FSCIL task is a recently established benchmark in the computer vision domain [26], but it has not yet been adapted to other data modalities. Aligning with a neuromorphic interest in temporal data modalities, this benchmark introduces a FSCIL task with streaming audio data using the large Multilingual Spoken Word Corpus (MSWC) [21] keyword classification dataset. The task is designed to be approached in two phases: pre-training and incremental learning. First, for pre-training, a set of 100 words spanning 5 base languages (English, German, Catalan, French, Kinyarwanda) with 500 training samples each is made available to train an initial model. Next, for incremental learning, the model undergoes 10 successive sessions to learn words from 10 new languages (Persian, Spanish, Russian, Welsh, Italian, Basque, Polish, Esperanto, Portuguese, Dutch) in a few-shot learning scenario. Each incremental session adds 10 words of the corresponding session language with only 5 training samples available per word. After each session, the model is tested for classification accuracy on all previously learned classes, including the 100 base pre-training classes and the few-shot-learned classes, therefore evaluating the FSCIL solution on its ability to learn new classes while retaining knowledge about the previously learned ones. Each session learns a new language, for a total knowledge base of 200 keywords by the end of the benchmark.

• Event Camera Object Detection – Object detection is a widely-used computer vision task with applications in robotics, autonomous driving, and surveillance. Such scenarios at the edge may require high energy efficiency and real-time performance, which can be achieved via event-based vision sensors [27]. The event camera object detection benchmark uses the Prophesee 1 Megapixel automotive detection dataset [22], a large labeled object detection dataset with over 15 hours of event camera video from the front of a car driving in various scenarios. Predetermined training, validation, and testing splits include 11.2 h, 2.2 h, and 2.2 h of recording, respectively. Pedestrian, two-wheeler, and car object classes are used in evaluation, and correctness is measured using COCO mean average precision (mAP) [28].

• Non-human Primate (NHP) Motor Prediction – Studying models which can accurately replicate features of biological computation presents opportunities in understanding sensorimotor behavior and developing closed-loop methods for future robotic agents. It is also foundational to the development of wearable or implantable neuro-prosthetic devices that can accurately generate motor activity from neural or muscle signals. This benchmark utilizes a dataset consisting of multi-channel recordings from the sensorimotor cortex of two non-human primates (NHP Indy and NHP Loco) during reaching movements, along with corresponding fingertip motion of the reach [23]. Six total sessions are included from the dataset, for a total of 8712 seconds of data. The task is to train a model to predict the two-dimensional components of finger velocity using recent neural data. The sessions are treated independently (i.e., models are trained separately for each session), and the data is split to allow the first 75% for training and validation and the last 25% for evaluation. Correctness of the predictions is evaluated by the coefficient of determination ($R^{2}$) score against the true finger velocity targets, averaged over all six sessions.

• Chaotic Function Prediction – The real-world data benchmarks presented thus far are high-dimensional and can require large networks to achieve high accuracy, raising challenges for solution types with limited I/O support and network capacity, such as mixed-signal edge prototype solutions. To address this, we include a synthetic benchmark based on prediction of one-dimensional Mackey-Glass time series [24], which can be effectively tackled by smaller networks. Mackey-Glass has been widely adopted as a benchmark for evaluating temporal predictors, including neuromorphic models [29, 30, 31]. The task involves prediction of the next timestep value $f(t+\Delta t)$ given the current timestep value $f(t)$. The model is trained and validated using the first half of the time series, during which the ground truth state $f(t)$ is supplied to the model to predict the next timestep $f^{\prime}(t+\Delta t)$. During the evaluation, the model uses its prior prediction $f^{\prime}(t)$ to generate each next value $f^{\prime}(t+\Delta t)$, autoregressively forecasting the second half of the time series. Correctness is measured using symmetric mean absolute percentage error (sMAPE) of the generated time series against the target time series, a standard metric in forecasting [32]. The benchmark includes a set of 14 Mackey-Glass time series, which vary by the equation parameter $\tau$, the delay constant. Lyapunov time ($L$), the expected predictability timescale for chaos [33], is used as the time unit for each time series. The total length of each series is 20 Lyapunov times, and 75 points are sampled per Lyapunov time ($\Delta t = L/75$).
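The evaluation protocol of the chaotic function prediction benchmark can be sketched as follows. The sMAPE shown here is one common variant, bounded between 0 and 200; the formulation actually used by the benchmark is the one given in the Methods section. The one-step model is a hypothetical stand-in (any callable mapping the current value to the next), and the demonstration series is a toy sequence, not a Mackey-Glass series.

```python
from typing import Callable, List

LYAPUNOV_TIMES = 20        # total series length, in Lyapunov times
POINTS_PER_LYAPUNOV = 75   # sampling density, i.e. delta_t = L / 75
N_POINTS = LYAPUNOV_TIMES * POINTS_PER_LYAPUNOV  # 1500 samples per series
N_TRAIN = N_POINTS // 2                          # first half: train/validate


def smape(targets: List[float], predictions: List[float]) -> float:
    """Symmetric mean absolute percentage error (0 to 200 variant)."""
    terms = []
    for y, p in zip(targets, predictions):
        denom = abs(y) + abs(p)
        terms.append(0.0 if denom == 0 else abs(y - p) / denom)
    return 200.0 * sum(terms) / len(terms)


def autoregressive_forecast(step: Callable[[float], float],
                            seed: float, n_steps: int) -> List[float]:
    """Feed each prediction back as the next input, starting from `seed`."""
    predictions = []
    current = seed
    for _ in range(n_steps):
        current = step(current)
        predictions.append(current)
    return predictions


# Toy demonstration: the series doubles each step, and so does the "model".
series = [2.0 ** k for k in range(8)]
train, test = series[:4], series[4:]
forecast = autoregressive_forecast(lambda x: 2.0 * x, seed=train[-1],
                                   n_steps=len(test))
print(smape(test, forecast))  # perfect forecast -> 0.0
```

Because the model never sees ground truth after the split point, small one-step errors compound over the forecast horizon, which is what makes the second half of a chaotic series a demanding test.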

Algorithm Track Benchmark Harness

The NeuroBench algorithm benchmarks are wrapped in a harness which standardizes the benchmark interfaces. The harness provides benchmark users with a consistent framework for loading data, processing data and model outputs, and calculating and reporting metrics, thereby ensuring fair and standard comparisons of the results. It is built with straightforward interfaces which are designed to be extended with new frameworks, algorithms, and tasks. The benchmark harness is open-source for use and development (https://github.com/NeuroBench/neurobench).

The components of the algorithm benchmark harness are summarized in Figure 2. Datasets are loaded in a common format and pass through Processors to be pre-processed. The Model generates predictions based on the processed data, and Accumulators post-process the predictions, for instance to accumulate spikes and transform to labels. Static metrics of algorithm footprint and connection sparsity are calculated via model analysis, while metrics of correctness, activation sparsity, and synaptic operations are calculated using predictions and model execution traces. For benchmark users, task evaluation simply involves utilizing the existing dataloaders, processors, and metrics within the harness and wrapping their own code to fit the standard interfaces.
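The data flow described above can be pictured with a small, framework-free sketch. All names below (`run_benchmark`, `binarize`, `identity_model`, `spike_count_argmax`) are hypothetical and chosen for this illustration only; they are not the interfaces of the NeuroBench harness, which is built on PyTorch and should be consulted directly for its actual API.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Raster = List[List[int]]     # spikes indexed as [timestep][channel]
Sample = Tuple[Raster, int]  # (input raster, class label)


def binarize(raster: Raster) -> Raster:
    """Processor: clip event counts to binary spikes."""
    return [[1 if x > 0 else 0 for x in step] for step in raster]


def identity_model(raster: Raster) -> Raster:
    """Model stand-in: passes spikes through, one output channel per class."""
    return raster


def spike_count_argmax(output: Raster) -> int:
    """Accumulator: sum output spikes over time and pick the busiest class."""
    counts = [sum(channel) for channel in zip(*output)]
    return counts.index(max(counts))


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Correctness metric: fraction of predictions matching the labels."""
    return sum(1 for p, y in zip(predictions, labels) if p == y) / len(labels)


def run_benchmark(dataset: Sequence[Sample],
                  processors: Sequence[Callable],
                  model: Callable,
                  accumulators: Sequence[Callable],
                  metrics: Dict[str, Callable]) -> Dict[str, float]:
    """Dataset -> processors -> model -> accumulators -> metrics."""
    predictions, labels = [], []
    for data, label in dataset:
        for process in processors:
            data = process(data)
        output = model(data)
        for accumulate in accumulators:
            output = accumulate(output)
        predictions.append(output)
        labels.append(label)
    return {name: metric(predictions, labels) for name, metric in metrics.items()}


dataset = [
    ([[2, 0], [1, 0], [0, 1]], 0),  # mostly channel 0 active
    ([[0, 1], [0, 3], [1, 0]], 1),  # mostly channel 1 active
]
results = run_benchmark(dataset, [binarize], identity_model,
                        [spike_count_argmax], {"accuracy": accuracy})
print(results)  # {'accuracy': 1.0}
```

Keeping processors, model, accumulators, and metrics as separate interchangeable pieces is what allows a user to swap in their own model while reusing the benchmark's data pipeline and metric definitions unchanged.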

Currently, the harness and all baseline models are built using PyTorch [34] or frameworks based on it, such as snnTorch [18] and SpikingJelly [35]. Due to its modular structure and simple interfaces, the harness can grow to be compatible with further neuromorphic tools such as Lava [36] and Fugu [37]. Furthermore, it also supports the extension of data and metric pipelines in order to implement additional benchmark tasks. Benchmarks outside of the NeuroBench v1.0 suite can make use of the harness infrastructure for open reproducibility, and also to garner interest in the community towards inclusion in a future NeuroBench version, through which the task will have long-term support and appear in affiliated leaderboards and workshop events.

Algorithm Track Limitations and Further Extensions

Before diving into the baseline results, it is worth discussing several possible improvements to the NeuroBench algorithm track framework in its current form. Specifically, the initial iteration of metrics is restricted to the assumption of digital, time-stepped algorithm execution. While complexity analysis of such prototypes can serve as an intermediate step for solutions intended for analog or continuous time deployment, the metric measurements are not yet defined for those execution settings. Informed by further benchmark implementations, future versions of NeuroBench will extend inclusiveness by expanding measurement protocols to include such algorithms.

Furthermore, the synaptic operations metric, intended to capture model computation cost, currently does not account for neuron updates. The dynamics of neuron models, including mechanisms like leakage and reset, can vary heavily in complexity. However, counting the number and type of operations from neuron updates, as well as estimating their overall costs, depends on the specific arithmetic or circuit implementation. Thus, they are not accounted for in the broader algorithmic complexity metrics. Solution-specific metrics that assume a particular implementation platform, as have been defined previously [38], can be used to estimate neuron update costs. These estimates can then be combined with the total number of neuron updates per model computation to measure overall neuron operation complexity during evaluation.

Data pre- and post-processing can also amount to significant costs not yet captured in the NeuroBench algorithm track metrics. Such costs are, however, captured in the deployed metrics of the system track, which accounts for data processing hardware as part of the overall system during performance and efficiency measurements. Data processing metrics will be added as a separate complexity category for the algorithm track benchmark in the future.

The v1.0 algorithm track benchmark suite is also intended to expand in the future. This could include covering further data modalities such as inertial measurement unit (IMU) sensing [39] and extending to closed-loop sensorimotor tasks to demonstrate embodied intelligence. As with the initial benchmarks, further tasks will undergo approval and development by the open NeuroBench community before being included in a future versioned benchmark suite.

Algorithm Track Baseline Results

In our first iteration of the algorithm track, we report baseline algorithm performance on each benchmark using various model architectures, including artificial neural networks commonly used in deep learning, spiking neural networks, and reservoir networks. We evaluate each benchmark with two substantially different algorithm baselines. From these evaluations, we extract baseline comparisons, identify trends, and uncover motivations for future research. Except for the event camera object detection task, each benchmark utilizes a novel data split, and all tasks use novel metric measurement. The presented baselines are a snapshot of the solution search space and will be starting points for leaderboards, thereby calling for further research to push the state of the art for each task. Detailed specifications of each of the baselines can be found in the Methods section.

Keyword FSCIL

| Baseline | Accuracy (Base / Session Avg) | Footprint (bytes) | Model Exec. Rate (Hz) | Connection Sparsity | Activation Sparsity | SynOps per model exec. (Dense) | SynOps per model exec. (Eff_MACs) | SynOps per model exec. (Eff_ACs) |
|---|---|---|---|---|---|---|---|---|
| M5 ANN | 97.09% / 89.27% | $6.03\times 10^{6}$ | 1 | 0.0 | 0.783 | $2.59\times 10^{7}$ | $7.85\times 10^{6}$ | 0 |
| SNN | 93.48% / 75.27% | $1.36\times 10^{7}$ | 200 | 0.0 | 0.916 | $3.39\times 10^{6}$ | 0 | $3.65\times 10^{5}$ |

Table 2: Baseline results for the keyword few-shot class-incremental learning task. Base accuracy refers to accuracy on the 100 base classes after pre-training, while session average accuracy is the average accuracy over all sessions for the corresponding prototypical baseline. The detailed accuracy per session for the different baselines is shown in Figure 3.

The keyword FSCIL task has an ANN and SNN baseline, using different model architectures:

• M5 ANN – The ANN baseline uses a tuned version of the M5 deep convolutional network architecture [40], with samples pre-processed into Mel-frequency cepstral coefficients (MFCC). The network contains four successive convolution-normalization-pooling layers, followed by a readout fully-connected layer. Each model execution (forward pass) uses the data from the full pre-processed sample, and convolution kernels are applied over the temporal dimension of the samples. This is reported as a 1 Hz model execution rate.

• SNN – The SNN baseline uses a recurrent SNN with adaptive leaky integrate-and-fire (LIF) neurons and heterogeneous time constants [41]. The SNN consists of two recurrent adaptive LIF layers and one linear output layer. Audio samples are pre-processed to binary spike trains using Speech2Spikes [42], which relies on a Mel Spectrogram with the same parameters as the MFCC of the ANN baseline. Each input timestep to the model represents 5 ms of audio data, thus the model has a 200 Hz model execution rate. Output neuron activations are summed over time to produce the word class prediction.

After pre-training using standard batched training, the ANN and SNN baseline networks reach high accuracies on the base classes of 97.09% and 93.48%, respectively. As reported by the model execution rate metric, the SNN baseline computes each sample over 200 passes, using an order of magnitude fewer effective AC synaptic operations compared to the ANN baseline’s effective MACs per model execution. Considering both the model execution rate and synaptic operation metrics, the number of aggregated ACs over the length of the sample ( $200\times 3.65\times 10^{5}=7.30\times 10^{7}$ ) exceeds the Dense and effective MAC operations necessary for the ANN baseline, which spatially flattens the sample and processes it in one model execution. However, outside of the static-length keyword classification scenario, the low-cost per-execution temporal processing of SNNs can enable efficient, always-on, high-frequency prediction capabilities in deployed continuous audio recognition scenarios.
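The comparison above can be reproduced as a short calculation from the Table 2 values. The variable names are illustrative only; the numbers are those reported for the two keyword FSCIL baselines.

```python
# Values reported in Table 2 (per model execution).
ann_dense = 2.59e7        # ANN dense synaptic operations
ann_eff_macs = 7.85e6     # ANN effective MACs
snn_eff_acs = 3.65e5      # SNN effective ACs

# Each SNN timestep covers 5 ms of audio, which gives the execution rate.
timestep_ms = 5
snn_rate_hz = 1000 / timestep_ms          # 200 executions per second
passes_per_sample = 200                   # passes per sample, as stated in the text

# Per-execution comparison: the SNN is roughly 20x cheaper per pass.
per_exec_ratio = ann_eff_macs / snn_eff_acs

# Aggregated over the whole sample, the SNN total exceeds both ANN counts.
snn_aggregated_acs = passes_per_sample * snn_eff_acs

print(f"SNN execution rate: {snn_rate_hz:.0f} Hz")
print(f"ANN eff. MACs / SNN eff. ACs per execution: {per_exec_ratio:.1f}")
print(f"SNN aggregated ACs per sample: {snn_aggregated_acs:.2e}")
print(f"Exceeds ANN dense ops: {snn_aggregated_acs > ann_dense}")
```

Note that this count compares accumulates against multiply-accumulates, so the operation totals alone do not determine which baseline is cheaper on a given hardware platform.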

We present two approaches for the incremental stage for both the ANN and SNN baselines. The frozen models are locked after pre-training on base classes and have 0% accuracy on all new incremental classes, providing a reference for models with no learning or catastrophic forgetting of prior classes. The prototypical models employ a prototypical network [43] for incremental learning, which is a feature-based clustering approach that can be implemented with a simple linear readout layer on top of the pre-trained network backbone. Prototypical weights and biases of prior and incremental classes are directly defined based on the average features of the corresponding class and directly substitute pre-trained readout layer parameters. The complexity results in Table 2 thus empirically apply to both the frozen and prototypical models.
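A minimal sketch of such a prototypical readout follows, assuming the standard nearest-prototype rule under squared Euclidean distance, which is equivalent to a linear layer with weights $2\mu_c$ and bias $-\lVert\mu_c\rVert^{2}$ for each class prototype $\mu_c$. The exact scaling used by the baselines is not specified in this section, and the function names and toy feature vectors are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def class_prototype(features: List[Vector]) -> Vector:
    """Prototype = element-wise mean of the backbone features of one class."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def prototype_readout(support: Dict[int, List[Vector]]) -> Dict[int, Tuple[Vector, float]]:
    """Build linear readout parameters (weight, bias) per class from prototypes.

    Assumption: score_c(f) = 2 * mu_c . f - ||mu_c||^2, which ranks classes
    exactly as the negative squared Euclidean distance to each prototype.
    """
    readout = {}
    for label, feats in support.items():
        mu = class_prototype(feats)
        weight = [2.0 * m for m in mu]
        bias = -sum(m * m for m in mu)
        readout[label] = (weight, bias)
    return readout


def predict(readout: Dict[int, Tuple[Vector, float]], feature: Vector) -> int:
    def score(label: int) -> float:
        weight, bias = readout[label]
        return sum(w * f for w, f in zip(weight, feature)) + bias
    return max(readout, key=score)


# Base session: prototypes for two base classes replace the trained readout.
readout = prototype_readout({
    0: [[1.0, 0.0], [0.8, 0.2]],
    1: [[0.0, 1.0], [0.2, 0.8]],
})

# Incremental session: a new class is added from a few shots, with no
# gradient updates and no change to the rows of the existing classes.
readout.update(prototype_readout({2: [[1.0, 1.0], [0.9, 1.1]]}))

print(predict(readout, [1.0, 0.9]))  # closest to the new class prototype
print(predict(readout, [0.9, 0.0]))  # closest to base class 0
```

Because adding a class only appends one row to the readout, the backbone and the per-execution operation counts are essentially unchanged, which is consistent with the complexity results being shared between the frozen and prototypical models.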

The test accuracy for the baseline models over all sessions, as well as the test accuracy on only the new incrementally-learned classes, are shown in Figure 3. Using prototypical networks, the ANN model reaches 89.27% accuracy on average over all sessions, demonstrating significantly greater performance, by 21.41 accuracy points, than the frozen model. The accuracy on new classes, averaged over all incremental sessions, is 79.61%. The SNN prototypical baseline, on the other hand, reaches 75.27% accuracy on average over all sessions, surpassing the frozen SNN performance by 9.97 accuracy points, with an average accuracy on new classes over all sessions of 57.23%.

The accuracy loss over the incremental sessions is similar between the ANN and SNN prototypical baselines. However, the lower overall accuracy of the SNN is largely due to the conversion from the original backpropagation-trained readout classifier, which is used in the frozen baseline, to the prototype readout classifier. On the base classes (session 0 in Figure 3), the ANN sees a drop of 2.37% between the frozen and prototypical baselines, while the SNN has a larger drop of 9.17%. The larger drop indicates that our particular SNN baseline has a less general feature extraction than the ANN. This may be due to the challenges of backpropagation through time for online temporal inference to learn to extract long-term temporal keyword features with the chosen spiking recurrent model. Additionally, the Speech2Spikes [42] pre-processing algorithm converting audio to spikes may also cause information loss. Overall, the keyword FSCIL benchmark presents opportunities for further research in learning methods, preprocessing, and model architectures for continual learning of temporal data.

Event Camera Object Detection

| Baseline | mAP | Footprint (bytes) | Model Exec. Rate (Hz) | Connection Sparsity | Activation Sparsity | SynOps per model exec. (Dense) | SynOps per model exec. (Eff_MACs) | SynOps per model exec. (Eff_ACs) |
|---|---|---|---|---|---|---|---|---|
| RED ANN | 0.429 | $9.13\times 10^{7}$ | 20 | 0.0 | 0.634 | $2.84\times 10^{11}$ | $2.48\times 10^{11}$ | 0 |
| Hybrid | 0.271 | $1.21\times 10^{7}$ | 20 | 0.0 | 0.613 | $9.85\times 10^{10}$ | $3.76\times 10^{10}$ | $5.60\times 10^{8}$ |

Table 3: Baseline results for the event camera object detection task.

The event camera object detection task reports a prior baseline, the RED ANN, and a novel conversion of the architecture to a hybrid ANN-SNN model:

• RED ANN – The RED architecture [22] consists of blocks of feed-forward squeeze-and-excite [44] convolutional layers followed by blocks of recurrent convolution-LSTM (ConvLSTM [45]) layers. A single-shot detection (SSD [46]) head is used to predict the location and class of the bounding box based on multi-scale outputs from the recurrent layers. Raw event data is binned into 50 ms windows and pre-processed into time surfaces.

• Hybrid – The hybrid ANN-SNN architecture adopts feedforward LIF spiking neural layers to replace the ConvLSTM layers in RED, and shares the same feed-forward convolutional blocks as the RED. It uses the same input encoding method and SSD head as the RED model.

Results for the two networks can be found in Table 3. The RED ANN represents the current state-of-the-art correctness on the benchmark, at 0.429 mAP. The Hybrid network is a smaller network, reflected by the footprint and synaptic operations metrics measuring an order of magnitude smaller than for the RED ANN. The smaller size comes at the expense of lower correctness of 0.271 mAP.

For the RED ANN, the activation sparsity metric (0.634) represents zero activations by the ReLU function for each neuron. From this, one may expect that the number of effective operations (operations with a nonzero activation and nonzero weight) would be around 35% of dense operations; however, the actual ratio is 87%. This is due to the presence of normalization layers applied to activations before synaptic weight multiplication. Furthermore, neurons with lower activation frequency in the network tend to have a smaller fanout than neurons with high activation frequency. Thus, while activation sparsity alone can provide a proxy for the cost of the network, architectural characteristics may impede actual computation reduction, and the synaptic operations must be considered in tandem.
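The gap between the two ratios can be checked from the Table 3 values, and the effect of normalization can be illustrated with a toy layer. In the sketch below, mean subtraction stands in for a normalization layer and the six-neuron layer is invented for illustration; neither is taken from the RED architecture.

```python
# Ratios computed from the RED ANN row of Table 3.
dense_ops = 2.84e11
effective_macs_reported = 2.48e11
activation_sparsity = 0.634

naive_expected_ratio = 1.0 - activation_sparsity        # about 0.37
actual_ratio = effective_macs_reported / dense_ops      # about 0.87


def count_effective_macs(activations, weights):
    """Count multiply-accumulates with a nonzero activation and nonzero weight.

    weights[i] holds the outgoing weights (fan-out) of input neuron i.
    """
    return sum(
        1
        for a, row in zip(activations, weights)
        for w in row
        if a != 0.0 and w != 0.0
    )


# Toy layer: 6 ReLU outputs, 4 of them zero, each with a fan-out of 2.
relu_out = [0.0, 0.0, 0.0, 2.0, 4.0, 0.0]
weights = [[1.0, -1.0] for _ in relu_out]

# Stand-in for normalization: subtracting the mean makes the zeros nonzero.
mean = sum(relu_out) / len(relu_out)
normalized = [a - mean for a in relu_out]

macs_without_norm = count_effective_macs(relu_out, weights)
macs_with_norm = count_effective_macs(normalized, weights)

print(f"naive expected ratio: {naive_expected_ratio:.3f}")
print(f"actual ratio:         {actual_ratio:.3f}")
print(f"toy layer effective MACs: {macs_without_norm} -> {macs_with_norm}")
```

In the toy layer, sparsity measured at the ReLU output is 4/6, yet every multiplication downstream of the normalization step becomes effective, which is the mechanism the paragraph describes.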

The Hybrid network demonstrates a significant reduction in total effective operations against dense operations, outlining significant gains if deployed on specialized sparsity-aware hardware. However, for the particular network, the number of effective ACs, generated by the spiking neuron components, is two orders of magnitude smaller than the number of effective MACs within the ANN components. Such a hybrid network may not warrant specialized accumulation units, and the baseline motivates further research in hybrid networks with a larger proportion of spiking neuron activity compared to artificial neuron activity.

NHP Motor Prediction

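| Baseline | NHP | $R^{2}$ | Footprint (bytes) | Model Exec. Rate (Hz) | Connection Sparsity | Activation Sparsity | SynOps per model exec. (Dense) | SynOps per model exec. (Eff_MACs) | SynOps per model exec. (Eff_ACs) |
|---|---|---|---|---|---|---|---|---|---|
| ANN | Indy | 0.593 | 20824 | 250 | 0.0 | 0.683 | 4704 | 3836 | 0 |
| ANN | Loco | 0.558 | 33496 | 250 | 0.0 | 0.668 | 7776 | 6103 | 0 |
| SNN | Indy | 0.593 | 19648 | 250 | 0.0 | 0.997 | 4900 | 0 | 276 |
| SNN | Loco | 0.568 | 38848 | 250 | 0.0 | 0.999 | 9700 | 0 | 551 |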

Table 4: Baseline results for the NHP motor prediction task, for NHP Indy (96-channel data, top), and NHP Loco (192-channel data, bottom).

Small fully-connected, feedforward networks were developed for the NHP motor prediction baselines:

• ANN – In the ANN baseline, the cortical activity from the 50 most recent data samples is buffered to be used as network input. The network has two hidden layers and two final outputs predicting $X$ and $Y$ velocities, with a fully-connected topology of $N_{ch}$ -32-48-2, where $N_{ch}$ refers to the channels of cortical data (96 for NHP Indy, and 192 for NHP Loco). Batch normalization is applied after each hidden layer.

• SNN – The SNN uses the data samples directly as input to the network, without buffering. It has a hidden layer of 50 LIF neurons, for a fully connected topology of $N_{ch}$ -50-2 LIF neurons. The output neurons do not have a reset mechanism, and the membrane potential is directly read to produce the output velocities.
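For fully-connected feedforward topologies like these, the dense synaptic operation count per model execution is the sum of products of consecutive layer sizes. The helper below is an illustrative sketch (the name `dense_synops` is hypothetical); applied to the stated topologies it gives the same values as the Dense column of Table 4.

```python
from typing import Dict, Sequence


def dense_synops(layer_sizes: Sequence[int]) -> int:
    """Dense synaptic operations for one pass through a fully-connected net.

    Each pair of consecutive layers contributes (fan-in x fan-out) operations;
    biases and neuron updates are not counted.
    """
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


channels: Dict[str, int] = {"Indy": 96, "Loco": 192}

for nhp, n_ch in channels.items():
    ann_ops = dense_synops([n_ch, 32, 48, 2])   # N_ch-32-48-2
    snn_ops = dense_synops([n_ch, 50, 2])       # N_ch-50-2
    print(f"{nhp}: ANN dense = {ann_ops}, SNN dense = {snn_ops}")
```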

Table 4 shows the results for the ANN and SNN baselines, averaged between sessions from each NHP (Indy and Loco). The ANN and SNN are similar in footprint size and number of dense operations per model forward pass, and also reach comparable prediction quality based on $R^{2}$ score. Each model is small in footprint and operation count, demonstrating that this task can be solved by shallow edge networks, validating prior studies [47].

Between the baselines, the SNN realizes similar correctness at significantly reduced complexity compared to the ANN. Extremely high activation sparsity in the SNN (0.998) directly translates to low effective accumulate operations, demonstrating the adequacy of stateful, binary-activation neuron models for sparse regression tasks. Meanwhile, similarly to the RED ANN in the event camera object detection task, activation sparsity in the ANN baseline does not translate to effective operation efficiency, as batch normalization is applied to activations before multiplication with synaptic weights.

We conduct further exploration for increasing task accuracies with more complex ANN and SNN models: ANN_Flat and SNN_Flat. For these networks, 50 data samples of buffered input are split into $n_{p}=7$ accumulated bins. For ANN_Flat, the 7 bins are spatially flattened as input to the network, so its topology is ( $7\times N_{ch}$ )-32-48-2. SNN_Flat uses the $N_{ch}$ -32-48-2 topology, and the 7 bins are temporally flattened as input, presented to the network as separate input timesteps. Each prediction still uses the membrane potential of the output neurons after input timesteps, and the network is reset for each prediction. Layer normalization is also applied on the SNN_Flat inputs.

Figure 4 shows plots of complexity and predictive quality of all four baseline networks. Both flattened networks demonstrate significantly greater $R^{2}$ performance than the other two networks. However, the larger input dimension of the ANN_Flat network is reflected in its greater footprint, and the increased model timesteps and layer normalization sharply increase the effective operations of SNN_Flat by two orders of magnitude compared to the simpler SNN. Thus, while input flattening and normalization increase the quality of model predictions for ANNs and SNNs, each comes with a significant complexity trade-off.

Chaotic Function Prediction

| Baseline | sMAPE | Footprint (bytes) | Model Exec. Rate (Hz) | Connection Sparsity | Activation Sparsity | SynOps per model exec. (Dense) | SynOps per model exec. (Eff_MACs) | SynOps per model exec. (Eff_ACs) |
|---|---|---|---|---|---|---|---|---|
| ESN | 14.79 | $2.81\times 10^{5}$ | - | 0.876 | 0.0 | $3.52\times 10^{4}$ | $4.37\times 10^{3}$ | 0 |
| LSTM | 13.37 | $4.90\times 10^{5}$ | - | 0.0 | 0.530 | $6.03\times 10^{4}$ | $6.03\times 10^{4}$ | 0 |

Table 5: Baseline results for the chaotic function prediction task. Execution rate is not reported as the data is a synthetic time series, with no real-time correlation.

The chaotic function prediction task has two recurrent ANN baselines, which feature distinct network architectures:

• Long short-term memory (LSTM) – LSTMs are a class of recurrent ANN architectures [48], utilizing multiple gates for selective retention or omission of past information. The LSTM baseline consists of a single LSTM with a hidden state of 100 neurons, followed by a feed-forward layer to produce single-dimension output predictions. In addition, the LSTM baseline utilizes explicit memory by buffering 50 previous datapoints, spatially flattening them into 50 input channels.

• Echo state network (ESN) – ESNs are randomized recurrent ANNs that belong to a class of algorithms known collectively as reservoir computing [49], featuring more biologically-inspired principles than LSTMs despite not being spiking networks. Standard ESNs have only one hidden layer (the reservoir), where synaptic connections projecting input data to the hidden layer and recurrent synaptic connections within the hidden layer are chosen randomly and stay fixed during the training. The model architecture for the ESN baseline has two neurons in the input layer, which projects the Mackey-Glass function input and additional constant bias input into a hidden layer of 186 neurons. Within the hidden layer, the probability of recurrent connections is set to 0.11.

The LSTM and ESN models were evaluated on a Mackey-Glass time series with $\tau=17$ . Each model is evaluated over 30 instantiations of the system; in each instance the start point is shifted forward by half of the Lyapunov time. Each model is re-initialized and re-trained on each instance, and the results are averaged over all 30 instances.
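For readers unfamiliar with the task, the sketch below generates a Mackey-Glass series and scores a forecast with sMAPE. Several choices here are assumptions rather than the benchmark's definitions: the commonly used parameters $\beta=0.2$, $\gamma=0.1$, $n=10$, a forward-Euler integrator with unit step and constant initial history, and the sMAPE variant bounded in $[0, 200]$. The benchmark's own generator and metric are specified in the Methods and in the harness.

```python
from typing import List, Sequence


def mackey_glass(n_steps: int, tau: int = 17, beta: float = 0.2,
                 gamma: float = 0.1, n: int = 10, dt: float = 1.0,
                 x0: float = 1.2) -> List[float]:
    """Forward-Euler integration of dx/dt = beta*x(t-tau)/(1+x(t-tau)^n) - gamma*x(t).

    Assumed setup: constant history x(t) = x0 for t <= 0.
    """
    delay = int(round(tau / dt))
    history = [x0] * (delay + 1)
    for _ in range(n_steps):
        x_now = history[-1]
        x_delayed = history[-1 - delay]
        dx = beta * x_delayed / (1.0 + x_delayed ** n) - gamma * x_now
        history.append(x_now + dt * dx)
    return history[delay + 1:]


def smape(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Symmetric mean absolute percentage error, assumed range [0, 200]."""
    total = 0.0
    for y, y_hat in zip(targets, predictions):
        denom = abs(y) + abs(y_hat)
        if denom > 0.0:
            total += abs(y - y_hat) / denom
    return 200.0 * total / len(targets)


series = mackey_glass(500, tau=17)

# Naive persistence forecast: predict that the next value equals the current one.
persistence_error = smape(series[1:], series[:-1])
print(f"persistence sMAPE: {persistence_error:.2f}")
```

A persistence forecast of this kind is a convenient sanity check when setting up the task, since any trained model should improve on it over the same horizon.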

The ESN model is architecturally unique compared to the other ANN and SNN baselines. The connection sparsity metric ( $0.876$ ) reflects the high number of zero-weight connections across its reservoir hidden layer. Due to this sparsity, hardware with support for sparse synaptic representation by ignoring zero weights would require less memory to represent the network, thus decreasing the deployed footprint of the model. The high connection sparsity of the ESN leads to a significant reduction in synaptic operations: the ESN uses an order of magnitude fewer effective operations ( $4.37\times 10^{3}$ ) than the LSTM ( $6.03\times 10^{4}$ ), while achieving comparable sMAPE. The activation sparsity of the ESN is 0 due to neurons using $\tanh(\cdot)$ , rather than ReLU activations.
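These figures can be approximated from the stated ESN architecture. The back-of-the-envelope sketch below assumes a single output neuron, fully dense input and output projections, and that the recurrent connection probability of 0.11 gives the expected fraction of nonzero reservoir weights. None of these assumptions is confirmed in this section, so the result is an estimate rather than the harness measurement.

```python
# Stated architecture: 2 inputs, 186 reservoir neurons, recurrent probability 0.11.
n_in, n_hidden = 2, 186
n_out = 1            # assumption: single-dimension prediction
p_recurrent = 0.11

input_weights = n_in * n_hidden
recurrent_weights = n_hidden * n_hidden
output_weights = n_hidden * n_out

# Dense operations count every possible connection once per execution.
dense_ops = input_weights + recurrent_weights + output_weights

# Expected nonzero weights: only the reservoir is sparse under the assumptions.
expected_nonzero = input_weights + p_recurrent * recurrent_weights + output_weights

# With tanh activations (activation sparsity 0), effective MACs are expected
# to track the number of nonzero weights.
connection_sparsity = 1.0 - expected_nonzero / dense_ops

print(f"dense ops:            {dense_ops}")
print(f"expected eff. MACs:   {expected_nonzero:.0f}")
print(f"connection sparsity:  {connection_sparsity:.3f}")
```

Under these assumptions the estimate lands close to the Table 5 values of $3.52\times 10^{4}$ dense operations, $4.37\times 10^{3}$ effective MACs, and 0.876 connection sparsity.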

Furthermore, we show the generalization and robustness capabilities of the particular ESN and LSTM models by applying them, with fixed hyperparameter sets, to other Mackey-Glass time series. Figure 5 shows the sMAPE score of the models over varied time series with the $\tau$ Mackey-Glass parameter varying between 17 and 30. The models were trained independently for each time series. As the Mackey-Glass $\tau$ parameter characterizes the time-delay of the system, its increase roughly corresponds to prediction difficulty, shown by the increasing sMAPE trend through the plot. Notably, the LSTM maintains a lower error than the ESN for all $\tau>18$ . However, the LSTM uses explicit memory via input buffering, so it is conjectured that the historical data allows for greater robustness to the varying time series characteristics. The ESN uses only one previous timestep, so its memory is only implicitly retained within its hidden layer. While the ESN tunes well to the $\tau=17$ case and demonstrates greatly reduced effective operations compared to the LSTM, the same set of hyperparameters does not generalize as well to other time series. These results motivate further research into explicit memory buffers versus implicit memory within the network state, and into the resulting trade-offs in single-series forecasting performance, complexity, and generalization capability.

Discussion and Opportunities for Further Research

Baseline results for the four v1.0 algorithm track tasks compare the correctness and complexities of various solution types. Compared to ANNs, SNNs and ESNs demonstrate complexity advantages such as smaller footprints, high sparsity, and accumulate rather than multiply-and-accumulate operations. Especially on the motor prediction and chaotic function prediction regression tasks, the SNN and ESN baselines already achieve competitive correctness at lower complexity than the ANN and LSTM counterparts. Further research opportunities in model architectures, data pre-processing and buffering, and training paradigms to achieve greater performance are enabled by the standard framework and tooling provided by NeuroBench.

System Track Benchmark Framework

While the algorithm track aims to benchmark solutions in a system-independent manner via complexity analysis, the NeuroBench system track aims to evaluate deployed latency, throughput, and efficiency of systems composed of an algorithm deployed to a hardware platform. In order for the hallmarks of neuromorphic hardware to be aptly judged against conventional systems and foster the expansion of neuromorphic solutions, fair comparisons must be made between sufficiently mature neuromorphic systems and conventional systems solving the same tasks.

A key challenge for benchmarking neuromorphic hardware is that systems are implemented and deployed at vastly different scales to serve diverse applications, from cloud services (e.g. multi-chip platforms like Loihi [50] and SpiNNaker [51]) to embedded sensing intelligence (e.g. Speck [52] and SNP [53, 54]). This range is visualized in Figure 6. Existing benchmarks for conventional machine learning systems separate submissions between datacenter-level computing [19] and embedded processing [55]. Thus, rather than pursuing a one-size-fits-all suite of tasks for neuromorphic systems, the goal of the NeuroBench system track is to develop benchmarks at various scales and use cases, united under a common set of guidelines and measurement methodology, thereby embracing the diversity of neuromorphic approaches.

In this section, we present the system track guidelines outlining the metrics, tasks, and harness components, representing collective design between multiple owners and vendors of neuromorphic hardware. As the system track benchmark tooling is currently under development, insight from the algorithm track’s early results will be exploited towards a future release of the detailed harness documentation, benchmark procedure, and baselines for the system track by the end of 2024.

System Track Metrics

In order to be representative of the properties of a deployed system, the system benchmarks, like the algorithm benchmarks, are assessed at the task level for the overall system, as opposed to operation or kernel level assessment of individual components. Task-level benchmarks enable straightforward comparison between systems of any type with regard to their abilities to solve problems, and the overall system-level measurement describes the realistic capability and efficiency of a whole solution. For ease of comparison and benchmark result analysis, key features of the NeuroBench system track are consistent with features from the widely-adopted machine learning system benchmark MLPerf Inference [19].

Task Scenarios.

Where applicable, the NeuroBench system track will utilize benchmark task scenarios from MLPerf. MLPerf describes task scenarios under which system benchmarks are presented to the system under test (SUT). For large-scale batched processing, MLPerf defines the Offline and Server scenarios, which are latency-unconstrained and latency-constrained, respectively, with performance measured in terms of prediction throughput. For batch-size-1 processing, MLPerf defines Single-stream scenarios, for which performance is measured in terms of prediction latency.

However, the MLPerf task scenarios all consider data as discrete samples, and therefore do not fit with some key neuromorphic system applications. The NeuroBench system track thus defines two additional task scenarios: Real-time and Optimization. Real-time uses continuous data streams (e.g., from an event camera), which must be processed online by the SUT, in contrast with the Single-stream scenario, where samples are sent only once the SUT has completed processing. Optimization applications use heuristic methods for otherwise intractable problems, and thus do not have notions of sample throughput or sample latency. The Optimization scenario benchmarks will report latency for the SUT to reach multiple correctness thresholds, as many approaches find successively improved solutions over time.

For each task and scenario, the following metric categories are reported:

• Correctness – Due to the tight coupling between an algorithm and its system implementation in many existing neuromorphic hardware solutions, the particular model used to solve the benchmark task is unconstrained. Therefore, correctness must be measured to verify the validity of the solution. No correctness thresholds are imposed on submissions, but the benchmark leaderboard will impose tiers of solution correctness on submissions to evaluate accuracy-efficiency trade-offs of system approaches.

• Performance – Depending on the task scenario, the performance of the system is measured differently. The Offline and Server scenarios report throughput, while the Single-stream scenario reports latency. The Real-time scenario similarly reports latency, which for a large portion of execution (e.g., 90%, defined per-benchmark) must not exceed a real-time threshold, or else the submission is not considered valid. The Optimization scenario reports the time to solution at various solution quality thresholds, defined per-benchmark.

• Efficiency – Conventional system benchmarks such as TOP500 [56] for HPC and MLPerf Inference [19] for deep learning do not require power measurement submission in the main benchmark, instead allowing for separate submissions to an adjacent power track (Green500 [57] and MLPerf Power [58], respectively). Not only has efficiency usually been considered a second-order metric for conventional systems, it is also notoriously difficult to measure precisely. However, as energy efficiency is a key hallmark of biology and thus is a focus of neuromorphic research, power and energy consumption must be first-order metrics in the NeuroBench system track.

The Server, Offline, and Real-time scenarios report average power as their batched and online processing is continuous, the Single-stream scenario reports the energy per-sample, and the Optimization scenario reports total energy consumed at each solution quality threshold.

Benchmark submissions may perform separate runs to report performance and power in order to demonstrate system flexibility (e.g., a ‘performance-mode’ run optimal for latency and an ‘efficiency-mode’ run optimal for energy); however, in all runs, both metrics must be reported.

Importantly for the NeuroBench system track, in measuring latency and efficiency, data pre- and post-processing must be taken into account. Neuromorphic methods will often consume and produce non-standard (e.g., event-based) data modalities, the processing of which may consume a significant amount of the overall execution latency and may not be computed on the neuromorphic hardware itself. As many instances of neuromorphic hardware cannot be deployed without such associated processing, it is essential that latency timing and efficiency measurement capture the cost of data processing, which stands in contrast with conventional system benchmarks that measure starting from pre-processed data [59].
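As an illustration of how two of these rules could be checked, the sketch below tests the Real-time validity condition and converts an average power reading into energy per sample. The function names, the 90% default, and the example numbers are hypothetical, and the system track measurement procedure itself is still under development.

```python
from typing import Sequence


def realtime_valid(latencies_s: Sequence[float], threshold_s: float,
                   required_fraction: float = 0.9) -> bool:
    """Real-time validity: a required fraction of latencies must meet the threshold.

    The fraction (e.g., 0.9) and threshold are defined per benchmark; the
    latencies are assumed to include data pre- and post-processing time.
    """
    within = sum(1 for latency in latencies_s if latency <= threshold_s)
    return within / len(latencies_s) >= required_fraction


def energy_per_sample_j(average_power_w: float, duration_s: float,
                        n_samples: int) -> float:
    """Energy per sample from average power over a measured run (E = P * t)."""
    return average_power_w * duration_s / n_samples


# Nine of ten predictions meet a 50 ms threshold, so the run is valid at 90%.
ok_run = [0.01] * 9 + [0.08]
late_run = [0.01] * 8 + [0.08] * 2

print(realtime_valid(ok_run, threshold_s=0.05))
print(realtime_valid(late_run, threshold_s=0.05))
print(energy_per_sample_j(average_power_w=0.5, duration_s=10.0, n_samples=100))
```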

System Track Benchmarks

Benchmarks for the system track will include tasks of interest to neuromorphic systems, from embedded to datacenter scales. The tasks are key application areas for existing systems, and they differ from the tasks in the algorithm track, which are more research-oriented. Towards future iterations of the NeuroBench suite, the algorithm and system tracks are intended to coalesce as both algorithms and systems mature. As the benchmark results identify properties of highly effective algorithms and systems, the algorithm and system tracks will converge to the same selection of tasks that are seen as the most impactful for future progress in the field. Two benchmark specifications for the system track are defined in this article.

• Acoustic Scene Classification – The acoustic scene classification benchmark challenges systems to classify audio into predefined categories based on the environmental audio context. Such capabilities are key for hearable devices, which can utilize them to automatically adjust sound equalisation profiles, appropriately target microphone denoising, and support active noise cancellation. The application further challenges systems to fulfill technical requirements, such as always-on and real-time operation, and time series processing. Acoustic scenes provide a rich repertoire of features that are necessary for prediction, thus this task is a complement to keyword classification, which mainly focuses on shorter-term features (e.g., phonemes) with a relatively smaller feature repertoire.

The benchmark evaluates the classification capabilities of both neuromorphic systems and conventional computing platforms using datasets such as those from the DCASE challenge [60] (if permissible, subject to license). These datasets consist of a myriad of audio recordings from diverse environments, including airports, public parks, and buses, thus providing a comprehensive foundation for testing both application- and system-level performance.

The task will be presented under the Real-time or Single-stream task scenarios, providing a continuous audio stream or sliced samples to the SUT, during which the acoustic scene periodically changes. Classification probability will be sampled to determine the correctness of the prediction. The average power measured during inference is a key indicator of the efficiency and always-on capability of the SUT, and inference latency will be measured in terms of the prediction time relative to the onset of each new acoustic scene.
• QUBO – As an Optimization scenario task, NeuroBench incorporates quadratic unconstrained binary optimization (QUBO). QUBO is a particularly beneficial first optimization task for NeuroBench for two reasons. First, the binary variables are inclusive to neuromorphic systems with purely binary spike communication. Second, real-world QUBO applications typically feature sparse cost matrices [61] which benefit from the sparse synaptic connectivity / matrix multiplication that neuromorphic systems are often optimized for [62]. The initial set of QUBO workloads in NeuroBench searches for the maximum independent (i.e. unconnected) set of nodes in graphs, a task that has wide applications across industry and academia, such as resource allocation in wireless networks, portfolio optimization, and task scheduling.

NeuroBench will provide a QUBO generator that can uniquely specify each workload by three specific parameters provided by the benchmark: the number of graph nodes, the sparsity of graph connections, and a random seed. The generator provides a large dataset for reliable statistics and allows scaling from modest workloads for small-scale and prototype systems to large workloads for larger-scale systems. An independent dataset is provided to tune the hyperparameters of the solver. The SUT is evaluated based on three metrics. The first metric is the maximum supported workload size. The second and third metrics are the time and energy, respectively, required to obtain pre-defined levels of optimality. The solution optimality is measured by the size of the independent set of nodes found by the SUT.
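To make the workload concrete, the sketch below builds a maximum independent set QUBO using the standard penalty formulation, minimizing $-\sum_i x_i + P\sum_{(i,j)\in E} x_i x_j$ with $P>1$, and solves a tiny instance by exhaustive search. The generator here is a hypothetical stand-in parameterized by node count, edge density, and seed; it is not the NeuroBench generator, and brute force only serves to verify the formulation on small graphs.

```python
import itertools
import random
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def generate_graph(n_nodes: int, edge_density: float, seed: int) -> List[Edge]:
    """Hypothetical workload generator: each edge is present with given probability."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)
            if rng.random() < edge_density]


def mis_qubo(n_nodes: int, edges: Sequence[Edge], penalty: float = 2.0) -> List[List[float]]:
    """Upper-triangular QUBO matrix for maximum independent set.

    Diagonal entries reward selecting a node (-1); each edge penalizes
    selecting both endpoints. Any penalty > 1 makes the optimum independent.
    """
    q = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        q[i][i] = -1.0
    for i, j in edges:
        q[i][j] += penalty
    return q


def qubo_cost(q: Sequence[Sequence[float]], x: Sequence[int]) -> float:
    n = len(x)
    return sum(q[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))


def brute_force(q: Sequence[Sequence[float]]) -> Tuple[int, ...]:
    """Exhaustive search over all 2^n assignments (small n only)."""
    return min(itertools.product((0, 1), repeat=len(q)), key=lambda x: qubo_cost(q, x))


# Path graph 0-1-2-3-4: the maximum independent set is {0, 2, 4}.
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
best = brute_force(mis_qubo(5, path_edges))
print(best, qubo_cost(mis_qubo(5, path_edges), best))

# The same (nodes, density, seed) triple always reproduces the same workload.
random_edges = generate_graph(8, edge_density=0.3, seed=0)
random_best = brute_force(mis_qubo(8, random_edges))
print(sum(random_best), "nodes selected in the random instance")
```

The negated optimal cost equals the size of the independent set found, which corresponds to the optimality measure described above.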

System Track Harness

A diagram of the NeuroBench harness with extended support for the system track is shown in Figure 7. Blue boxes in the figure denote hardware-specific infrastructure components, which will connect to the general top-level harness interfaces. Common interfaces within the runtime ensure that the harness can be modularly extended to various system backends.

Within the system runtime, intermediate representations (IR) are used to compile and map models. In ML development, the ONNX IR [63] has been an enabler of cross-platform portability, and IR infrastructure within the harness is likewise an important component of common tooling. A common high-level IR such as the Neuromorphic Intermediate Representation (NIR) [64] or Lava [36] can unify spatio-temporal graphs describing the neuromorphic models, and may be transformed into a more optimized low-level IR that is specific to the target hardware (e.g., sPyNNaker machine graph [65]). The inclusion of shared IRs within the harness allows for benchmarking equivalent models across multiple systems, which will highlight system optimizations for cross-compatibility. The IRs may also be hardware-specific, tightly integrated within the system to maximize performance.

From a top-level API, users will be able to run algorithm benchmarks or system benchmarks through the corresponding runtimes. By providing the runtimes side-by-side, the harness presents a single tool for both algorithm and system benchmarking. As the algorithm and system tracks mature to support the same tasks, the harness will accelerate benchmarking by facilitating quick prototype complexity analysis followed by deployed performance measurement with minimal implementation overhead.

Discussion

Benchmarking neuromorphic computing has faced challenges stemming from the diversity of neuromorphic approaches, the range of implementation and deployment tools, and rapid research evolution. NeuroBench addresses these challenges as a framework for the inclusive, actionable, and iterative benchmarking of neuromorphic solutions, by providing novel tasks and metrics together with open-source, extensible harness tools, and by facilitating systematic growth via community collaboration. NeuroBench is supported and developed by a broad community of neuromorphic researchers to be a standard, agreed-upon benchmarking framework for neuromorphic technology.

Initial NeuroBench benchmarks span applications across domains of continual learning, computer vision, sensorimotor prediction, and time-series forecasting, as well as system implementations for audio and optimization settings. Baselines for each benchmark of the complete v1.0 algorithm track demonstrate the utility and validity of the metric framework, and offer a starting point to further algorithmic research in model architecture and training for greater performance and lowered complexity.

The initial NeuroBench algorithm and system tracks achieve the first steps of designing benchmarks for algorithms executed in a digital time-stepped fashion and for systems in mature deployment stages. In order to expand inclusive and fair benchmarking to further approaches, the next milestones of the NeuroBench project will be to extend metrics and standard protocol designs to cover continuous-time execution and a wider range of hardware platforms, including FPGAs, custom integrated circuits, and more exploratory platforms in the simulation stage, such as memristive hardware.

Another important direction for NeuroBench is towards closed-loop benchmarks [15, 66]. Biological systems excel in interacting with dynamic environments, demonstrating high energy efficiency, real-time reaction, and versatility. As such, embodied intelligence with adaptive sensory and action capabilities is of interest to neuromorphic research. In closed-loop scenarios, the objective is to sense and act within an environment to complete a task, rather than to statically process a frozen dataset. The benchmark harness infrastructure and measurement protocols will therefore be extended to facilitate such benchmarks.

All future NeuroBench expansion will be informed by collected results and continue to be driven by the interests and development of the broader community.

Methods

This section outlines details and specifications of the benchmark metrics, tasks, and baselines.

Specifications of the Algorithm Track Metrics

NeuroBench includes correctness and complexity metrics, the latter of which are divided into static and workload metrics. Static metrics do not depend on the model inference and input data, while workload metrics do. Note that the defined metrics reflect only the model and model execution. Data pre-processors and post-processors are not taken into account in the v1.0 algorithm track results.

Footprint

The footprint metric reflects the memory footprint of a model. It is distinct from execution memory, which may incur further usage, e.g., to store activations. It is computed for a model by accumulating the sizes of the model's parameters and buffers, in bytes. Parameters store the model synaptic weights, and buffers include other inference memory requirements, such as the internal states of recurrent or spiking layers and buffers of recent input data, if the model must record data for input binning. Considering $n$ parameters, each requiring $p_{i}$ bytes, and $b$ buffers of size $q_{j}$ bytes, the total model footprint is $\sum_{i=1}^{n}p_{i}+\sum_{j=1}^{b}q_{j}$.
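As an illustration of the footprint sum, the sketch below accumulates bytes over a toy model description. The `(num_elements, bytes_per_element)` representation and the example layer sizes are assumptions made here for self-containment, not the harness data structures. For a PyTorch tensor, the two factors correspond to `numel()` and `element_size()`.

```python
def footprint_bytes(parameters, buffers):
    """Sum the storage of all parameters and buffers, in bytes.

    Each entry is a hypothetical (num_elements, bytes_per_element) pair.
    """
    param_bytes = sum(count * width for count, width in parameters)
    buffer_bytes = sum(count * width for count, width in buffers)
    return param_bytes + buffer_bytes


# Toy model (illustrative sizes): a 20 -> 10 fully connected layer stored
# in float32, plus a float32 membrane-potential state for its 10 neurons.
parameters = [(20 * 10, 4)]  # synaptic weights: 200 values * 4 bytes
buffers = [(10, 4)]          # neuron state: 10 values * 4 bytes

print(footprint_bytes(parameters, buffers))  # 800 + 40 = 840
```

Quantizing the weights to 8 bits in this toy case would change the parameter entry to `(200, 1)` and reduce the footprint to 240 bytes, which shows how the metric rewards lower-precision storage.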

Model Execution Rate

Execution rate is a metric which is not directly computed by the harness, but should be reported by the user. The metric reflects the real-time rate at which the model processes input data. If the model processes input with a temporal stride of $t$ seconds, then the rate should be reported as $t^{-1}$ Hz. Note the distinction between stride and bin window: input can be binned in overlapping windows, but execution rate depends only on the temporal stride of window processing. As an example, a model may use 50 ms windows of input and compute every 10 ms, which would give an execution rate of 100 Hz.
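The stride and window distinction can be checked with a few lines of arithmetic. This is a minimal sketch with hypothetical helper names; times are kept in integer milliseconds to avoid floating-point rounding.

```python
def execution_rate_hz(stride_ms):
    """Execution rate in Hz for a model that computes every `stride_ms` ms."""
    if stride_ms <= 0:
        raise ValueError("stride must be positive")
    return 1000 / stride_ms


def window_starts_ms(duration_ms, window_ms, stride_ms):
    """Start times of all complete input windows within the stream."""
    return list(range(0, duration_ms - window_ms + 1, stride_ms))


# 50 ms windows processed every 10 ms: windows overlap by 40 ms, but the
# rate depends on the stride alone.
print(execution_rate_hz(10))         # 100.0
print(window_starts_ms(100, 50, 10))  # [0, 10, 20, 30, 40, 50]
```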

This metric is currently not well-defined for models operating under event-based or continuous-time contexts. These limitations will be addressed in future benchmark versions.

Connection sparsity

The parameter matrices of each layer $l$ in a model, representing synaptic weights, are collected, and the number of zero weights $m_{l}$ and total weights $n_{l}$ are aggregated, with the connection sparsity defined as $\frac{\sum_{l}m_{l}}{\sum_{l}n_{l}}$ .
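A minimal sketch of this ratio, assuming each layer's weights are available as a nested list (a stand-in for the tensors a real harness would inspect):

```python
def connection_sparsity(layers):
    """Fraction of zero weights across all layers.

    `layers` is a list of weight matrices, each a list of rows.
    """
    zeros = sum(1 for W in layers for row in W for w in row if w == 0)
    total = sum(len(row) for W in layers for row in W)
    return zeros / total if total else 0.0


layer1 = [[0.0, 0.4, 0.0],
          [0.7, 0.0, -0.2]]  # 3 zeros out of 6 weights
layer2 = [[0.0, 1.1]]        # 1 zero out of 2 weights

print(connection_sparsity([layer1, layer2]))  # (3 + 1) / (6 + 2) = 0.5
```

Because zeros and totals are aggregated before dividing, large layers dominate the result. It is not the mean of per-layer sparsities.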

Activation sparsity

Activation sparsity is computed after the inference phase. The sparsity is calculated by accumulating the number of zero activations ($z$) over all neuron layers ($l$), timesteps ($t$), and input samples ($i$), and dividing by the total number of neurons ($N$) accumulated over the same layers, timesteps, and samples: $\frac{\sum_{l}\sum_{t}\sum_{i}z_{l,t}^{i}}{\sum_{l}\sum_{t}\sum_{i}N_{l,t}^{i}}$. The outputs of ReLU functions and spikes from spiking neurons are considered activations.
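The accumulation can be sketched as follows, assuming the recorded activations are indexed as `activations[sample][timestep][layer]`, a layout chosen here purely for illustration:

```python
def activation_sparsity(activations):
    """Zero activations divided by total activations recorded.

    `activations[i][t][l]` is the list of neuron outputs of layer l at
    timestep t for input sample i (hypothetical layout).
    """
    zeros = 0
    total = 0
    for sample in activations:
        for timestep in sample:
            for layer_output in timestep:
                zeros += sum(1 for a in layer_output if a == 0)
                total += len(layer_output)
    return zeros / total if total else 0.0


# One sample, two timesteps, two layers (4 and 2 neurons).
recorded = [
    [
        [[0, 1, 0, 0], [0, 1]],  # t = 0: 3 + 1 zeros
        [[0, 0, 0, 0], [1, 0]],  # t = 1: 4 + 1 zeros
    ]
]

print(activation_sparsity(recorded))  # 9 / 12 = 0.75
```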

Synaptic operations

Synaptic operations are the multiplication of weights by activation or input data, and are calculated using the inputs and weights of connection layers (e.g., torch.nn.Linear and torch.nn.Conv2d). Effective synaptic operations are operations where a non-zero weight is multiplied by a non-zero activation. Effective operations are further divided into multiply-accumulates (MACs) and accumulates (ACs), where accumulates correspond to activations or input data containing only values in $\{-1, 0, 1\}$, and multiply-accumulates cover all other cases. The reported number of synaptic operations is the average number of synaptic operations required per model execution, the rate of which is defined by the model execution rate metric.
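The MAC and AC distinction can be illustrated for a single fully connected layer. The sketch below counts effective operations directly and labels them by inspecting the input values. The function name and the per-layer labeling granularity are assumptions, not a description of the harness internals.

```python
def count_effective_ops(weights, inputs):
    """Return (kind, count) for one fully connected layer.

    `weights` is a list of rows (one per output neuron) and `inputs` is
    the input vector. An operation is effective when both the weight and
    the input are non-zero. The layer's operations are labeled "AC" when
    every input value is -1, 0, or 1, and "MAC" otherwise.
    """
    count = sum(
        1
        for row in weights
        for w, x in zip(row, inputs)
        if w != 0 and x != 0
    )
    kind = "AC" if all(x in (-1, 0, 1) for x in inputs) else "MAC"
    return kind, count


weights = [[0.5, 0.0, -1.2],
           [0.0, 0.3, 0.7]]

print(count_effective_ops(weights, [1, 0, 1]))        # ('AC', 3)
print(count_effective_ops(weights, [0.2, 0.0, 1.5]))  # ('MAC', 3)
```

Both inputs have the same sparsity pattern and therefore the same effective count. Only the value range of the input changes the label.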

The number of effective synaptic operations is computed by performing the forward pass of a layer and counting the operations in which neither the weight nor the activation is zero. In practice, this is implemented in the harness by setting all non-zero weights in the layer and all non-zero activations to 1, then performing the forward pass and summing the output to give the number of synaptic operations.

The number of dense synaptic operations is computed in a similar fashion, by setting all weights and activations to 1 and accumulating the output of the forward pass. Biases are not taken into account in the calculation of the synaptic operations, as they are added after weight multiplications and accumulation.
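The counting procedure described above can be sketched for a bias-free fully connected layer. This is a plain-Python stand-in, not the harness code. The idea is that once operands are replaced by 0/1 indicators, each output of the forward pass equals the number of operations that contributed to it.

```python
def linear_forward(weights, inputs):
    """Bias-free fully connected forward pass."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def effective_synops(weights, inputs):
    """Set non-zero weights and activations to 1, run the layer, sum outputs."""
    bin_w = [[1 if w != 0 else 0 for w in row] for row in weights]
    bin_x = [1 if x != 0 else 0 for x in inputs]
    return sum(linear_forward(bin_w, bin_x))


def dense_synops(weights, inputs):
    """Set all weights and activations to 1, run the layer, sum outputs."""
    ones_w = [[1] * len(row) for row in weights]
    ones_x = [1] * len(inputs)
    return sum(linear_forward(ones_w, ones_x))


weights = [[0.5, 0.0, -1.2],
           [0.0, 0.3, 0.7]]
spikes = [1, 0, 1]

print(effective_synops(weights, spikes))  # 3
print(dense_synops(weights, spikes))      # 6
```

Here the dense count is simply the number of weights (2 outputs times 3 inputs), while the effective count drops to 3 because of the zero weights and the silent input.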

Note that processing of activations before the connection layer, for instance using batch normalization, can transform sparse activations into dense input at the connection layer, which will lead to high effective synaptic operations despite high activation sparsity. Furthermore, such processing can transform binary activations to non-binary data, causing effective operations to be MACs rather than ACs. When deployed to neuromorphic hardware, such algorithms that normalize activations before multiplication with synaptic weights may lose the benefits of sparse operation, e.g., an SNN with normalization following each spiking layer would require dense MAC weight calculation, no matter how few spikes were generated.
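The densifying effect of normalization can be seen on a small example. The sketch below applies a simplified normalization (batch statistics only, no learnable scale or shift, which is an assumption made for brevity) to one feature's binary activations across a batch of eight samples.

```python
def normalize(values, eps=1e-5):
    """Subtract the mean and divide by the standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


def sparsity(values):
    """Fraction of exactly-zero entries."""
    return sum(1 for v in values if v == 0) / len(values)


spikes = [0, 0, 1, 0, 0, 0, 1, 0]  # binary and 75% sparse
normed = normalize(spikes)

print(sparsity(spikes))                      # 0.75
print(sparsity(normed))                      # 0.0
print(all(v in (-1, 0, 1) for v in normed))  # False
```

After normalization, every former zero becomes a non-zero, non-binary value, so a following connection layer would perform dense MACs even though the spiking layer itself was 75% sparse.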

In some cases, algorithm execution may have distinct temporal sections of higher and lower synaptic operations, such as during initial caching versus continuous inference. For such algorithms, benchmark users may choose to distinguish synaptic operations and other complexity measurements between execution sections.