arXiv:2409.19356 [pdf, other]

Steering Prediction via a Multi-Sensor System for Autonomous Racing

Authors: Zhuyun Zhou, Zongwei Wu, Florian Bolli, Rémi Boutteau, Fan Yang, Radu Timofte, Dominique Ginhac, Tobi Delbruck

Abstract: Autonomous racing has rapidly gained research attention. Traditionally, racing cars rely on 2D LiDAR as their primary visual system. In this work, we explore the integration of an event camera with the existing system to provide enhanced temporal information. Our goal is to fuse the 2D LiDAR data with event data in an end-to-end learning framework for steering prediction, which is crucial for autonomous racing. To the best of our knowledge, this is the first study addressing this challenging research topic. We start by creating a multisensor dataset specifically for steering prediction. Using this dataset, we establish a benchmark by evaluating various SOTA fusion methods. Our observations reveal that existing methods often incur substantial computational costs. To address this, we apply low-rank techniques to propose a novel, efficient, and effective fusion design. We introduce a new fusion learning policy to guide the fusion process, enhancing robustness against misalignment. Our fusion architecture provides better steering prediction than LiDAR alone, significantly reducing the RMSE from 7.72 to 1.28. Compared to the second-best fusion method, our work represents only 11% of the learnable parameters while achieving better accuracy. The source code, dataset, and benchmark will be released to promote future research.

Submitted 28 September, 2024; originally announced September 2024.

arXiv:2409.09648 [pdf, other]

SciDVS: A Scientific Event Camera with 1.7% Temporal Contrast Sensitivity at 0.7 lux

Authors: Rui Graca, Sheng Zhou, Brian McReynolds, Tobi Delbruck

Abstract: This paper reports a Dynamic Vision Sensor (DVS) event camera that is 6x more sensitive at 14x lower illumination than existing commercial and prototype cameras. Event cameras output a sparse stream of brightness change events. Their high dynamic range (HDR), quick response, and high temporal resolution provide key advantages for scientific applications that involve low lighting conditions and sparse visual events. However, current DVS are hindered by low sensitivity, resulting from shot noise and pixel-to-pixel mismatch. Commercial DVS have a minimum brightness change threshold of >10%. Sensitive prototypes achieved as low as 1%, but required kilo-lux illumination. Our SciDVS prototype fabricated in a 180nm CMOS image sensor process achieves 1.7% sensitivity at chip illumination of 0.7 lx and 18 Hz bandwidth. Novel features of SciDVS are (1) an auto-centering in-pixel preamplifier providing intrascene HDR and increased sensitivity, (2) improved control of bandwidth to limit shot noise, and (3) optional pixel binning, allowing the user to trade spatial resolution for sensitivity.

Submitted 15 September, 2024; originally announced September 2024.

Comments: Presented at ESSERC 2024

arXiv:2408.12425 [pdf, other]

doi 10.21437/Interspeech.2024-958

Dynamic Gated Recurrent Neural Network for Compute-efficient Speech Enhancement

Authors: Longbiao Cheng, Ashutosh Pandey, Buye Xu, Tobi Delbruck, Shih-Chii Liu

Abstract: This paper introduces a new Dynamic Gated Recurrent Neural Network (DG-RNN) for compute-efficient speech enhancement models running on resource-constrained hardware platforms. It leverages the slow evolution characteristic of RNN hidden states over steps, and updates only a selected set of neurons at each step by adding a newly proposed select gate to the RNN model. This select gate allows the computation cost of the conventional RNN to be reduced during network inference. As a realization of the DG-RNN, we further propose the Dynamic Gated Recurrent Unit (D-GRU) which does not require additional parameters. Test results obtained from several state-of-the-art compute-efficient RNN-based speech enhancement architectures using the DNS challenge dataset show that the D-GRU based model variants maintain speech intelligibility and quality metrics comparable to the baseline GRU based models even with an average 50% reduction in GRU computes.

Submitted 24 September, 2024; v1 submitted 22 August, 2024; originally announced August 2024.

Comments: Proceedings of Interspeech 2024

arXiv:2407.08681 [pdf, other]

Hardware Neural Control of CartPole and F1TENTH Race Car

Authors: Marcin Paluch, Florian Bolli, Xiang Deng, Antonio Rios Navarro, Chang Gao, Tobi Delbruck

Abstract: Nonlinear model predictive control (NMPC) has proven to be an effective control method, but it is expensive to compute. This work demonstrates the use of hardware FPGA neural network controllers trained to imitate NMPC with supervised learning. We use these Neural Controllers (NCs) implemented on inexpensive embedded FPGA hardware for high frequency control on a physical cartpole and an F1TENTH race car. Our results show that the NCs match the control performance of the NMPCs in simulation and outperform them in reality, due to the faster control rate that is afforded by the quick FPGA NC inference. We demonstrate kHz control rates for a physical cartpole and offloading control to the FPGA hardware on the F1TENTH car. Code and hardware implementation for this paper are available at https://github.com/SensorsINI/Neural-Control-Tools.

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2405.03905 [pdf, other]

A 65nm 36nJ/Decision Bio-inspired Temporal-Sparsity-Aware Digital Keyword Spotting IC with 0.6V Near-Threshold SRAM

Authors: Qinyu Chen, Kwantae Kim, Chang Gao, Sheng Zhou, Taekwang Jang, Tobi Delbruck, Shih-Chii Liu

Abstract: This paper introduces, to the best of the authors' knowledge, the first fine-grained temporal sparsity-aware keyword spotting (KWS) IC leveraging temporal similarities between neighboring feature vectors extracted from input frames and network hidden states, eliminating unnecessary operations and memory accesses. This KWS IC, featuring a bio-inspired delta-gated recurrent neural network (ΔRNN) classifier, achieves an 11-class Google Speech Command Dataset (GSCD) KWS accuracy of 90.5% and energy consumption of 36nJ/decision. At 87% temporal sparsity, computing latency and energy per inference are reduced by 2.4$\times$/3.4$\times$, respectively. The 65nm design occupies 0.78mm$^2$ and features two additional blocks, a compact 0.084mm$^2$ digital infinite-impulse-response (IIR)-based band-pass filter (BPF) audio feature extractor (FEx) and a 24kB 0.6V near-Vth weight SRAM with 6.6$\times$ lower read power compared to the standard SRAM.

Submitted 6 May, 2024; originally announced May 2024.

arXiv:2312.09391 [pdf, other]

Exploiting Symmetric Temporally Sparse BPTT for Efficient RNN Training

Authors: Xi Chen, Chang Gao, Zuowen Wang, Longbiao Cheng, Sheng Zhou, Shih-Chii Liu, Tobi Delbruck

Abstract: Recurrent Neural Networks (RNNs) are useful in temporal sequence tasks. However, training RNNs involves dense matrix multiplications which require hardware that can support a large number of arithmetic operations and memory accesses. Implementing online training of RNNs on the edge calls for optimized algorithms for an efficient deployment on hardware. Inspired by the spiking neuron model, the Delta RNN exploits temporal sparsity during inference by skipping over the update of hidden states from those inactivated neurons whose change of activation across two timesteps is below a defined threshold. This work describes a training algorithm for Delta RNNs that exploits temporal sparsity in the backward propagation phase to reduce computational requirements for training on the edge. Due to the symmetric computation graphs of forward and backward propagation during training, the gradient computation of inactivated neurons can be skipped. Results show a reduction of $\sim$80% in matrix operations for training a 56k parameter Delta LSTM on the Fluent Speech Commands dataset with negligible accuracy loss. Logic simulations of a hardware accelerator designed for the training algorithm show 2-10X speedup in matrix computations for an activation sparsity range of 50%-90%. Additionally, we show that the proposed Delta RNN training will be useful for online incremental learning on edge devices with limited computing resources.

Submitted 14 December, 2023; originally announced December 2023.

Comments: Accepted by the 38th Annual AAAI Conference on Artificial Intelligence (AAAI-24)
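
The delta-thresholding idea described in the abstract above can be illustrated with a short sketch. This is not the authors' implementation: the function name `delta_matvec_sequence`, its arguments, and the plain-list matrix format are assumptions made for illustration, and it covers only the forward matrix-vector product, not the sparse backward pass the paper proposes. An input is treated as changed only when it differs from its last transmitted value by at least `threshold`, and only the matrix columns of changed inputs are touched.

```python
def delta_matvec_sequence(W, xs, threshold):
    """Approximate W @ x_t for each step t using thresholded input deltas.

    W is a list of rows, xs is a list of input vectors (hypothetical format).
    Returns the per-step outputs and the fraction of skipped column updates.
    """
    n_out, n_in = len(W), len(W[0])
    x_ref = [0.0] * n_in   # last transmitted value of each input
    acc = [0.0] * n_out    # running result, equal to W @ x_ref
    outputs, skipped = [], 0
    for x in xs:
        for j in range(n_in):
            delta = x[j] - x_ref[j]
            if abs(delta) < threshold:
                skipped += 1       # sub-threshold change: skip this column
                continue
            for i in range(n_out):
                acc[i] += W[i][j] * delta
            x_ref[j] = x[j]        # remember what was transmitted
        outputs.append(list(acc))
    return outputs, skipped / (len(xs) * n_in)


W = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
xs = [[1.0, 1.0, 1.0]] * 4         # constant input: only step 0 does work
outputs, sparsity = delta_matvec_sequence(W, xs, threshold=0.1)
```

With a constant input, every column update after the first step is skipped, so the measured temporal sparsity here is 0.75 while the outputs stay equal to the dense product.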

arXiv:2304.07543 [pdf, other]

Within-Camera Multilayer Perceptron DVS Denoising

Authors: A. Rios-Navarro, S. Guo, G Abarajithan, K. Vijayakumar, A. Linares-Barranco, T. Aarrestad, R. Kastner, T. Delbruck

Abstract: In-camera event denoising reduces the data rate of event cameras by filtering out noise at the source. A lightweight multilayer perceptron denoising filter (MLPF) provides state-of-the-art low-cost denoising accuracy. It processes a small neighborhood of pixels from the timestamp image around each event to discriminate signal and noise events. This paper proposes two digital logic implementations of the MLPF denoiser and quantifies their resource cost, power, and latency. The hardware MLPF quantizes the weights and hidden unit activations to 4 bits and has about 1k weights with about 40% sparsity. The Area-Under-Curve Receiver Operating Characteristic accuracy is nearly indistinguishable from that of the floating point network. The FPGA MLPF processes each event in 10 clock cycles. In FPGA, it uses 3.5k flip flops and 11.5k LUTs. Our ASIC implementation in 65nm digital technology for a 346x260 pixel camera occupies an area of 4.3mm^2 and consumes 4nJ of energy per event at event rates up to 25MHz. The MLPF can be easily integrated into an event camera using an FPGA or as an ASIC directly on the camera chip or in the same package. This denoising could dramatically reduce the energy consumed by the communication and host processor and open new areas of always-on event camera application under scavenged and battery power. Code: https://github.com/SensorsINI/dnd_hls

Submitted 15 April, 2023; originally announced April 2023.

Comments: Accepted to 2023 CVPRW Workshop on Event-Based Vision

arXiv:2304.04706 [pdf, other]

Shining light on the DVS pixel: A tutorial and discussion about biasing and optimization

Authors: Rui Graça, Brian McReynolds, Tobi Delbruck

Abstract: The operation of the DVS event camera is controlled by the user through adjusting different bias parameters. These biases affect the response of the camera by controlling - among other parameters - the bandwidth, sensitivity, and maximum firing rate of the pixels. Besides determining the response of the camera to input signals, biases significantly impact its noise performance. Bias optimization is a multivariate process depending on the task and the scene, to which the user's knowledge about pixel design and non-idealities can be of great importance. In this paper, we go step-by-step along the signal pathway of the DVS pixel, shining light on its low-level operation and non-idealities, comparing pixel level measurements with array level measurements, and discussing how biasing and illumination affect the pixel's behavior. With the results and discussion presented, we aim to help DVS users achieve more hardware-aware camera utilization and modelling.

Submitted 11 April, 2023; v1 submitted 10 April, 2023; originally announced April 2023.

Comments: Accepted at 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); 4th International Workshop on Event-Based Vision

arXiv:2304.04019 [pdf, other]

Optimal biasing and physical limits of DVS event noise

Authors: Rui Graca, Brian McReynolds, Tobi Delbruck

Abstract: Under dim lighting conditions, the output of Dynamic Vision Sensor (DVS) event cameras is strongly affected by noise. Photon and electron shot-noise cause a high rate of non-informative events that reduce Signal to Noise ratio. DVS noise performance depends not only on the scene illumination, but also on the user-controllable biasing of the camera. In this paper, we explore the physical limits of DVS noise, showing that the DVS photoreceptor is limited to a theoretical minimum of 2x photon shot noise, and we discuss how biasing the DVS with high photoreceptor bias and adequate source-follower bias approaches optimal noise performance. We support our conclusions with pixel-level measurements of a DAVIS346 and analysis of a theoretical pixel model.

Submitted 12 April, 2023; v1 submitted 8 April, 2023; originally announced April 2023.

Comments: Accepted to the 2023 International Image Sensor Workshop (IISW)

arXiv:2304.03494 [pdf, other]

Exploiting Alternating DVS Shot Noise Event Pair Statistics to Reduce Background Activity

Authors: Brian McReynolds, Rui Graca, Tobi Delbruck

Abstract: Dynamic Vision Sensors (DVS) record "events" corresponding to pixel-level brightness changes, resulting in data-efficient representation of a dynamic visual scene. As DVS expand into increasingly diverse applications, non-ideal behaviors in their output under extreme sensing conditions are important to consider. Under low illumination (below ~10 lux) their output begins to be dominated by shot noise events (SNEs) which increase the data output and obscure true signal. SNE rates can be controlled to some degree by tuning circuit parameters to reduce sensitivity or temporal response bandwidth at the cost of signal loss. Alternatively, an improved understanding of SNE statistics can be leveraged to develop novel techniques for minimizing uninformative sensor output. We first explain a fundamental observation about sequential pairing of opposite polarity SNEs based on pixel circuit logic and validate our theory using DVS recordings and simulations. Finally, we derive a practical result from this new understanding and demonstrate two novel biasing techniques to reduce SNEs by 50% and 80% respectively while still retaining sensitivity and/or temporal resolution.

Submitted 12 April, 2023; v1 submitted 7 April, 2023; originally announced April 2023.

Comments: IISW 2023, paper R5.6

arXiv:2211.09893 [pdf, other]

Measuring diameters and velocities of artificial raindrops with a neuromorphic dynamic vision sensor disdrometer

Authors: Jan Steiner, Kire Micev, Asude Aydin, Jörg Rieckermann, Tobi Delbruck

Abstract: Hydrometers that can measure size and velocity distributions of precipitation are needed for research and corrections of rainfall estimates from weather radars and microwave links. Existing video disdrometers measure drop size distributions, but underestimate small raindrops and are impractical for widespread always-on IoT deployment. We propose an innovative method of measuring droplet size and velocity using a neuromorphic event camera. These dynamic vision sensors asynchronously output a sparse stream of pixel brightness changes. Droplets falling through the plane of focus create events generated by the motion of the droplet. Droplet size and speed are inferred from the stream of events. Using an improved hard disk arm actuator to reliably generate artificial raindrops, our experiments show small errors of 7% (maximum mean absolute percentage error) for droplet sizes from 0.3 to 2.5 mm and speeds from 1.3 m/s to 8.0 m/s. Each droplet requires the processing of only a few hundred to thousands of events, potentially enabling low-power always-on disdrometers that consume power proportional to the rainfall rate.

Submitted 17 November, 2022; originally announced November 2022.

Comments: 7 pages and 2 figures, plus supplementary 12 pages and 10 figures. Submitted to Atmospheric Measurement Techniques. Data and code at https://drive.google.com/drive/folders/153C2YDQh-AFjdBd1kromg9BBv2esfq8e?usp=sharing

arXiv:2208.00693 [pdf, other]

doi 10.1109/JSSC.2022.3195610

A 23 $μ$W Keyword Spotting IC with Ring-Oscillator-Based Time-Domain Feature Extraction

Authors: Kwantae Kim, Chang Gao, Rui Graça, Ilya Kiselev, Hoi-Jun Yoo, Tobi Delbruck, Shih-Chii Liu

Abstract: This article presents the first keyword spotting (KWS) IC which uses a ring-oscillator-based time-domain processing technique for its analog feature extractor (FEx). Its extensive usage of time-encoding schemes allows the analog audio signal to be processed in a fully time-domain manner except for the voltage-to-time conversion stage of the analog front-end. Benefiting from fundamental building blocks based on digital logic gates, it offers a better technology scalability compared to conventional voltage-domain designs. Fabricated in a 65 nm CMOS process, the prototyped KWS IC occupies 2.03mm$^{2}$ and dissipates 23 $μ$W power consumption including analog FEx and digital neural network classifier. The 16-channel time-domain FEx achieves 54.89 dB dynamic range for 16 ms frame shift size while consuming 9.3 $μ$W. The measurement result verifies that the proposed IC performs a 12-class KWS task on the Google Speech Command Dataset (GSCD) with >86% accuracy and 12.4 ms latency.

Submitted 1 August, 2022; originally announced August 2022.

Comments: 14 pages, 21 figures, 2 tables

arXiv:2202.13076 [pdf, other]

Utility and Feasibility of a Center Surround Event Camera

Authors: Tobi Delbruck, Chenghan Li, Rui Graca, Brian McReynolds

Abstract: Standard dynamic vision sensor (DVS) event cameras output a stream of spatially-independent log-intensity brightness change events so they cannot suppress spatial redundancy. Nearly all biological retinas use an antagonistic center-surround organization. This paper proposes a practical method of implementing a compact, energy-efficient Center Surround DVS (CSDVS) with a surround smoothing network that uses compact polysilicon resistors for lateral resistance. The paper includes behavioral simulation results for the CSDVS (see sites.google.com/view/csdvs/home). The CSDVS would significantly reduce events caused by low spatial frequencies, but amplify the informative high frequency spatiotemporal events.

Submitted 26 February, 2022; originally announced February 2022.

Comments: 5 pages, submitted to 29th IEEE International Conference on Image Processing (IEEE ICIP 2022)

ACM Class: B.7.1; I.4.1

arXiv:2112.01933 [pdf, other]

Bio-inspired Polarization Event Camera

Authors: Germain Haessig, Damien Joubert, Justin Haque, Yingkai Chen, Moritz Milde, Tobi Delbruck, Viktor Gruev

Abstract: The stomatopod (mantis shrimp) visual system has recently provided a blueprint for the design of paradigm-shifting polarization and multispectral imaging sensors, enabling solutions to challenging medical and remote sensing problems. However, these bioinspired sensors lack the high dynamic range (HDR) and asynchronous polarization vision capabilities of the stomatopod visual system, limiting temporal resolution to ~12 ms and dynamic range to ~72 dB. Here we present a novel stomatopod-inspired polarization camera which mimics the sustained and transient biological visual pathways to save power and sample data beyond the maximum Nyquist frame rate. This bio-inspired sensor simultaneously captures both synchronous intensity frames and asynchronous polarization brightness change information with sub-millisecond latencies over a million-fold range of illumination. Our PDAVIS camera is comprised of 346x260 pixels, organized in 2-by-2 macropixels, which filter the incoming light with four linear polarization filters offset by 45 degrees. Polarization information is reconstructed using both low cost and latency event-based algorithms and more accurate but slower deep neural networks. Our sensor is used to image HDR polarization scenes which vary at high speeds and to observe dynamical properties of single collagen fibers in bovine tendon under rapid cyclical loads.

Submitted 2 December, 2021; originally announced December 2021.

arXiv:2109.08640 [pdf, other]

Unraveling the paradox of intensity-dependent DVS pixel noise

Authors: Rui Graca, Tobi Delbruck

Abstract: Dynamic vision sensor (DVS) event camera output is affected by noise, particularly in dim lighting conditions. A theory explaining how photon and electron noise affect DVS output events has so far not been developed. Moreover, there is no clear understanding of how DVS parameters and operating conditions affect noise. There is an apparent paradox between the real noise data observed from the DVS output and the reported noise measurements of the logarithmic photoreceptor. While measurements of the logarithmic photoreceptor predict that the photoreceptor is approximately a first-order system with RMS noise voltage independent of the photocurrent, DVS output shows higher noise event rates at low light intensity. This paper unravels this paradox by showing how the DVS photoreceptor is a second-order system, and the assumption that it is first-order is generally not reasonable. As we show, at higher photocurrents, the photoreceptor amplifier dominates the frequency response, causing a drop in RMS noise voltage and noise event rate. We bring light to the noise performance of the DVS photoreceptor by presenting a theoretical explanation supported by both transistor-level simulation results and chip measurements.

Submitted 17 September, 2021; originally announced September 2021.

Comments: Presented in 2021 International Image Sensor Workshop (IISW)

Journal ref: 2021 International Image Sensor Workshop (IISW)

arXiv:2108.02297 [pdf, other]

doi 10.1109/TNNLS.2022.3180209

Spartus: A 9.4 TOp/s FPGA-based LSTM Accelerator Exploiting Spatio-Temporal Sparsity

Authors: Chang Gao, Tobi Delbruck, Shih-Chii Liu

Abstract: Long Short-Term Memory (LSTM) recurrent networks are frequently used for tasks involving time-sequential data such as speech recognition. Unlike previous LSTM accelerators that either exploit spatial weight sparsity or temporal activation sparsity, this paper proposes a new accelerator called "Spartus" that exploits spatio-temporal sparsity to achieve ultra-low latency inference. Spatial sparsity is induced using a new Column-Balanced Targeted Dropout (CBTD) structured pruning method, producing structured sparse weight matrices for a balanced workload. The pruned networks running on Spartus hardware achieve weight sparsity levels of up to 96% and 94% with negligible accuracy loss on the TIMIT and the Librispeech datasets. To induce temporal sparsity in LSTM, we extend the previous DeltaGRU method to the DeltaLSTM method. Combining spatio-temporal sparsity with CBTD and DeltaLSTM saves on weight memory access and associated arithmetic operations. The Spartus architecture is scalable and supports real-time online speech recognition when implemented on small and large FPGAs. Spartus per-sample latency for a single DeltaLSTM layer of 1024 neurons averages 1 us. Exploiting spatio-temporal sparsity on our test LSTM network using the TIMIT dataset leads to 46X speedup of Spartus over its theoretical hardware performance to achieve 9.4 TOp/s effective batch-1 throughput and 1.1 TOp/s/W power efficiency.

Submitted 13 June, 2022; v1 submitted 4 August, 2021; originally announced August 2021.

Comments: Accepted for publication in IEEE Transactions on Neural Networks and Learning Systems, 2022

Journal ref: IEEE Transactions on Neural Networks and Learning Systems, 2022

arXiv:2105.00409 [pdf, other]

Feedback control of event cameras

Authors: Tobi Delbruck, Rui Graca, Marcin Paluch

Abstract: Dynamic vision sensor event cameras produce a variable data rate stream of brightness change events. Event production at the pixel level is controlled by threshold, bandwidth, and refractory period bias current parameter settings. Biases must be adjusted to match application requirements and the optimal settings depend on many factors. As a first step towards automatic control of biases, this paper proposes fixed-step feedback controllers that use measurements of event rate and noise. The controllers regulate the event rate within an acceptable range using threshold and refractory period control, and regulate noise using bandwidth control. Experiments demonstrate model validity and feedback control.

Submitted 2 May, 2021; originally announced May 2021.

Comments: Accepted at 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); Third International Workshop on Event-Based Vision
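
As a rough illustration of what a fixed-step feedback controller of the kind named in the abstract above might look like, the sketch below nudges a single threshold value by a constant step whenever the measured event rate leaves an acceptable band. Everything here is an assumption for illustration: the function name `fixed_step_update`, the parameter names, the units, and the choice of a single controlled quantity. The paper's controllers also act on refractory period and bandwidth, which this sketch does not model.

```python
def fixed_step_update(threshold, event_rate, rate_low, rate_high,
                      step, th_min, th_max):
    """Return the next threshold given the measured event rate.

    A higher threshold is assumed to produce fewer events, so the
    threshold is raised when the rate is too high and lowered when it
    is too low. Inside the band [rate_low, rate_high] nothing changes.
    """
    if event_rate > rate_high:
        threshold += step      # too many events: make pixels less sensitive
    elif event_rate < rate_low:
        threshold -= step      # too few events: make pixels more sensitive
    # Keep the setting inside its allowed range.
    return min(max(threshold, th_min), th_max)


# Hypothetical numbers: keep the rate between 0.1 and 1 Mevents/s.
new_threshold = fixed_step_update(
    threshold=0.20, event_rate=5e6, rate_low=1e5, rate_high=1e6,
    step=0.01, th_min=0.05, th_max=0.50)
```

The dead band between the two rate bounds keeps the setting from toggling on every measurement, and the constant step bounds how fast it can move per update.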

arXiv:2012.13600 [pdf, other]

doi 10.1109/JETCAS.2020.3040300

EdgeDRNN: Recurrent Neural Network Accelerator for Edge Inference

Authors: Chang Gao, Antonio Rios-Navarro, Xi Chen, Shih-Chii Liu, Tobi Delbruck

Abstract: Low-latency, low-power portable recurrent neural network (RNN) accelerators offer powerful inference capabilities for real-time applications such as IoT, robotics, and human-machine interaction. We propose a lightweight Gated Recurrent Unit (GRU)-based RNN accelerator called EdgeDRNN that is optimized for low-latency edge RNN inference with batch size of 1. EdgeDRNN adopts the spiking neural network inspired delta network algorithm to exploit temporal sparsity in RNNs. Weights are stored in inexpensive DRAM which enables EdgeDRNN to compute large multi-layer RNNs on the most inexpensive FPGA. The sparse updates reduce DRAM weight memory access by a factor of up to 10x and the delta can be varied dynamically to trade-off between latency and accuracy. EdgeDRNN updates a 5 million parameter 2-layer GRU-RNN in about 0.5ms. It achieves latency comparable with a 92W Nvidia 1080 GPU. It outperforms NVIDIA Jetson Nano, Jetson TX2 and Intel Neural Compute Stick 2 in latency by 5X. For a batch size of 1, EdgeDRNN achieves a mean effective throughput of 20.2GOp/s and a wall plug power efficiency that is over 4X higher than the commercial edge AI platforms.

Submitted 25 December, 2020; originally announced December 2020.

Journal ref: in IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 10, no. 4, pp. 419-432, Dec. 2020

arXiv:2006.07722 [pdf, other]

v2e: From Video Frames to Realistic DVS Events

Authors: Yuhuang Hu, Shih-Chii Liu, Tobi Delbruck

Abstract: To help meet the increasing need for dynamic vision sensor (DVS) event camera data, this paper proposes the v2e toolbox that generates realistic synthetic DVS events from intensity frames. It also clarifies incorrect claims about DVS motion blur and latency characteristics in recent literature. Unlike other toolboxes, v2e includes pixel-level Gaussian event threshold mismatch, finite intensity-dependent bandwidth, and intensity-dependent noise. Realistic DVS events are useful in training networks for uncontrolled lighting conditions. The use of v2e synthetic events is demonstrated in two experiments. The first experiment is object recognition with N-Caltech 101 dataset. Results show that pretraining on various v2e lighting conditions improves generalization when transferred on real DVS data for a ResNet model. The second experiment shows that for night driving, a car detector trained with v2e events shows an average accuracy improvement of 40% compared to the YOLOv3 trained on intensity frames.

Submitted 19 April, 2021; v1 submitted 13 June, 2020; originally announced June 2020.

Comments: Accepted at 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); Third International Workshop on Event-Based Vision

arXiv:2005.08605 [pdf, other]

DDD20 End-to-End Event Camera Driving Dataset: Fusing Frames and Events with Deep Learning for Improved Steering Prediction

Authors: Yuhuang Hu, Jonathan Binas, Daniel Neil, Shih-Chii Liu, Tobi Delbruck

Abstract: Neuromorphic event cameras are useful for dynamic vision problems under difficult lighting conditions. To enable studies of using event cameras in automobile driving applications, this paper reports a new end-to-end driving dataset called DDD20. The dataset was captured with a DAVIS camera that concurrently streams both dynamic vision sensor (DVS) brightness change events and active pixel sensor (APS) intensity frames. DDD20 is the longest event camera end-to-end driving dataset to date with 51h of DAVIS event+frame camera and vehicle human control data collected from 4000km of highway and urban driving under a variety of lighting conditions. Using DDD20, we report the first study of fusing brightness change events and intensity frame data using a deep learning approach to predict the instantaneous human steering wheel angle. Over all day and night conditions, the explained variance for human steering prediction from a Resnet-32 is significantly better from the fused DVS+APS frames (0.88) than using either DVS (0.67) or APS (0.77) data alone.

Submitted 18 May, 2020; originally announced May 2020.

Comments: Accepted in The 23rd IEEE International Conference on Intelligent Transportation Systems (Special Session: Beyond Traditional Sensing for Intelligent Transportation)

arXiv:2003.13006 [pdf, other]

Data-Driven Neuromorphic DRAM-based CNN and RNN Accelerators

Authors: Tobi Delbruck, Shih-Chii Liu

Abstract: The energy consumed by running large deep neural networks (DNNs) on hardware accelerators is dominated by the need for lots of fast memory to store both states and weights. This large required memory is currently only economically viable through DRAM. Although DRAM is high-throughput and low-cost memory (costing 20X less than SRAM), its long random access latency is bad for the unpredictable access patterns in spiking neural networks (SNNs). In addition, accessing data from DRAM costs orders of magnitude more energy than doing arithmetic with that data. SNNs are energy-efficient if local memory is available and few spikes are generated. This paper reports on our developments over the last 5 years of convolutional and recurrent deep neural network hardware accelerators that exploit either spatial or temporal sparsity similar to SNNs but achieve SOA throughput, power efficiency and latency even with the use of DRAM for the required storage of the weights and states of large DNNs.

Submitted 29 March, 2020; originally announced March 2020.

Comments: To appear in 2019 IEEE Sig. Proc. Soc. Asilomar Conference on Signals, Systems, and Computers Session MP6b: Neuromorphic Computing (Invited)

ACM Class: I.5.5

arXiv:2003.10959 [pdf, other]

Learning to Exploit Multiple Vision Modalities by Using Grafted Networks

Authors: Yuhuang Hu, Tobi Delbruck, Shih-Chii Liu

Abstract: Novel vision sensors such as thermal, hyperspectral, polarization, and event cameras provide information that is not available from conventional intensity cameras. An obstacle to using these sensors with current powerful deep neural networks is the lack of large labeled training datasets. This paper proposes a Network Grafting Algorithm (NGA), where a new front end network driven by unconventional visual inputs replaces the front end network of a pretrained deep network that processes intensity frames. The self-supervised training uses only synchronously-recorded intensity frames and novel sensor data to maximize feature similarity between the pretrained network and the grafted network. We show that the enhanced grafted network reaches competitive average precision (AP50) scores to the pretrained network on an object detection task using thermal and event camera datasets, with no increase in inference costs. Particularly, the grafted network driven by thermal frames showed a relative improvement of 49.11% over the use of intensity frames. The grafted front end has only 5-8% of the total parameters and can be trained in a few hours on a single GPU equivalent to 5% of the time that would be needed to train the entire object detector from labeled data. NGA allows new vision sensors to capitalize on previously pretrained powerful deep models, saving on training cost and widening a range of applications for novel sensors.

Submitted 22 July, 2020; v1 submitted 24 March, 2020; originally announced March 2020.

Comments: Accepted at ECCV 2020, 14 pages

arXiv:2002.03197 [pdf, other]

doi 10.1109/ICRA40945.2020.9196984

Recurrent Neural Network Control of a Hybrid Dynamic Transfemoral Prosthesis with EdgeDRNN Accelerator

Authors: Chang Gao, Rachel Gehlhar, Aaron D. Ames, Shih-Chii Liu, Tobi Delbruck

Abstract: Lower leg prostheses could improve the life quality of amputees by increasing comfort and reducing energy to locomote, but currently control methods are limited in modulating behaviors based upon the human's experience. This paper describes the first steps toward learning complex controllers for dynamical robotic assistive devices. We provide the first example of behavioral cloning to control a powered transfemoral prostheses using a Gated Recurrent Unit (GRU) based recurrent neural network (RNN) running on a custom hardware accelerator that exploits temporal sparsity. The RNN is trained on data collected from the original prosthesis controller. The RNN inference is realized by a novel EdgeDRNN accelerator in real-time. Experimental results show that the RNN can replace the nominal PD controller to realize end-to-end control of the AMPRO3 prosthetic leg walking on flat ground and unforeseen slopes with comparable tracking accuracy. EdgeDRNN computes the RNN about 240 times faster than real time, opening the possibility of running larger networks for more complex tasks in the future. Implementing an RNN on this real-time dynamical system with impacts sets the ground work to incorporate other learned elements of the human-prosthesis system into prosthesis control.

Submitted 28 July, 2020; v1 submitted 8 February, 2020; originally announced February 2020.

Comments: Accepted at 2020 International Conference on Robotics and Automation (ICRA 2020)

Journal ref: 2020 IEEE International Conference on Robotics and Automation (ICRA)

arXiv:1912.12193 [pdf, ps, other]

doi 10.1109/AICAS48895.2020.9074001

EdgeDRNN: Enabling Low-latency Recurrent Neural Network Edge Inference

Authors: Chang Gao, Antonio Rios-Navarro, Xi Chen, Tobi Delbruck, Shih-Chii Liu

Abstract: This paper presents a Gated Recurrent Unit (GRU) based recurrent neural network (RNN) accelerator called EdgeDRNN designed for portable edge computing. EdgeDRNN adopts the spiking neural network inspired delta network algorithm to exploit temporal sparsity in RNNs. It reduces off-chip memory access by a factor of up to 10x with tolerable accuracy loss. Experimental results on a 10 million parameter 2-layer GRU-RNN, with weights stored in DRAM, show that EdgeDRNN computes them in under 0.5 ms. With 2.42 W wall plug power on an entry level USB powered FPGA board, it achieves latency comparable with a 92 W Nvidia 1080 GPU. It outperforms NVIDIA Jetson Nano, Jetson TX2 and Intel Neural Compute Stick 2 in latency by 6X. For a batch size of 1, EdgeDRNN achieves a mean effective throughput of 20.2 GOp/s and a wall plug power efficiency that is over 4X higher than all other platforms.

Submitted 28 July, 2020; v1 submitted 22 December, 2019; originally announced December 2019.

Comments: This paper has been accepted for publication at the IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), Genoa, 2020

Journal ref: 2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)

arXiv:1906.08859 [pdf, other]

Closing the Accuracy Gap in an Event-Based Visual Recognition Task

Authors: Bodo Rückauer, Nicolas Känzig, Shih-Chii Liu, Tobi Delbruck, Yulia Sandamirskaya

Abstract: Mobile and embedded applications require neural networks-based pattern recognition systems to perform well under a tight computational budget. In contrast to commonly used synchronous, frame-based vision systems and CNNs, asynchronous, spiking neural networks driven by event-based visual input respond with low latency to sparse, salient features in the input, leading to high efficiency at run-time. The discrete nature of the event-based data streams makes direct training of asynchronous neural networks challenging. This paper studies asynchronous spiking neural networks, obtained by conversion from a conventional CNN trained on frame-based data. As an example, we consider a CNN trained to steer a robot to follow a moving target. We identify possible pitfalls of the conversion and demonstrate how the proposed solutions bring the classification accuracy of the asynchronous network to only 3% below the performance of the original synchronous CNN, while requiring 12x fewer computations. While being applied to a simple task, this work is an important step towards low-power, fast, and embedded neural networks-based vision solutions for robotic applications.

Submitted 6 May, 2019; originally announced June 2019.

arXiv:1905.07419 [pdf, other]

Dynamic Vision Sensor integration on FPGA-based CNN accelerators for high-speed visual classification

Authors: Alejandro Linares-Barranco, Antonio Rios-Navarro, Ricardo Tapiador-Morales, Tobi Delbruck

Abstract: Deep-learning is a cutting edge theory that is being applied to many fields. For vision applications the Convolutional Neural Networks (CNN) are demanding significant accuracy for classification tasks. Numerous hardware accelerators have populated during the last years to improve CPU or GPU based solutions. This technology is commonly prototyped and tested over FPGAs before being considered for ASIC fabrication for mass production. The use of commercial typical cameras (30fps) limits the capabilities of these systems for high speed applications. The use of dynamic vision sensors (DVS) that emulate the behavior of a biological retina is taking an incremental importance to improve this applications due to its nature, where the information is represented by a continuous stream of spikes and the frames to be processed by the CNN are constructed collecting a fixed number of these spikes (called events). The faster an object is, the more events are produced by DVS, so the higher is the equivalent frame rate. Therefore, these DVS utilization allows to compute a frame at the maximum speed a CNN accelerator can offer. In this paper we present a VHDL/HLS description of a pipelined design for FPGA able to collect events from an Address-Event-Representation (AER) DVS retina to obtain a normalized histogram to be used by a particular CNN accelerator, called NullHop. VHDL is used to describe the circuit, and HLS for computation blocks, which are used to perform the normalization of a frame needed for the CNN.
Results outperform previous implementations of frames collection and normalization using ARM processors running at 800MHz on a Zynq7100 in both latency and power consumption. A measured 67% speedup factor is presented for a Roshambo CNN real-time experiment running at 160fps peak rate.

Submitted 17 May, 2019; originally announced May 2019.

Comments: 7 pages
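
The abstract above describes building CNN input frames from a fixed number of DVS events and normalizing the resulting histogram. A minimal software sketch of that idea follows. It is not the paper's VHDL/HLS pipeline: the function name `events_to_frames`, the `(x, y)` event format, and the divide-by-peak normalization are assumptions made for illustration only.

```python
def events_to_frames(events, width, height, events_per_frame):
    """Group (x, y) events into frames holding a fixed number of events each.

    Each frame is a 2D histogram of event counts, scaled to [0, 1] by its
    peak count (an assumed normalization, chosen for simplicity).
    """
    frames = []
    hist = [[0] * width for _ in range(height)]
    count = 0
    for x, y in events:
        hist[y][x] += 1
        count += 1
        if count == events_per_frame:
            # A full frame has at least one event, so peak >= 1.
            peak = max(max(row) for row in hist)
            frames.append([[v / peak for v in row] for row in hist])
            hist = [[0] * width for _ in range(height)]
            count = 0
    # Events left over after the last full frame are dropped in this sketch.
    return frames
```

Because a frame is emitted whenever `events_per_frame` events have arrived, faster motion (more events per second) yields more frames per second, which is the variable equivalent frame rate the abstract refers to.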

arXiv:1904.08405 [pdf, other]

doi 10.1109/TPAMI.2020.3008413

Event-based Vision: A Survey

Authors: Guillermo Gallego, Tobi Delbruck, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew Davison, Joerg Conradt, Kostas Daniilidis, Davide Scaramuzza

Abstract: Event cameras are bio-inspired sensors that differ from conventional frame cameras: Instead of capturing images at a fixed rate, they asynchronously measure per-pixel brightness changes, and output a stream of events that encode the time, location and sign of the brightness changes. Event cameras offer attractive properties compared to traditional cameras: high temporal resolution (in the order of microseconds), very high dynamic range (140 dB vs. 60 dB), low power consumption, and high pixel bandwidth (on the order of kHz) resulting in reduced motion blur. Hence, event cameras have a large potential for robotics and computer vision in challenging scenarios for traditional cameras, such as low-latency, high speed, and high dynamic range. However, novel methods are required to process the unconventional output of these sensors in order to unlock their potential. This paper provides a comprehensive overview of the emerging field of event-based vision, with a focus on the applications and the algorithms developed to unlock the outstanding properties of event cameras. We present event cameras from their working principle, the actual sensors that are available and the tasks that they have been used for, from low-level vision (feature detection and tracking, optic flow, etc.) to high-level vision (reconstruction, segmentation, recognition). We also discuss the techniques developed to process events, including learning-based techniques, as well as specialized processors for these novel sensors, such as spiking neural networks.
Additionally, we highlight the challenges that remain to be tackled and the opportunities that lie ahead in the search for a more efficient, bio-inspired way for machines to perceive and interact with the world.

Submitted 8 August, 2020; v1 submitted 17 April, 2019; originally announced April 2019.

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020

arXiv:1903.07520 [pdf, other]

EV-IMO: Motion Segmentation Dataset and Learning Pipeline for Event Cameras

Authors: Anton Mitrokhin, Chengxi Ye, Cornelia Fermuller, Yiannis Aloimonos, Tobi Delbruck

Abstract: We present the first event-based learning approach for motion segmentation in indoor scenes and the first event-based dataset - EV-IMO - which includes accurate pixel-wise motion masks, egomotion and ground truth depth. Our approach is based on an efficient implementation of the SfM learning pipeline using a low parameter neural network architecture on event data. In addition to camera egomotion and a dense depth map, the network estimates pixel-wise independently moving object segmentation and computes per-object 3D translational velocities for moving objects. We also train a shallow network with just 40k parameters, which is able to compute depth and egomotion. Our EV-IMO dataset features 32 minutes of indoor recording with up to 3 fast moving objects simultaneously in the camera field of view. The objects and the camera are tracked by the VICON motion capture system. By 3D scanning the room and the objects, accurate depth map ground truth and pixel-wise object masks are obtained, which are reliable even in poor lighting conditions and during fast motion. We then train and evaluate our learning pipeline on EV-IMO and demonstrate that our approach far surpasses its rivals and is well suited for scene constrained robotics applications.

Submitted 12 January, 2020; v1 submitted 18 March, 2019; originally announced March 2019.

Comments: 8 pages, 6 figures. Submitted to 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019)

arXiv:1807.03128 [pdf]

PRED18: Dataset and Further Experiments with DAVIS Event Camera in Predator-Prey Robot Chasing

Authors: Diederik Paul Moeys, Daniel Neil, Federico Corradi, Emmett Kerr, Philip Vance, Gautham Das, Sonya A. Coleman, Thomas M. McGinnity, Dermot Kerr, Tobi Delbruck

Abstract: Machine vision systems using convolutional neural networks (CNNs) for robotic applications are increasingly being developed. Conventional vision CNNs are driven by camera frames at constant sample rate, thus achieving a fixed latency and power consumption tradeoff. This paper describes further work on the first experiments of a closed-loop robotic system integrating a CNN together with a Dynamic and Active Pixel Vision Sensor (DAVIS) in a predator/prey scenario. The DAVIS, mounted on the predator Summit XL robot, produces frames at a fixed 15 Hz frame-rate and Dynamic Vision Sensor (DVS) histograms containing 5k ON and OFF events at a variable frame-rate ranging from 15-500 Hz depending on the robot speeds. In contrast to conventional frame-based systems, the latency and processing cost depends on the rate of change of the image. The CNN is trained offline on the 1.25h labeled dataset to recognize the position and size of the prey robot, in the field of view of the predator. During inference, combining the ten output classes of the CNN allows extracting the analog position vector of the prey relative to the predator with a mean 8.7% error in angular estimation. The system is compatible with conventional deep learning technology, but achieves a variable latency-power tradeoff that adapts automatically to the dynamics. Finally, investigations on the robustness of the algorithm, a human performance comparison and a deconvolution analysis are also explored.

Submitted 2 July, 2018; originally announced July 2018.

Comments: 8 pages

Journal ref: IEEE EBCCSP 2018

arXiv:1805.03988 [pdf, other]

ABMOF: A Novel Optical Flow Algorithm for Dynamic Vision Sensors

Authors: Min Liu, Tobi Delbruck

Abstract: Dynamic Vision Sensors (DVS), which output asynchronous log intensity change events, have potential applications in high-speed robotics, autonomous cars and drones. The precise event timing, sparse output, and wide dynamic range of the events are well suited for optical flow, but conventional optical flow (OF) algorithms are not well matched to the event stream data. This paper proposes an event-driven OF algorithm called adaptive block-matching optical flow (ABMOF). ABMOF uses time slices of accumulated DVS events. The time slices are adaptively rotated based on the input events and OF results. Compared with other methods such as gradient-based OF, ABMOF can efficiently be implemented in compact logic circuits. Results show that ABMOF achieves comparable accuracy to conventional standards such as Lucas-Kanade (LK). The main contributions of our paper are new adaptive time-slice rotation methods that ensure the generated slices have sufficient features for matching, including a feedback mechanism that controls the generated slices to have average slice displacement within the block search range. An LK method using our adapted slices is also implemented. The ABMOF accuracy is compared with this LK method on natural scene data including sparse and dense texture, high dynamic range, and fast motion exceeding 30,000 pixels per second. The paper dataset and source code are available from http://sensors.ini.uzh.ch/databases.html.

Submitted 10 May, 2018; originally announced May 2018.

Comments: 11 pages, 10 figures, Video of result: https://youtu.be/Ss-MciioqTk

arXiv:1711.04713 [pdf, other]

ADaPTION: Toolbox and Benchmark for Training Convolutional Neural Networks with Reduced Numerical Precision Weights and Activation

Authors: Moritz B. Milde, Daniel Neil, Alessandro Aimar, Tobi Delbruck, Giacomo Indiveri

Abstract: Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) are useful for many practical tasks in machine learning. Synaptic weights, as well as neuron activation functions within the deep network are typically stored with high-precision formats, e.g. 32 bit floating point. However, since storage capacity is limited and each memory access consumes power, both storage capacity and memory access are two crucial factors in these networks. Here we present a method and present the ADaPTION toolbox to extend the popular deep learning library Caffe to support training of deep CNNs with reduced numerical precision of weights and activations using fixed point notation. ADaPTION includes tools to measure the dynamic range of weights and activations. Using the ADaPTION tools, we quantized several CNNs including VGG16 down to 16-bit weights and activations with only 0.8% drop in Top-1 accuracy. The quantization, especially of the activations, leads to increase of up to 50% of sparsity especially in early and intermediate layers, which we exploit to skip multiplications with zero, thus performing faster and computationally cheaper inference.

Submitted 13 November, 2017; originally announced November 2017.

Comments: 10 pages, 5 figures

arXiv:1711.01458 [pdf, other]

DDD17: End-To-End DAVIS Driving Dataset

Authors: Jonathan Binas, Daniel Neil, Shih-Chii Liu, Tobi Delbruck

Abstract: Event cameras, such as dynamic vision sensors (DVS), and dynamic and active-pixel vision sensors (DAVIS) can supplement other autonomous driving sensors by providing a concurrent stream of standard active pixel sensor (APS) images and DVS temporal contrast events. The APS stream is a sequence of standard grayscale global-shutter image sensor frames. The DVS events represent brightness changes occurring at a particular moment, with a jitter of about a millisecond under most lighting conditions. They have a dynamic range of >120 dB and effective frame rates >1 kHz at data rates comparable to 30 fps (frames/second) image sensors. To overcome some of the limitations of current image acquisition technology, we investigate in this work the use of the combined DVS and APS streams in end-to-end driving applications. The dataset DDD17 accompanying this paper is the first open dataset of annotated DAVIS driving recordings. DDD17 has over 12 h of a 346x260 pixel DAVIS sensor recording highway and city driving in daytime, evening, night, dry and wet weather conditions, along with vehicle speed, GPS position, driver steering, throttle, and brake captured from the car's on-board diagnostics interface. As an example application, we performed a preliminary end-to-end learning study of using a convolutional neural network that is trained to predict the instantaneous steering angle from DVS and APS visual data.

Submitted 4 November, 2017; originally announced November 2017.

Comments: Presented at the ICML 2017 Workshop on Machine Learning for Autonomous Vehicles

arXiv:1706.05415 [pdf, other]

Block-Matching Optical Flow for Dynamic Vision Sensor- Algorithm and FPGA Implementation

Authors: Min Liu, Tobi Delbruck

Abstract: Rapid and low power computation of optical flow (OF) is potentially useful in robotics. The dynamic vision sensor (DVS) event camera produces quick and sparse output, and has high dynamic range, but conventional OF algorithms are frame-based and cannot be directly used with event-based cameras. Previous DVS OF methods do not work well with dense textured input and are designed for implementation in logic circuits. This paper proposes a new block-matching based DVS OF algorithm which is inspired by motion estimation methods used for MPEG video compression. The algorithm was implemented both in software and on FPGA. For each event, it computes the motion direction as one of 9 directions. The speed of the motion is set by the sample interval. Results show that the Average Angular Error can be improved by 30% compared with previous methods. The OF can be calculated on FPGA with 50 MHz clock in 0.2 us per event (11 clock cycles), 20 times faster than a Java software implementation running on a desktop PC. Sample data is shown that the method works on scenes dominated by edges, sparse features, and dense texture.

Submitted 16 June, 2017; originally announced June 2017.

Comments: Published in ISCAS 2017

arXiv:1706.01406 [pdf, other]

doi 10.1109/TNNLS.2018.2852335

NullHop: A Flexible Convolutional Neural Network Accelerator Based on Sparse Representations of Feature Maps

Authors: Alessandro Aimar, Hesham Mostafa, Enrico Calabrese, Antonio Rios-Navarro, Ricardo Tapiador-Morales, Iulia-Alexandra Lungu, Moritz B. Milde, Federico Corradi, Alejandro Linares-Barranco, Shih-Chii Liu, Tobi Delbruck

Abstract: Convolutional neural networks (CNNs) have become the dominant neural network architecture for solving many state-of-the-art (SOA) visual processing tasks. Even though Graphical Processing Units (GPUs) are most often used in training and deploying CNNs, their power efficiency is less than 10 GOp/s/W for single-frame runtime inference. We propose a flexible and efficient CNN accelerator architecture called NullHop that implements SOA CNNs useful for low-power and low-latency application scenarios. NullHop exploits the sparsity of neuron activations in CNNs to accelerate the computation and reduce memory requirements. The flexible architecture allows high utilization of available computing resources across kernel sizes ranging from 1x1 to 7x7. NullHop can process up to 128 input and 128 output feature maps per layer in a single pass. We implemented the proposed architecture on a Xilinx Zynq FPGA platform and present results showing how our implementation reduces external memory transfers and compute time in five different CNNs ranging from small ones up to the widely known large VGG16 and VGG19 CNNs. Post-synthesis simulations using Mentor Modelsim in a 28nm process with a clock frequency of 500 MHz show that the VGG19 network achieves over 450 GOp/s. By exploiting sparsity, NullHop achieves an efficiency of 368%, maintains over 98% utilization of the MAC units, and achieves a power efficiency of over 3TOp/s/W in a core area of 6.3mm$^2$.
As further proof of NullHop's usability, we interfaced its FPGA implementation with a neuromorphic event camera for real time interactive demonstrations.

Submitted 6 March, 2018; v1 submitted 5 June, 2017; originally announced June 2017.

arXiv:1612.05571 [pdf, other]

Delta Networks for Optimized Recurrent Network Computation

Authors: Daniel Neil, Jun Haeng Lee, Tobi Delbruck, Shih-Chii Liu

Abstract: Many neural networks exhibit stability in their activation patterns over time in response to inputs from sensors operating under real-world conditions. By capitalizing on this property of natural signals, we propose a Recurrent Neural Network (RNN) architecture called a delta network in which each neuron transmits its value only when the change in its activation exceeds a threshold. The execution of RNNs as delta networks is attractive because their states must be stored and fetched at every timestep, unlike in convolutional neural networks (CNNs). We show that a naive run-time delta network implementation offers modest improvements on the number of memory accesses and computes, but optimized training techniques confer higher accuracy at higher speedup. With these optimizations, we demonstrate a 9x reduction in cost with negligible loss of accuracy for the TIDIGITS audio digit recognition benchmark. Similarly, on the large Wall Street Journal speech recognition benchmark even existing networks can be greatly accelerated as delta networks, and a 5.7x improvement with negligible loss of accuracy can be obtained through training. Finally, on an end-to-end CNN trained for steering angle prediction in a driving dataset, the RNN cost can be reduced by a substantial 100x.

Submitted 16 December, 2016; originally announced December 2016.
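
The thresholded-update idea in the abstract above can be illustrated with a minimal sketch. It covers only a single matrix-vector product, not a full RNN, and is not the authors' implementation; the function name `delta_step` and its parameters are hypothetical, chosen for illustration.

```python
def delta_step(W, x, x_ref, y, theta):
    """Update y (approximately W times x) using only inputs that changed enough.

    W: weight matrix as a list of rows; x: new input vector;
    x_ref: input values last transmitted; y: running output vector;
    theta: change threshold (illustrative parameter, an assumption).
    Returns (new y, new x_ref, number of inputs transmitted).
    """
    y = list(y)
    x_ref = list(x_ref)
    sent = 0
    for j, (xj, rj) in enumerate(zip(x, x_ref)):
        d = xj - rj
        if abs(d) > theta:
            # Only this column of W is fetched and multiplied.
            for i, row in enumerate(W):
                y[i] += row[j] * d
            x_ref[j] = xj  # remember the value that was transmitted
            sent += 1
    return y, x_ref, sent


W = [[1.0, 2.0], [3.0, 4.0]]
y, ref, sent = delta_step(W, [1.0, 0.05], [0.0, 0.0], [0.0, 0.0], theta=0.1)
y2, ref2, sent2 = delta_step(W, [1.0, 0.5], ref, y, theta=0.1)
```

In this sketch the output always equals the product of `W` with the last transmitted inputs, so sub-threshold changes are deferred rather than lost, and the work per step scales with the number of inputs that crossed the threshold.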

arXiv:1610.08336 [pdf, other]

doi 10.1177/0278364917691115

The Event-Camera Dataset and Simulator: Event-based Data for Pose Estimation, Visual Odometry, and SLAM

Authors: Elias Mueggler, Henri Rebecq, Guillermo Gallego, Tobi Delbruck, Davide Scaramuzza

Abstract: New vision sensors, such as the Dynamic and Active-pixel Vision sensor (DAVIS), incorporate a conventional global-shutter camera and an event-based sensor in the same pixel array. These sensors have great potential for high-speed robotics and computer vision because they allow us to combine the benefits of conventional cameras with those of event-based sensors: low latency, high temporal resolution, and very high dynamic range. However, new algorithms are required to exploit the sensor characteristics and cope with its unconventional output, which consists of a stream of asynchronous brightness changes (called "events") and synchronous grayscale frames. For this purpose, we present and release a collection of datasets captured with a DAVIS in a variety of synthetic and real environments, which we hope will motivate research on new algorithms for high-speed and high-dynamic-range robotics and computer-vision applications. In addition to global-shutter intensity images and asynchronous events, we provide inertial measurements and ground-truth camera poses from a motion-capture system. The latter allows comparing the pose accuracy of ego-motion estimation algorithms quantitatively. All the data are released both as standard text files and binary files (i.e., rosbag). This paper provides an overview of the available data and describes a simulator that we release open-source to create synthetic event-camera data.

Submitted 8 November, 2017; v1 submitted 26 October, 2016; originally announced October 2016.

Comments: 7 pages, 4 figures, 3 tables

Journal ref: International Journal of Robotics Research, Vol. 36, Issue 2, pp. 142-149, Feb. 2017

arXiv:1608.08782 [pdf, other]

Training Deep Spiking Neural Networks using Backpropagation

Authors: Jun Haeng Lee, Tobi Delbruck, Michael Pfeiffer

Abstract: Deep spiking neural networks (SNNs) hold great potential for improving the latency and energy efficiency of deep neural networks through event-based computation. However, training such networks is difficult due to the non-differentiable nature of asynchronous spike events. In this paper, we introduce a novel technique, which treats the membrane potentials of spiking neurons as differentiable signals, where discontinuities at spike times are only considered as noise. This enables an error backpropagation mechanism for deep SNNs, which works directly on spike signals and membrane potentials. Thus, compared with previous methods relying on indirect training and conversion, our technique has the potential to capture the statistics of spikes more precisely. Our novel framework outperforms all previously reported results for SNNs on the permutation invariant MNIST benchmark, as well as the N-MNIST benchmark recorded with event-based vision sensors.

Submitted 31 August, 2016; originally announced August 2016.

arXiv:1607.03468 [pdf, other]

doi 10.1109/TPAMI.2017.2769655

Event-based, 6-DOF Camera Tracking from Photometric Depth Maps

Authors: Guillermo Gallego, Jon E. A. Lund, Elias Mueggler, Henri Rebecq, Tobi Delbruck, Davide Scaramuzza

Abstract: Event cameras are bio-inspired vision sensors that output pixel-level brightness changes instead of standard intensity frames. These cameras do not suffer from motion blur and have a very high dynamic range, which enables them to provide reliable visual information during high-speed motions or in scenes characterized by high dynamic range. These features, along with a very low power consumption, make event cameras an ideal complement to standard cameras for VR/AR and video game applications. With these applications in mind, this paper tackles the problem of accurate, low-latency tracking of an event camera from an existing photometric depth map (i.e., intensity plus depth information) built via classic dense reconstruction pipelines. Our approach tracks the 6-DOF pose of the event camera upon the arrival of each event, thus virtually eliminating latency. We successfully evaluate the method in both indoor and outdoor scenes and show that, because of the technological advantages of the event camera, our pipeline works in scenes characterized by high-speed motion, which are still inaccessible to standard cameras.

Submitted 31 October, 2017; v1 submitted 12 July, 2016; originally announced July 2016.

Comments: 12 pages, 13 figures, 2 tables (in press)

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 40, No. 2, pp. 2402-2412, Oct. 2018

arXiv:1606.09433 [pdf]

Steering a Predator Robot using a Mixed Frame/Event-Driven Convolutional Neural Network

Authors: Diederik Paul Moeys, Federico Corradi, Emmett Kerr, Philip Vance, Gautham Das, Daniel Neil, Dermot Kerr, Tobi Delbruck

Abstract: This paper describes the application of a Convolutional Neural Network (CNN) in the context of a predator/prey scenario. The CNN is trained and run on data from a Dynamic and Active Pixel Sensor (DAVIS) mounted on a Summit XL robot (the predator), which follows another one (the prey). The CNN is driven by both conventional image frames and dynamic vision sensor "frames" that consist of a constant number of DAVIS ON and OFF events. The network is thus "data driven" at a sample rate proportional to the scene activity, so the effective sample rate varies from 15 Hz to 240 Hz depending on the robot speeds. The network generates four outputs: steer right, left, center and non-visible. After off-line training on labeled data, the network is imported onto the on-board Summit XL robot, which runs jAER and receives steering directions in real time. Successful results on closed-loop trials, with accuracies up to 87% or 92% (depending on evaluation criteria), are reported. Although the proposed approach discards the precise DAVIS event timing, it offers the significant advantage of compatibility with conventional deep learning technology without giving up the advantage of data-driven computing.

Submitted 30 June, 2016; originally announced June 2016.

Comments: Paper presented at the Second International Conference on Event-Based Control, Communication and Signal Processing (EBCCSP) 2016, Krakow, Poland
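
The constant-event-count "frames" described in the abstract above can be sketched as follows. This is an illustrative assumption about how such frames might be accumulated, not the paper's or jAER's code: the function name `frames_from_events`, the `(x, y, on)` event tuple layout, and the signed accumulation (ON adds 1, OFF subtracts 1) are all hypothetical.

```python
def frames_from_events(events, width, height, events_per_frame):
    """Yield 2D event histograms, each built from a fixed number of events.

    events: iterable of (x, y, on) tuples, with on=True for ON events
    (assumed layout). A frame is emitted every events_per_frame events,
    so the frame rate follows scene activity rather than a fixed clock.
    Trailing events that do not fill a frame are dropped.
    """
    frame = [[0] * width for _ in range(height)]
    count = 0
    for x, y, on in events:
        frame[y][x] += 1 if on else -1  # signed accumulation (assumption)
        count += 1
        if count == events_per_frame:
            yield frame
            frame = [[0] * width for _ in range(height)]
            count = 0


events = [(0, 0, True), (1, 0, False), (1, 0, True), (0, 1, True), (0, 0, True)]
frames = list(frames_from_events(events, width=2, height=2, events_per_frame=2))
```

Because a frame closes after a fixed event count, a busy scene produces frames quickly and a quiet scene slowly, which is the activity-proportional sample rate the abstract refers to.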

Showing 1–39 of 39 results for author: Delbruck, T