Growing Efficient Accurate and Robust Neural Networks on the Edge

Authors: Vignesh Sundaresha, Naresh Shanbhag

Abstract: The ubiquitous deployment of deep learning systems on resource-constrained Edge devices is hindered by their high computational complexity coupled with their fragility to out-of-distribution (OOD) data, especially to naturally occurring common corruptions. Current solutions rely on the Cloud to train and compress models before deploying to the Edge. This incurs high energy and latency costs in transmitting locally acquired field data to the Cloud while also raising privacy concerns. We propose GEARnn (Growing Efficient, Accurate, and Robust neural networks) to grow and train robust networks in-situ, i.e., completely on the Edge device. Starting with a low-complexity initial backbone network, GEARnn employs One-Shot Growth (OSG) to grow a network satisfying the memory constraints of the Edge device using clean data, and robustifies the network using Efficient Robust Augmentation (ERA) to obtain the final network. We demonstrate results on an NVIDIA Jetson Xavier NX, and analyze the trade-offs between accuracy, robustness, model size, energy consumption, and training time. Our results demonstrate the construction of efficient, accurate, and robust networks entirely on an Edge device.

Submitted 10 October, 2024; originally announced October 2024.

Comments: 10 pages

arXiv:2302.01375 [pdf, other]

On the Robustness of Randomized Ensembles to Adversarial Perturbations

Authors: Hassan Dbouk, Naresh R. Shanbhag

Abstract: Randomized ensemble classifiers (RECs), where one classifier is randomly selected during inference, have emerged as an attractive alternative to traditional ensembling methods for realizing adversarially robust classifiers with limited compute requirements. However, recent works have shown that existing methods for constructing RECs are more vulnerable than initially claimed, casting major doubts on their efficacy and prompting fundamental questions such as: "When are RECs useful?", "What are their limits?", and "How do we train them?". In this work, we first demystify RECs as we derive fundamental results regarding their theoretical limits, necessary and sufficient conditions for them to be useful, and more. Leveraging this new understanding, we propose a new boosting algorithm (BARRE) for training robust RECs, and empirically demonstrate its effectiveness at defending against strong $\ell_\infty$ norm-bounded adversaries across various network architectures and datasets. Our code can be found at https://github.com/hsndbk4/BARRE.

Submitted 28 May, 2023; v1 submitted 2 February, 2023; originally announced February 2023.

Comments: Published as a conference paper in ICML 2023
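
The inference rule described in the abstract above (one classifier drawn at random per query) can be illustrated with a minimal sketch. It is not taken from the BARRE repository: the names `rec_predict`, `classifiers`, and `probs` are hypothetical, and the classifiers are assumed to be plain Python callables that return a class label.

```python
import random
from typing import Callable, List, Sequence

# A classifier is assumed to map a feature vector to an integer class label.
Classifier = Callable[[Sequence[float]], int]


def rec_predict(
    classifiers: List[Classifier],
    probs: Sequence[float],
    x: Sequence[float],
    rng: random.Random,
) -> int:
    """Draw one classifier according to `probs` and return its prediction for x."""
    if len(classifiers) != len(probs):
        raise ValueError("need exactly one sampling probability per classifier")
    # random.Random.choices samples with replacement using the given weights.
    (chosen,) = rng.choices(classifiers, weights=probs, k=1)
    return chosen(x)


# Two toy classifiers that ignore their input (illustration only).
always_zero: Classifier = lambda x: 0
always_one: Classifier = lambda x: 1
```

Because only the sampled member is evaluated, the per-query cost is that of a single classifier, which is the "limited compute" property the abstract refers to.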

arXiv:2210.08974 [pdf]

Coordinated Science Laboratory 70th Anniversary Symposium: The Future of Computing

Authors: Klara Nahrstedt, Naresh Shanbhag, Vikram Adve, Nancy Amato, Romit Roy Choudhury, Carl Gunter, Nam Sung Kim, Olgica Milenkovic, Sayan Mitra, Lav Varshney, Yurii Vlasov, Sarita Adve, Rashid Bashir, Andreas Cangellaris, James DiCarlo, Katie Driggs-Campbell, Nick Feamster, Mattia Gazzola, Karrie Karahalios, Sanmi Koyejo, Paul Kwiat, Bo Li, Negar Mehr, Ravish Mehra, Andrew Miller, et al. (3 additional authors not shown)

Abstract: In 2021, the Coordinated Science Laboratory CSL, an Interdisciplinary Research Unit at the University of Illinois Urbana-Champaign, hosted the Future of Computing Symposium to celebrate its 70th anniversary. CSL's research covers the full computing stack, computing's impact on society and the resulting need for social responsibility. In this white paper, we summarize the major technological points, insights, and directions that speakers brought forward during the Future of Computing Symposium. Participants discussed topics related to new computing paradigms, technologies, algorithms, behaviors, and research challenges to be expected in the future. The symposium focused on new computing paradigms that are going beyond traditional computing and the research needed to support their realization. These needs included stressing security and privacy, the end to end human cyber physical systems and with them the analysis of the end to end artificial intelligence needs. Furthermore, advances that enable immersive environments for users, the boundaries between humans and machines will blur and become seamless. Particular integration challenges were made clear in the final discussion on the integration of autonomous driving, robo taxis, pedestrians, and future cities. Innovative approaches were outlined to motivate the next generation of researchers to work on these challenges. The discussion brought out the importance of considering not just individual research areas, but innovations at the intersections between computing research efforts and relevant application domains, such as health care, transportation, energy systems, and manufacturing.

Submitted 4 October, 2022; originally announced October 2022.

arXiv:2206.06737 [pdf, other]

Adversarial Vulnerability of Randomized Ensembles

Authors: Hassan Dbouk, Naresh R. Shanbhag

Abstract: Despite the tremendous success of deep neural networks across various tasks, their vulnerability to imperceptible adversarial perturbations has hindered their deployment in the real world. Recently, works on randomized ensembles have empirically demonstrated significant improvements in adversarial robustness over standard adversarially trained (AT) models with minimal computational overhead, making them a promising solution for safety-critical resource-constrained applications. However, this impressive performance raises the question: Are these robustness gains provided by randomized ensembles real? In this work we address this question both theoretically and empirically. We first establish theoretically that commonly employed robustness evaluation methods such as adaptive PGD provide a false sense of security in this setting. Subsequently, we propose a theoretically-sound and efficient adversarial attack algorithm (ARC) capable of compromising random ensembles even in cases where adaptive PGD fails to do so. We conduct comprehensive experiments across a variety of network architectures, training schemes, datasets, and norms to support our claims, and empirically establish that randomized ensembles are in fact more vulnerable to $\ell_p$-bounded adversarial perturbations than even standard AT models. Our code can be found at https://github.com/hsndbk4/ARC.

Submitted 14 June, 2022; originally announced June 2022.

Comments: Published as a conference paper in ICML 2022

arXiv:2110.14871 [pdf, other]

Generalized Depthwise-Separable Convolutions for Adversarially Robust and Efficient Neural Networks

Authors: Hassan Dbouk, Naresh R. Shanbhag

Abstract: Despite their tremendous successes, convolutional neural networks (CNNs) incur high computational/storage costs and are vulnerable to adversarial perturbations. Recent works on robust model compression address these challenges by combining model compression techniques with adversarial training. But these methods are unable to improve throughput (frames-per-second) on real-life hardware while simultaneously preserving robustness to adversarial perturbations. To overcome this problem, we propose the method of Generalized Depthwise-Separable (GDWS) convolution -- an efficient, universal, post-training approximation of a standard 2D convolution. GDWS dramatically improves the throughput of a standard pre-trained network on real-life hardware while preserving its robustness. Lastly, GDWS is scalable to large problem sizes since it operates on pre-trained models and doesn't require any additional training. We establish the optimality of GDWS as a 2D convolution approximator and present exact algorithms for constructing optimal GDWS convolutions under complexity and error constraints. We demonstrate the effectiveness of GDWS via extensive experiments on CIFAR-10, SVHN, and ImageNet datasets. Our code can be found at https://github.com/hsndbk4/GDWS.

Submitted 6 November, 2021; v1 submitted 27 October, 2021; originally announced October 2021.

Comments: NeurIPS 2021 (Spotlight)
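
For context on why depthwise-separable structure helps throughput, the sketch below compares multiply-accumulate (MAC) counts for a standard 2D convolution and an ordinary depthwise-separable one. This is the textbook depthwise-plus-pointwise factorization, not the GDWS construction from the paper; the function names and the example layer dimensions are assumptions chosen for illustration, and stride 1 with "same" padding is assumed.

```python
def standard_conv_macs(h: int, w: int, c_in: int, c_out: int, k: int) -> int:
    """MACs for a k x k standard convolution producing an h x w x c_out output."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h: int, w: int, c_in: int, c_out: int, k: int) -> int:
    """MACs for a k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 32 x 32 output, 64 input channels, 128 output channels, 3 x 3 kernel.
std = standard_conv_macs(32, 32, 64, 128, 3)
dws = depthwise_separable_macs(32, 32, 64, 128, 3)
print(std, dws, round(std / dws, 2))
```

For this hypothetical layer the factorized form needs roughly 8.4 times fewer MACs. The MAC ratio is only a proxy: measured throughput on real hardware, which is what the abstract emphasizes, also depends on memory traffic and kernel implementations.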

arXiv:2105.14710 [pdf, other]

Robustifying $\ell_\infty$ Adversarial Training to the Union of Perturbation Models

Authors: Ameya D. Patil, Michael Tuttle, Alexander G. Schwing, Naresh R. Shanbhag

Abstract: Classical adversarial training (AT) frameworks are designed to achieve high adversarial accuracy against a single attack type, typically $\ell_\infty$ norm-bounded perturbations. Recent extensions in AT have focused on defending against the union of multiple perturbations but this benefit is obtained at the expense of a significant (up to $10\times$) increase in training complexity over single-attack $\ell_\infty$ AT. In this work, we expand the capabilities of widely popular single-attack $\ell_\infty$ AT frameworks to provide robustness to the union of ($\ell_\infty, \ell_2, \ell_1$) perturbations while preserving their training efficiency. Our technique, referred to as Shaped Noise Augmented Processing (SNAP), exploits a well-established byproduct of single-attack AT frameworks -- the reduction in the curvature of the decision boundary of networks. SNAP prepends a given deep net with a shaped noise augmentation layer whose distribution is learned along with network parameters using any standard single-attack AT. As a result, SNAP enhances adversarial accuracy of ResNet-18 on CIFAR-10 against the union of ($\ell_\infty, \ell_2, \ell_1$) perturbations by 14%-to-20% for four state-of-the-art (SOTA) single-attack $\ell_\infty$ AT frameworks, and, for the first time, establishes a benchmark for ResNet-50 and ResNet-101 on ImageNet.

Submitted 11 June, 2021; v1 submitted 31 May, 2021; originally announced May 2021.

arXiv:2012.13645 [pdf, other]

Fundamental Limits on Energy-Delay-Accuracy of In-memory Architectures in Inference Applications

Authors: Sujan Kumar Gonugondla, Charbel Sakr, Hassan Dbouk, Naresh R. Shanbhag

Abstract: This paper obtains fundamental limits on the computational precision of in-memory computing architectures (IMCs). An IMC noise model and associated SNR metrics are defined and their interrelationships analyzed to show that the accuracy of IMCs is fundamentally limited by the compute SNR ($\text{SNR}_{\text{a}}$) of its analog core, and that activation, weight and output precision needs to be assigned appropriately for the final output SNR $\text{SNR}_{\text{T}} \rightarrow \text{SNR}_{\text{a}}$. The minimum precision criterion (MPC) is proposed to minimize the ADC precision. Three in-memory compute models - charge summing (QS), current summing (IS) and charge redistribution (QR) - are shown to underlie most known IMCs. Noise, energy and delay expressions for the compute models are developed and employed to derive expressions for the SNR, ADC precision, energy, and latency of IMCs. The compute SNR expressions are validated via Monte Carlo simulations in a 65 nm CMOS process. For a 512 row SRAM array, it is shown that: 1) IMCs have an upper bound on their maximum achievable $\text{SNR}_{\text{a}}$ due to constraints on energy, area and voltage swing, and this upper bound reduces with technology scaling for QS-based architectures; 2) MPC enables $\text{SNR}_{\text{T}} \rightarrow \text{SNR}_{\text{a}}$ to be realized with minimal ADC precision; 3) QS-based (QR-based) architectures are preferred for low (high) compute SNR scenarios.

Submitted 25 December, 2020; originally announced December 2020.

Comments: 14 pages, 13 figures

arXiv:2007.09818 [pdf, other]

DBQ: A Differentiable Branch Quantizer for Lightweight Deep Neural Networks

Authors: Hassan Dbouk, Hetul Sanghvi, Mahesh Mehendale, Naresh Shanbhag

Abstract: Deep neural networks have achieved state-of-the-art performance on various computer vision tasks. However, their deployment on resource-constrained devices has been hindered due to their high computational and storage complexity. While various complexity reduction techniques, such as lightweight network architecture design and parameter quantization, have been successful in reducing the cost of implementing these networks, these methods have often been considered orthogonal. In reality, existing quantization techniques fail to replicate their success on lightweight architectures such as MobileNet. To this end, we present a novel fully differentiable non-uniform quantizer that can be seamlessly mapped onto efficient ternary-based dot product engines. We conduct comprehensive experiments on CIFAR-10, ImageNet, and Visual Wake Words datasets. The proposed quantizer (DBQ) successfully tackles the daunting task of aggressively quantizing lightweight networks such as MobileNetV1, MobileNetV2, and ShuffleNetV2. DBQ achieves state-of-the-art results with minimal training overhead and provides the best (Pareto-optimal) accuracy-complexity trade-off.

Submitted 19 July, 2020; originally announced July 2020.

Comments: Published as a conference paper in ECCV 2020

arXiv:2005.02434 [pdf]

Nanotechnology-inspired Information Processing Systems of the Future

Authors: Randy Bryant, Mark Hill, Tom Kazior, Daniel Lee, Jie Liu, Klara Nahrstedt, Vijay Narayanan, Jan Rabaey, Hava Siegelmann, Naresh Shanbhag, Naveen Verma, H. -S. Philip Wong

Abstract: Nanoscale semiconductor technology has been a key enabler of the computing revolution. It has done so via advances in new materials and manufacturing processes that resulted in the size of the basic building block of computing systems - the logic switch and memory devices - being reduced into the nanoscale regime. Nanotechnology has provided increased computing functionality per unit volume, energy, and cost. In order for computing systems to continue to deliver substantial benefits for the foreseeable future to society at large, it is critical that the very notion of computing be examined in the light of nanoscale realities. In particular, one needs to ask what it means to compute when the very building block - the logic switch - no longer exhibits the level of determinism required by the von Neumann architecture. There needs to be a sustained and heavy investment in a nation-wide Vertically Integrated Semiconductor Ecosystem (VISE). VISE is a program in which research and development is conducted seamlessly across the entire compute stack - from applications, systems and algorithms, architectures, circuits and nanodevices, and materials. A nation-wide VISE provides clear strategic advantages in ensuring the US's global superiority in semiconductors. First, a VISE provides the highest quality seed-corn for nurturing transformative ideas that are critically needed today in order for nanotechnology-inspired computing to flourish. It does so by dramatically opening up new areas of semiconductor research that are inspired and driven by new application needs. Second, a VISE creates a very high barrier to entry from foreign competitors because it is extremely hard to establish, and even harder to duplicate.

Submitted 5 May, 2020; originally announced May 2020.

Comments: A Computing Community Consortium (CCC) workshop report, 18 pages

Report number: ccc2016report_3

arXiv:2002.09786 [pdf, other]

HarDNN: Feature Map Vulnerability Evaluation in CNNs

Authors: Abdulrahman Mahmoud, Siva Kumar Sastry Hari, Christopher W. Fletcher, Sarita V. Adve, Charbel Sakr, Naresh Shanbhag, Pavlo Molchanov, Michael B. Sullivan, Timothy Tsai, Stephen W. Keckler

Abstract: As Convolutional Neural Networks (CNNs) are increasingly being employed in safety-critical applications, it is important that they behave reliably in the face of hardware errors. Transient hardware errors may percolate undesirable state during execution, resulting in software-manifested errors which can adversely affect high-level decision making. This paper presents HarDNN, a software-directed approach to identify vulnerable computations during a CNN inference and selectively protect them based on their propensity towards corrupting the inference output in the presence of a hardware error. We show that HarDNN can accurately estimate relative vulnerability of a feature map (fmap) in CNNs using a statistical error injection campaign, and explore heuristics for fast vulnerability assessment. Based on these results, we analyze the tradeoff between error coverage and computational overhead that the system designers can use to employ selective protection. Results show that the improvement in resilience for the added computation is superlinear with HarDNN. For example, HarDNN improves SqueezeNet's resilience by 10x with just 30% additional computations.

Submitted 25 February, 2020; v1 submitted 22 February, 2020; originally announced February 2020.

Comments: 14 pages, 5 figures, a short version accepted for publication in First Workshop on Secure and Resilient Autonomy (SARA) co-located with MLSys2020

arXiv:1901.06588 [pdf, other]

Accumulation Bit-Width Scaling For Ultra-Low Precision Training Of Deep Networks

Authors: Charbel Sakr, Naigang Wang, Chia-Yu Chen, Jungwook Choi, Ankur Agrawal, Naresh Shanbhag, Kailash Gopalakrishnan

Abstract: Efforts to reduce the numerical precision of computations in deep learning training have yielded systems that aggressively quantize weights and activations, yet employ wide high-precision accumulators for partial sums in inner-product operations to preserve the quality of convergence. The absence of any framework to analyze the precision requirements of partial sum accumulations results in conservative design choices. This imposes an upper-bound on the reduction of complexity of multiply-accumulate units. We present a statistical approach to analyze the impact of reduced accumulation precision on deep learning training. Observing that a bad choice for accumulation precision results in loss of information that manifests itself as a reduction in variance in an ensemble of partial sums, we derive a set of equations that relate this variance to the length of accumulation and the minimum number of bits needed for accumulation. We apply our analysis to three benchmark networks: CIFAR-10 ResNet 32, ImageNet ResNet 18 and ImageNet AlexNet. In each case, with accumulation precision set in accordance with our proposed equations, the networks successfully converge to the single precision floating-point baseline. We also show that reducing accumulation precision further degrades the quality of the trained network, proving that our equations produce tight bounds. Overall this analysis enables precise tailoring of computation hardware to the application, yielding area- and power-optimal systems.

Submitted 19 January, 2019; originally announced January 2019.

Comments: Published as a conference paper in ICLR 2019

arXiv:1812.11732 [pdf, other]

Per-Tensor Fixed-Point Quantization of the Back-Propagation Algorithm

Authors: Charbel Sakr, Naresh Shanbhag

Abstract: The high computational and parameter complexity of neural networks makes their training very slow and difficult to deploy on energy and storage-constrained computing systems. Many network complexity reduction techniques have been proposed including fixed-point implementation. However, a systematic approach for designing full fixed-point training and inference of deep neural networks remains elusive. We describe a precision assignment methodology for neural network training in which all network parameters, i.e., activations and weights in the feedforward path, gradients and weight accumulators in the feedback path, are assigned close to minimal precision. The precision assignment is derived analytically and enables tracking the convergence behavior of the full precision training, known to converge a priori. Thus, our work leads to a systematic methodology of determining suitable precision for fixed-point training. The near optimality (minimality) of the resulting precision assignment is validated empirically for four networks on the CIFAR-10, CIFAR-100, and SVHN datasets. The complexity reduction arising from our approach is compared with other fixed-point neural network designs.

Submitted 31 December, 2018; originally announced December 2018.

Comments: Published as a conference paper in ICLR 2019

arXiv:1710.07153 [pdf, other]

doi 10.1109/TCOMM.2018.2841406

Generalized Water-filling for Source-aware Energy-efficient SRAMs

Authors: Yongjune Kim, Mingu Kang, Lav R. Varshney, Naresh R. Shanbhag

Abstract: Conventional low-power static random access memories (SRAMs) reduce read energy by decreasing the bit-line voltage swings uniformly across the bit-line columns. This is because the read energy is proportional to the bit-line swings. On the other hand, bit-line swings are limited by the need to avoid decision errors especially in the most significant bits. We propose an information-theoretic approach to determine optimal non-uniform bit-line swings by formulating convex optimization problems. For a given constraint on mean squared error of retrieved words, we consider criteria to minimize energy (for low-power SRAMs), maximize speed (for high-speed SRAMs), and minimize energy-delay product. These optimization problems can be interpreted as classical water-filling, ground-flattening and water-filling, and sand-pouring and water-filling, respectively. By leveraging these interpretations, we also propose greedy algorithms to obtain optimized discrete swings. Numerical results show that energy-optimal swing assignment reduces energy consumption by half at a peak signal-to-noise ratio of 30dB for an 8-bit accessed word. The energy savings increase to four times for a 16-bit accessed word.

Submitted 29 November, 2017; v1 submitted 19 October, 2017; originally announced October 2017.
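
The "classical water-filling" interpretation mentioned in the abstract above can be made concrete with the textbook form of the algorithm: a fixed budget is poured over channels with different floor levels until a common water level is reached. The sketch below is that generic form, not the paper's bit-line swing formulation; `water_filling`, `floors`, and `budget` are hypothetical names, and the budget is assumed to be non-negative.

```python
from typing import List, Sequence


def water_filling(floors: Sequence[float], budget: float) -> List[float]:
    """Allocate `budget` so that floors[i] + alloc[i] equals a common level wherever alloc[i] > 0."""
    ordered = sorted(floors)
    level = 0.0
    # Try filling the k lowest floors, starting from all of them, and shrink k
    # until the resulting water level sits above the highest floor being filled.
    for k in range(len(ordered), 0, -1):
        level = (budget + sum(ordered[:k])) / k
        if level > ordered[k - 1]:
            break
    # Floors above the water level receive nothing.
    return [max(0.0, level - f) for f in floors]


print(water_filling([1.0, 2.0, 5.0], 3.0))
```

With floors 1, 2, and 5 and a budget of 3, the water level settles at 3, so the first two entries receive 2 and 1 and the third receives nothing.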

arXiv:1702.06119 [pdf]

Shannon-inspired Statistical Computing to Enable Spintronics

Authors: Ameya D. Patil, Sasikanth Manipatruni, Dmitri Nikonov, Ian A. Young, Naresh R. Shanbhag

Abstract: Modern computing systems based on the von Neumann architecture are built from silicon complementary metal oxide semiconductor (CMOS) transistors that need to operate under practically error free conditions with 1 error in $10^{15}$ switching events. The physical dimensions of CMOS transistors have scaled down over the past five decades leading to exponential increases in functional density and energy consumption. Today, the energy and delay reductions from scaling have stagnated, motivating the search for a CMOS replacement. Of these, spintronics offers a path for enhancing the functional density and scaling the energy down to fundamental thermodynamic limits of 100kT to 1000kT. However, spintronic devices exhibit high error rates of 1 in 10 or more when operating at these limits, rendering them incompatible with the deterministic nature of the von Neumann architecture. We show that a Shannon-inspired statistical computing framework can be leveraged to design a computer made from such stochastic spintronic logic gates to provide a computational accuracy close to that of a deterministic computer. This extraordinary result allowing a $10^{13}$ fold relaxation in acceptable error rates is obtained by engineering the error distribution coupled with statistical error compensation.

Submitted 19 February, 2017; originally announced February 2017.

arXiv:1611.03109 [pdf, other]

Energy-efficient Machine Learning in Silicon: A Communications-inspired Approach

Authors: Naresh R. Shanbhag

Abstract: This position paper advocates a communications-inspired approach to the design of machine learning systems on energy-constrained embedded 'always-on' platforms. The communications-inspired approach has two versions - 1) a deterministic version where existing low-power communication IC design methods are repurposed, and 2) a stochastic version referred to as Shannon-inspired statistical information processing employing information-based metrics, statistical error compensation (SEC), and retraining-based methods to implement ML systems on stochastic circuit/device fabrics operating at the limits of energy-efficiency. The communications-inspired approach has the potential to fully leverage the opportunities afforded by ML algorithms and applications in order to address the challenges inherent in their deployment on energy-constrained platforms.

Submitted 25 October, 2016; originally announced November 2016.

Comments: This paper was presented at the 2016 ICML Workshop on On-Device Intelligence, June 24, 2016

arXiv:1610.07501 [pdf, other]

A 481pJ/decision 3.4M decision/s Multifunctional Deep In-memory Inference Processor using Standard 6T SRAM Array

Authors: Mingu Kang, Sujan Gonugondla, Ameya Patil, Naresh Shanbhag

Abstract: This paper describes a multi-functional deep in-memory processor for inference applications. Deep in-memory processing is achieved by embedding pitch-matched low-SNR analog processing into a standard 6T 16KB SRAM array in 65 nm CMOS. Four applications are demonstrated. The prototype achieves up to 5.6X (9.7X estimated for multi-bank scenario) energy savings with negligible (<1%) accuracy degradation in all four applications as compared to the conventional architecture.

Submitted 24 October, 2016; originally announced October 2016.

arXiv:1607.07804 [pdf, other]

Error-Resilient Machine Learning in Near Threshold Voltage via Classifier Ensemble

Authors: Sai Zhang, Naresh Shanbhag

Abstract: In this paper, we present the design of error-resilient machine learning architectures by employing a distributed machine learning framework referred to as classifier ensemble (CE). CE combines several simple classifiers to obtain a strong one. In contrast, centralized machine learning employs a single complex block. We compare the random forest (RF) and the support vector machine (SVM), which are representative techniques from the CE and centralized frameworks, respectively. Employing the dataset from the UCI machine learning repository and architectural-level error models in a commercial 45 nm CMOS process, it is demonstrated that RF-based architectures are significantly more robust than SVM architectures in the presence of timing errors due to process variations in near-threshold voltage (NTV) regions (0.3 V - 0.7 V). In particular, the RF architecture exhibits a detection accuracy (P_{det}) that varies by 3.2% while maintaining a median P_{det} > 0.9 at a gate-level delay variation of 28.9%. In comparison, SVM exhibits a P_{det} that varies by 16.8%. Additionally, we propose an error weighted voting technique that incorporates the timing error statistics of the NTV circuit fabric to further enhance robustness. Simulation results confirm that the error weighted voting achieves a P_{det} that varies by only 1.4%, which is 12X lower compared to SVM.
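The abstract above does not spell out how the error weighted voting is computed. As a rough illustration only, the sketch below uses the textbook log-odds weighting for independent binary voters, which is an assumption and not necessarily the scheme used in the paper. The function names (`log_odds_weight`, `error_weighted_vote`) and the example error rates are hypothetical.

```python
import math

def log_odds_weight(error_rate, eps=1e-6):
    """Weight a voter by the log-odds of it being correct (assumed scheme)."""
    p = min(max(error_rate, eps), 1.0 - eps)  # clamp away from 0 and 1
    return math.log((1.0 - p) / p)

def error_weighted_vote(votes, error_rates):
    """Combine binary votes (+1 / -1) using per-classifier error rates.

    Returns +1 or -1; a zero score resolves to +1.
    """
    score = sum(log_odds_weight(e) * v for v, e in zip(votes, error_rates))
    return 1 if score >= 0 else -1

# Hypothetical example: two error-prone classifiers vote -1,
# one reliable classifier votes +1.
votes = [-1, -1, +1]
error_rates = [0.40, 0.40, 0.05]

plain_majority = 1 if sum(votes) >= 0 else -1
weighted = error_weighted_vote(votes, error_rates)
```

In this example the unweighted majority follows the two unreliable voters, while the weighted vote follows the single reliable one. With equal error rates the weighting reduces to a plain majority vote.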

Submitted 3 July, 2016; originally announced July 2016.

arXiv:1607.00669 [pdf, other]

Understanding the Energy and Precision Requirements for Online Learning

Authors: Charbel Sakr, Ameya Patil, Sai Zhang, Yongjune Kim, Naresh Shanbhag

Abstract: It is well-known that the precision of data, hyperparameters, and internal representations employed in learning systems directly impacts their energy, throughput, and latency. The precision requirements for the training algorithm are also important for systems that learn on-the-fly. Prior work has shown that the data and hyperparameters can be quantized heavily without incurring much penalty in classification accuracy when compared to floating point implementations. These works suffer from two key limitations. First, they assume uniform precision for the classifier and for the training algorithm and thus miss out on the opportunity to further reduce precision. Second, prior works are empirical studies. In this article, we overcome both these limitations by deriving analytical lower bounds on the precision requirements of the commonly employed stochastic gradient descent (SGD) on-line learning algorithm in the specific context of a support vector machine (SVM). Lower bounds on the data precision are derived in terms of the desired classification accuracy and precision of the hyperparameters used in the classifier. Additionally, lower bounds on the hyperparameter precision in the SGD training algorithm are obtained. These bounds are validated using both synthetic data and the UCI breast cancer dataset. Additionally, the impact of these precisions on the energy consumption of a fixed-point SVM with on-line training is studied.
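To make the distinction between data precision and training precision concrete, here is a minimal sketch of a linear SVM trained by SGD with separately quantized inputs and weights. It does not reproduce the analytical bounds from the paper. The fixed-point format, the function names (`quantize`, `train_svm_sgd`), the toy dataset, and all parameter values are assumptions chosen for illustration.

```python
def quantize(x, bits):
    """Round x to a signed fixed-point grid with `bits` bits covering [-1, 1)."""
    step = 2.0 ** -(bits - 1)
    q = round(x / step) * step
    return min(max(q, -1.0), 1.0 - step)

def train_svm_sgd(data, labels, data_bits, weight_bits,
                  lr=0.0625, lam=0.01, epochs=20):
    """SGD on the L2-regularized hinge loss with quantized data and weights."""
    w = [0.0] * len(data[0])
    for _ in range(epochs):
        for x, y in zip(data, labels):
            xq = [quantize(v, data_bits) for v in x]
            margin = y * sum(wi * xi for wi, xi in zip(w, xq))
            for i in range(len(w)):
                # Subgradient: regularizer plus hinge term when the margin is violated.
                grad = lam * w[i] - (y * xq[i] if margin < 1 else 0.0)
                # The update itself is stored at weight precision.
                w[i] = quantize(w[i] - lr * grad, weight_bits)
    return w

# Hypothetical, linearly separable toy data.
data = [[0.5, 0.25], [0.25, 0.5], [-0.5, -0.25], [-0.25, -0.5]]
labels = [1, 1, -1, -1]

w_fine = train_svm_sgd(data, labels, data_bits=4, weight_bits=8)
w_coarse = train_svm_sgd(data, labels, data_bits=4, weight_bits=3)
```

With 8-bit weights the updates survive rounding and the classifier separates the toy data. With 3-bit weights every update is smaller than half a quantization step and is rounded away, so the weights never leave zero. This is one simple way to see why the training algorithm can need more precision than the data.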

Submitted 26 August, 2016; v1 submitted 3 July, 2016; originally announced July 2016.

Comments: 14 pages, 5 figures, 4 of which have 2 subfigures

arXiv:1607.00667 [pdf, other]

Reducing the Energy Cost of Inference via In-sensor Information Processing

Authors: Sai Zhang, Mingu Kang, Charbel Sakr, Naresh Shanbhag

Abstract: There is much interest in incorporating inference capabilities into sensor-rich embedded platforms such as autonomous vehicles, wearables, and others. A central problem in the design of such systems is the need to extract information locally from sensed data on a severely limited energy budget. This necessitates the design of energy-efficient sensory embedded systems. A typical sensory embedded system enforces a physical separation between sensing and computational subsystems - a separation mandated by the differing requirements of the sensing and computational functions. As a consequence, the energy consumption in such systems tends to be dominated by the energy consumed in transferring data over the sensor-processor interface (communication energy) and the energy consumed in processing the data in the digital processor (computational energy). In this article, we propose an in-sensor computing architecture which (mostly) eliminates the sensor-processor interface by embedding inference computations in the noisy sensor fabric in analog and retraining the hyperparameters in order to compensate for non-ideal computations. The resulting architecture referred to as the Compute Sensor - a sensor that computes in addition to sensing - represents a radical departure from the conventional. We show that a Compute Sensor for image data can be designed by embedding both feature extraction and classification functions in the analog domain in close proximity to the CMOS active pixel sensor (APS) array. Significant gains in energy-efficiency are demonstrated using behavioral and energy models in a commercial semiconductor process technology. In the process, the Compute Sensor creates a unique opportunity to develop machine learning algorithms for information extraction from data on a noisy underlying computational fabric.

Submitted 3 July, 2016; originally announced July 2016.

arXiv:1109.5600 [pdf, ps, other]

Some new approaches to infinite divisibility

Authors: Theofanis Sapatinas, Damodar N. Shanbhag, Arjun K. Gupta

Abstract: Using an approach based, amongst other things, on Proposition 1 of Kaluza (1928), Goldie (1967) and, using a different approach based especially on zeros of polynomials, Steutel (1967) have proved that each nondegenerate distribution function (d.f.) $F$ (on $\RR$, the real line), satisfying $F(0-) = 0$ and $F(x) = F(0) + (1-F(0)) G(x)$, $x > 0$, where $G$ is the d.f. corresponding to a mixture of exponential distributions, is infinitely divisible. Indeed, Proposition 1 of Kaluza (1928) implies that any nondegenerate discrete probability distribution $\{p_x : x = 0, 1, \ldots\}$ that is log-convex or, in particular, completely monotone, is compound geometric, and, hence, infinitely divisible. Steutel (1970), Shanbhag & Sreehari (1977) and Steutel & van Harn (2004, Chapter VI) have given certain extensions or variations of one or more of these results. Following a modified version of the C.R. Rao et al. (2009, Section 4) approach based on the Wiener-Hopf factorization, we establish some further results of significance to the literature on infinite divisibility.
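For readers unfamiliar with the terms in this abstract, the standard definitions can be written out as follows. The notation ($P$, $Q$, $q$, $P_n$) is introduced here for illustration and is not taken from the paper.

```latex
% Log-convexity of a discrete distribution \{p_x : x = 0, 1, \ldots\}:
p_x^2 \le p_{x-1}\, p_{x+1}, \qquad x = 1, 2, \ldots

% Compound geometric law, via its probability generating function,
% with Q the generating function of a distribution on \{1, 2, \ldots\}:
P(z) = \sum_{x \ge 0} p_x z^x = \frac{1-q}{1 - q\,Q(z)}, \qquad 0 < q < 1.

% Why compound geometric implies infinitely divisible: expanding the logarithm
% exhibits P as a compound Poisson generating function,
\log P(z) = \sum_{k \ge 1} \frac{q^k}{k}\,\bigl(Q(z)^k - 1\bigr),

% so that for every n \ge 1, P(z) = \bigl(P_n(z)\bigr)^n with
P_n(z) = \exp\!\Bigl(\tfrac{1}{n} \sum_{k \ge 1} \tfrac{q^k}{k}\,\bigl(Q(z)^k - 1\bigr)\Bigr).
```

The last two lines are the usual argument that a compound geometric law is compound Poisson and hence infinitely divisible, which is the "hence" step in the abstract.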

Submitted 26 September, 2011; originally announced September 2011.

Comments: 18 pages, no figures, To appear in the Electronic Journal of Probability

MSC Class: Primary 60E05; Secondary 62E10

Journal ref: Electronic Journal of Probability, Vol. 16, 2359-2374 (2011)

arXiv:0909.5289 [pdf, other]

Moment properties of multivariate infinitely divisible laws and criteria for self-decomposability

Authors: Theofanis Sapatinas, Damodar N. Shanbhag

Abstract: Ramachandran (1969, Theorem 8) has shown that for any univariate infinitely divisible distribution and any positive real number $α$, an absolute moment of order $α$ relative to the distribution exists (as a finite number) if and only if this is so for a certain truncated version of the corresponding Lévy measure. A generalized version of this result in the case of multivariate infinitely divisible distributions, involving the concept of g-moments, is given by Sato (1999, Theorem 25.3). We extend Ramachandran's theorem to the multivariate case, keeping in mind the immediate requirements under appropriate assumptions of cumulant studies of the distributions referred to; the format of Sato's theorem just referred to obviously varies from ours and seems to have a different agenda. Also, appealing to a further criterion based on the Lévy measure, we identify in a certain class of multivariate infinitely divisible distributions the distributions that are self-decomposable; this throws new light on structural aspects of certain multivariate distributions such as the multivariate generalized hyperbolic distributions studied by Barndorff-Nielsen (1977) and others. Various points of relevance to the study are also addressed through specific examples.

Submitted 29 September, 2009; originally announced September 2009.

Comments: 22 pages (To appear in: Journal of Multivariate Analysis)

MSC Class: 60E07 (primary); 60E05; 60G51; 62H10 (secondary)

Journal ref: Journal of Multivariate Analysis, Vol. 101, 500-511, (2010)

Showing 1–21 of 21 results for author: Shanbhag, N