-
Speech Understanding on Tiny Devices with A Learning Cache
Authors:
Afsara Benazir,
Zhiming Xu,
Felix Xiaozhu Lin
Abstract:
This paper addresses spoken language understanding (SLU) on microcontroller-like embedded devices, integrating on-device execution with cloud offloading in a novel fashion. We leverage temporal locality in the speech inputs to a device and reuse recent SLU inferences accordingly. Our idea is simple: let the device match incoming inputs against cached results, and only offload inputs not matched to any cached ones to the cloud for full inference. Realization of this idea, however, is non-trivial: the device needs to compare acoustic features in a robust yet low-cost way. To this end, we present SpeechCache (or SC), a speech cache for tiny devices. It matches speech inputs at two levels of representations: first by sequences of clustered raw sound units, then as sequences of phonemes. Working in tandem, the two representations offer complementary tradeoffs between cost and efficiency. To boost accuracy even further, our cache learns to personalize: with the mismatched and then offloaded inputs, it continuously finetunes the device's feature extractors with the assistance of the cloud. We implement SC on an off-the-shelf STM32 microcontroller. The complete implementation has a small memory footprint of 2MB. Evaluated on challenging speech benchmarks, our system resolves 45%-90% of inputs on device, reducing the average latency by up to 80% compared to offloading to popular cloud speech recognition services. The benefit brought by our proposed SC is notable even in adversarial settings - noisy environments, cold cache, or one device shared by a number of users.
Submitted 8 May, 2024; v1 submitted 29 November, 2023;
originally announced November 2023.
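The match-then-offload idea in the SpeechCache abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `TwoLevelSpeechCache`, the normalized edit-distance criterion, the `max_dist` threshold, and the `offload` callback are all assumptions, and both token sequences are passed in precomputed for simplicity (a real device would likely extract phonemes only after the cheaper sound-unit match fails).

```python
from typing import Callable, List, Optional, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


class TwoLevelSpeechCache:
    """Hypothetical cache: match on sound units first, then on phonemes."""

    def __init__(self, offload: Callable[[str], str], max_dist: float = 0.2):
        self.offload = offload    # stand-in for full cloud inference
        self.max_dist = max_dist  # assumed normalized-distance threshold
        self.entries: List[Tuple[tuple, tuple, str]] = []

    def _find(self, query: Sequence[str], level: int) -> Optional[tuple]:
        # level 0 compares sound-unit sequences, level 1 phoneme sequences
        for entry in self.entries:
            cached = entry[level]
            norm = max(len(query), len(cached), 1)
            if edit_distance(query, cached) / norm <= self.max_dist:
                return entry
        return None

    def infer(self, units: Sequence[str], phonemes: Sequence[str],
              audio: str) -> Tuple[str, bool]:
        """Return (result, served_from_cache)."""
        entry = self._find(units, 0)
        if entry is None:
            entry = self._find(phonemes, 1)
        if entry is not None:
            return entry[2], True
        # Miss at both levels: offload, then cache the result for reuse.
        result = self.offload(audio)
        self.entries.append((tuple(units), tuple(phonemes), result))
        return result, False
```

In this sketch a repeated command is resolved locally whenever either representation is close enough to a cached entry, and only unmatched inputs reach the `offload` callback.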
-
Turbocharge Speech Understanding with Pilot Inference
Authors:
Rongxiang Wang,
Felix Xiaozhu Lin
Abstract:
Modern speech understanding (SU) runs a sophisticated pipeline: ingesting streaming voice input, the pipeline executes encoder-decoder based deep neural networks repeatedly; by doing so, the pipeline generates tentative outputs (called hypotheses), and periodically scores the hypotheses. This paper sets out to accelerate SU on resource-constrained edge devices. It takes a hybrid approach: it speeds up on-device execution and offloads inputs that are beyond the device's capacity. While the approach is well-known, we address SU's unique challenges with novel techniques: (1) late contextualization, which executes a model's attentive encoder in parallel to the input ingestion; (2) pilot inference, which mitigates the SU pipeline's temporal load imbalance; (3) autoregression offramps, which evaluate offloading decisions based on pilot inferences and hypotheses. Our techniques are compatible with existing speech models, pipelines, and frameworks; they can be applied independently or in combination. Our prototype, called PASU, is tested on Arm platforms with 6-8 cores: it delivers SOTA accuracy; it reduces the end-to-end latency by 2x and reduces the offloading needs by 2x.
Submitted 10 October, 2024; v1 submitted 22 November, 2023;
originally announced November 2023.
-
Secure and Effective Data Appraisal for Machine Learning
Authors:
Xu Ouyang,
Changhong Yang,
Felix Xiaozhu Lin,
Yangfeng Ji
Abstract:
Essential for an unfettered data market is the ability to discreetly select and evaluate training data before finalizing a transaction between the data owner and model owner. To safeguard the privacy of both data and model, this process involves scrutinizing the target model through Multi-Party Computation (MPC). While prior research has posited that the MPC-based evaluation of Transformer models is excessively resource-intensive, this paper introduces an innovative approach that renders data selection practical. The contributions of this study encompass three pivotal elements: (1) a groundbreaking pipeline for confidential data selection using MPC, (2) replicating intricate high-dimensional operations with simplified low-dimensional MLPs trained on a limited subset of pertinent data, and (3) implementing MPC in a concurrent, multi-phase manner. The proposed method is assessed across an array of Transformer models and NLP/CV benchmarks. In comparison to the direct MPC-based evaluation of the target model, our approach substantially reduces the time required, from thousands of hours to mere tens of hours, with only a nominal 0.20% dip in accuracy when training with the selected data.
Submitted 24 January, 2024; v1 submitted 3 October, 2023;
originally announced October 2023.
-
Federated Few-Shot Learning for Mobile NLP
Authors:
Dongqi Cai,
Shangguang Wang,
Yaozong Wu,
Felix Xiaozhu Lin,
Mengwei Xu
Abstract:
Natural language processing (NLP) sees rich mobile applications. To support various language understanding tasks, a foundation NLP model is often fine-tuned in a federated, privacy-preserving setting (FL). This process currently relies on at least hundreds of thousands of labeled training samples from mobile clients; yet mobile users often lack willingness or knowledge to label their data. Such an inadequacy of data labels is known as a few-shot scenario; it becomes the key blocker for mobile NLP applications.
For the first time, this work investigates federated NLP in the few-shot scenario (FedFSL). By retrofitting algorithmic advances of pseudo labeling and prompt learning, we first establish a training pipeline that delivers competitive accuracy when only 0.05% (fewer than 100) of the training data is labeled and the rest is unlabeled. To instantiate the workflow, we further present a system, FeS, which addresses the high execution cost with novel designs: (1) curriculum pacing, which injects pseudo labels into the training workflow at a rate commensurate with the learning progress; (2) representational diversity, a mechanism for selecting the most learnable data, only for which pseudo labels will be generated; (3) co-planning of a model's training depth and layer capacity. Together, these designs reduce the training delay, client energy, and network traffic by up to 46.0$\times$, 41.2$\times$ and 3000.0$\times$, respectively. Through algorithm/system co-design, FFNLP demonstrates that FL can apply to challenging settings where most training samples are unlabeled.
Submitted 19 August, 2023; v1 submitted 12 December, 2022;
originally announced December 2022.
-
Towards Practical Few-shot Federated NLP
Authors:
Dongqi Cai,
Yaozong Wu,
Haitao Yuan,
Shangguang Wang,
Felix Xiaozhu Lin,
Mengwei Xu
Abstract:
Transformer-based pre-trained models have emerged as the predominant solution for natural language processing (NLP). Fine-tuning such pre-trained models for downstream tasks often requires a considerable amount of labeled private data. In practice, private data is often distributed across heterogeneous mobile devices and may be prohibited from being uploaded. Moreover, well-curated labeled data is often scarce, presenting an additional challenge. To address these challenges, we first introduce a data generator for federated few-shot learning tasks, which encompasses the quantity and skewness of scarce labeled data in a realistic setting. Subsequently, we propose AUG-FedPrompt, a prompt-based federated learning system that exploits abundant unlabeled data for data augmentation. Our experiments indicate that AUG-FedPrompt can perform on par with full-set fine-tuning with a limited amount of labeled data. However, such competitive performance comes at a significant system cost.
Submitted 19 August, 2023; v1 submitted 30 November, 2022;
originally announced December 2022.
-
Plasma lensing near the eclipses of the Black Widow pulsar B1957+20
Authors:
Fang Xi Lin,
Robert Main,
Dylan Jow,
Dongzi Li,
Ue-Li Pen,
Marten H. van Kerkwijk
Abstract:
Recently, several eclipsing millisecond pulsars have been shown to experience strong and apparent weak lensing from the outflow of their ionized companions. Lensing can be a powerful probe of the ionized plasma, with the strongest lenses potentially resolving emission regions of pulsars. Understanding lensing in the `laboratory-like' conditions of an eclipsing pulsar may be analogously applied to fast radio bursts, many of which reside in dense, magnetized environments. We examined variable dispersion measure (DM), absorption, scattering, and flux density in the original Black Widow pulsar PSR B1957+20 through an eclipse at the Arecibo Observatory at 327 MHz. We discovered clear evidence of the two regimes of lensing, strong and apparent weak. We show that the flux density variations in the apparently weak lensing regime can be modeled directly from variations of DM, using geometric optics. The mean effective velocities in the ingress, $954\pm 99$ km/s, and egress $604\pm 47$ km/s cannot be explained by orbital motions alone, but are consistent with significant outflow velocity of material from the companion. We also show that geometric optics can predict when and where the lensing regime-change between weak and strong occurs, and argue that the apparent weak lensing is due to averaging many images. Our framework can be applied in any source with variable electron columns, measuring their relative velocities and distances. In other eclipsing pulsars, this provides a unique opportunity to measure companion outflow velocity, predict regions of weak and strong lensing, and in principle independently constrain orbital inclinations.
Submitted 23 November, 2022; v1 submitted 29 August, 2022;
originally announced August 2022.
-
DM-power: an algorithm for high precision dispersion measure with application to fast radio bursts
Authors:
Hsiu-Hsien Lin,
Robert Main,
Ue-Li Pen,
Robert Wharton,
Marlon Luis Bause,
Suryarao Bethapudi,
Dongzi Li,
Fang Xi Lin,
Visweshwar Ram Marthi,
Laura G Spitler
Abstract:
We present DM-power, a new method for precisely determining the dispersion measure (DM) of radio bursts, and apply it to the Fast Radio Burst (FRB) source FRB 20180916B. Motivated by the complex structure on multiple time scales seen in FRBs, DM-power optimizes the DM by combining measurements at multiple Fourier frequencies in the power spectrum of the burst. By optimally weighting the measurements at each Fourier frequency, DM-power finds a burst DM that effectively incorporates information on many different burst timescales. We validate this technique on simulated Gaussian pulse profiles with a precision down to $\sigma_{\rm DM} \sim 0.001~{\rm pc~cm}^{-3}$, and then apply it to bursts from pulsar B0329+54 and FRB 20180916B. The precision of these DM measurements is sufficient to measure a statistically significant variation in DM over a $\approx 2$ hr span. While this variation could be the result of electron density variations along the line of sight, it is more likely that the observed variation is the result of intrinsic frequency-dependent burst structure that can mimic a dispersive delay.
Submitted 21 June, 2023; v1 submitted 29 August, 2022;
originally announced August 2022.
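For context on what a DM precision of order 0.001 pc cm^-3 means in time, the standard cold-plasma dispersion delay can be computed as below. This is not the DM-power algorithm itself, only the textbook relation between DM and arrival-time delay; the function name and the 400-800 MHz band used in the note afterwards are illustrative assumptions.

```python
# Dispersion constant: delay in seconds for DM = 1 pc cm^-3 at 1 GHz
# (equivalently 4.148808e3 MHz^2 pc^-1 cm^3 s).
K_DM_S_GHZ2 = 4.148808e-3


def dispersion_delay_s(dm: float, f_lo_ghz: float, f_hi_ghz: float) -> float:
    """Extra arrival delay (s) at f_lo relative to f_hi for a given DM.

    dm is in pc cm^-3 and both frequencies are in GHz.
    """
    return K_DM_S_GHZ2 * dm * (f_lo_ghz ** -2 - f_hi_ghz ** -2)
```

Under these assumptions, a DM error of 0.001 pc cm^-3 across a hypothetical 0.4-0.8 GHz band corresponds to `dispersion_delay_s(0.001, 0.4, 0.8)`, about 19 microseconds of differential delay.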
-
Efficient NLP Model Finetuning via Multistage Data Filtering
Authors:
Xu Ouyang,
Shahina Mohd Azam Ansari,
Felix Xiaozhu Lin,
Yangfeng Ji
Abstract:
As model finetuning is central to modern NLP, we set out to maximize its efficiency. Motivated by redundancy in training examples and the sheer sizes of pretrained models, we exploit a key opportunity: training only on important data. To this end, we filter training examples in a streaming fashion, in tandem with training the target model. We use two key techniques: (1) automatically determining a training loss threshold for skipping backward training passes; (2) running a meta predictor for further skipping forward training passes. We integrate the above techniques in a holistic, three-stage training process. On a diverse set of benchmarks, our method reduces the required training examples by up to 5.3$\times$ and training time by up to 6.8$\times$, while only seeing minor accuracy degradation. Our method is effective even when training one epoch, where each training example is encountered only once. It is simple to implement and is compatible with the existing finetuning techniques. Code is available at: https://github.com/xo28/efficient-NLP-multistage-training
Submitted 18 May, 2023; v1 submitted 28 July, 2022;
originally announced July 2022.
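The first technique in the data-filtering abstract above, skipping the backward pass for examples whose loss falls below a threshold, can be illustrated with a toy streaming trainer. This is a sketch under stated assumptions, not the paper's method: the one-parameter model, the exponential-moving-average threshold, and the names `train_with_loss_filter`, `lr`, and `decay` are hypothetical, and the meta predictor for skipping forward passes is not modeled.

```python
from typing import Iterable, Tuple


def train_with_loss_filter(stream: Iterable[Tuple[float, float]],
                           lr: float = 0.05,
                           decay: float = 0.9) -> Tuple[float, int, int]:
    """Fit y = w * x on a stream, skipping updates for low-loss examples.

    Returns (w, number_of_backward_passes, number_of_examples_seen).
    """
    w = 0.0
    threshold = 0.0  # assumed rule: running average of observed losses
    seen = 0
    backward_passes = 0
    for x, y in stream:
        err = w * x - y  # forward pass
        loss = err * err
        if seen == 0:
            threshold = loss
        else:
            threshold = decay * threshold + (1.0 - decay) * loss
        seen += 1
        if loss < threshold:
            continue  # "unimportant" example: skip the backward pass
        w -= lr * 2.0 * err * x  # backward pass and SGD update
        backward_passes += 1
    return w, backward_passes, seen
```

Each example is seen once, as in the single-epoch setting the abstract mentions; the saving comes from the examples for which only the forward pass runs.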
-
STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining
Authors:
Liwei Guo,
Wonkyo Choe,
Felix Xiaozhu Lin
Abstract:
Natural Language Processing (NLP) inference is seeing increasing adoption by mobile applications, where on-device inference is desirable for crucially preserving user data privacy and avoiding network roundtrips. Yet, the unprecedented size of an NLP model stresses both latency and memory, creating a tension between the two key resources of a mobile device. To meet a target latency, holding the whole model in memory launches execution as soon as possible but increases one app's memory footprint by several times, limiting its benefits to only a few inferences before being recycled by mobile memory management. On the other hand, loading the model from storage on demand incurs IO as long as a few seconds, far exceeding the delay range acceptable to a user; pipelining layerwise model loading and execution does not hide IO either, due to the high skewness between IO and computation delays.
To this end, we propose Speedy Transformer Inference (STI). Built on the key idea of maximizing IO/compute resource utilization on the most important parts of a model, STI reconciles the latency vs. memory tension via two novel techniques. First, model sharding. STI manages model parameters as independently tunable shards, and profiles their importance to accuracy. Second, elastic pipeline planning with a preload buffer. STI instantiates an IO/compute pipeline and uses a small buffer for preload shards to bootstrap execution without stalling at early stages; it judiciously selects, tunes, and assembles shards per their importance for resource-elastic execution, maximizing inference accuracy.
Atop two commodity SoCs, we build STI and evaluate it against a wide range of NLP tasks, under a practical range of target latencies, and on both CPU and GPU. We demonstrate that STI delivers high accuracies with 1-2 orders of magnitude lower memory, outperforming competitive baselines.
Submitted 31 January, 2023; v1 submitted 11 July, 2022;
originally announced July 2022.
-
Protecting File Activities via Deception for ARM TrustZone
Authors:
Liwei Guo,
Kaiyang Zhao,
Yiying Zhang,
Felix Xiaozhu Lin
Abstract:
A TrustZone TEE often invokes an external filesystem. While file data can be encrypted, the revealed file activities can leak secrets. To hide the file activities from the filesystem and its OS, we propose Enigma, a deception-based defense injecting sybil file activities as the cover of the actual file activities.
Enigma contributes three new designs. (1) To make the deception credible, the TEE generates sybil calls by replaying file calls from the TEE code under protection. (2) To make sybil activities cheap, the TEE requests the OS to run K filesystem images simultaneously. Concealing the disk, the TEE backs only one image with the actual disk while backing other images by only storing their metadata. (3) To protect filesystem image identities, the TEE shuffles the images frequently, preventing the OS from observing any image for long.
Enigma works with unmodified filesystems shipped with Linux. On a low-cost Arm SoC with EXT4 and F2FS, our system can concurrently run as many as 50 filesystem images with 1% of disk overhead per additional image. Compared to common obfuscation for hiding addresses in a flat space, Enigma hides file activities with richer semantics. Its cost is lower by one order of magnitude while achieving the same level of probabilistic security guarantees.
Submitted 24 May, 2022; v1 submitted 22 May, 2022;
originally announced May 2022.
-
FedAdapter: Efficient Federated Learning for Modern NLP
Authors:
Dongqi Cai,
Yaozong Wu,
Shangguang Wang,
Felix Xiaozhu Lin,
Mengwei Xu
Abstract:
Transformer-based pre-trained models have revolutionized NLP for superior performance and generality. Fine-tuning pre-trained models for downstream tasks often requires private data, for which federated learning is the de-facto approach (i.e., FedNLP). However, our measurements show that FedNLP is prohibitively slow due to the large model sizes and the resultant high network/computation cost. Towards practical FedNLP, we identify as the key building blocks adapters, small bottleneck modules inserted at a variety of model layers. A key challenge is to properly configure the depth and width of adapters, to which the training speed and efficiency are highly sensitive. No silver-bullet configuration exists: the optimal choice varies across downstream NLP tasks, desired model accuracy, and mobile resources. To automate adapter configuration, we propose FedAdapter, a framework that enhances the existing FedNLP with two novel designs. First, FedAdapter progressively upgrades the adapter configuration throughout a training session; the principle is to quickly learn shallow knowledge by only training fewer and smaller adapters at the model's top layers, and incrementally learn deep knowledge by incorporating deeper and larger adapters. Second, FedAdapter continuously profiles future adapter configurations by allocating participant devices to trial groups. Extensive experiments show that FedAdapter can reduce FedNLP's model convergence delay to no more than several hours, which is up to 155.5$\times$ faster compared to vanilla FedNLP and 48$\times$ faster compared to strong baselines.
Submitted 8 May, 2023; v1 submitted 20 May, 2022;
originally announced May 2022.
-
Safe and Practical GPU Acceleration in TrustZone
Authors:
Heejin Park,
Felix Xiaozhu Lin
Abstract:
We present a holistic design for GPU-accelerated computation in TrustZone TEE. Without pulling the complex GPU software stack into the TEE, we follow a simple approach: record the CPU/GPU interactions ahead of time, and replay the interactions in the TEE at run time. This paper addresses the approach's key missing piece -- the recording environment, which needs both strong security and access to diverse mobile GPUs. To this end, we present a novel architecture called CODY, in which a mobile device (which possesses the GPU hardware) and a trustworthy cloud service (which runs the GPU software) exercise the GPU hardware/software in a collaborative, distributed fashion. To overcome numerous network round trips and long delays, CODY contributes optimizations specific to mobile GPUs: register access deferral, speculation, and metastate-only synchronization. With these optimizations, recording a compute workload takes only tens of seconds, which is up to 95% less than a naive approach; replay incurs 25% lower delays compared to insecure, native execution.
Submitted 4 November, 2021;
originally announced November 2021.
-
Minimum Viable Device Drivers for ARM TrustZone
Authors:
Liwei Guo,
Felix Xiaozhu Lin
Abstract:
While TrustZone can isolate IO hardware, it lacks drivers for modern IO devices. Rather than porting drivers, we propose a novel approach to deriving minimum viable drivers: developers exercise a full driver and record the driver/device interactions; the processed recordings, dubbed driverlets, are replayed in the TEE at run time to access IO devices.
Driverlets address two key challenges: correctness and expressiveness, for which they build on a key construct called interaction template. The interaction template ensures faithful reproduction of recorded IO jobs (albeit on new IO data); it accepts dynamic input values; it tolerates nondeterministic device behaviors.
We demonstrate driverlets on a series of sophisticated devices, making them accessible to TrustZone for the first time to our knowledge. Our experiments show that driverlets are secure, easy to build, and incur acceptable overhead (1.4x-2.7x compared to native drivers). Driverlets fill a critical gap in the TrustZone TEE, realizing its long-promised vision of secure IO.
Submitted 15 March, 2022; v1 submitted 15 October, 2021;
originally announced October 2021.
-
Discovery and modelling of broad-scale plasma lensing in black-widow pulsar J2051$-$0827
Authors:
F. X. Lin,
R. A. Main,
J. P. W. Verbiest,
M. Kramer,
G. Shaifullah
Abstract:
We report on an unusually bright observation of PSR J2051$-$0827 recorded during a regular monitoring campaign of black-widow pulsar systems with the Effelsberg 100-m telescope. Through fortunate coincidence, a particularly bright scintillation maximum is simultaneous with the eclipse by the companion, enabling precise measurements of variations in the flux density, dispersion measure (DM), and scattering strength throughout the eclipse. The flux density is highly variable throughout the eclipse, with a peak 1.7 times the average away from the eclipse, and yet does not significantly decrease on average. We recover the flux density variations from the measured DM variations using geometric optics, with a relative velocity as the only free parameter. We measure an effective velocity of (470 $\pm$ 10) km/s, consistent with the relative orbital motion of the companion, suggesting that the outflow velocity of the lensing material is low, or is directly along the line of sight. The 2 per cent uncertainty on the effective velocity is a formal error; systematics related to our current model are likely to dominate, and we detail several extensions to the model to be considered in a full treatment of lensing. This is a demonstration of the causal link between DM and lensing; the flux density variations can be predicted directly through the derivatives of DM. Going forward, this approach can be applied to investigate the dynamics of other eclipsing systems, and to investigate the physical nature of scintillation and lensing in the ionized interstellar medium.
Submitted 22 July, 2021; v1 submitted 23 June, 2021;
originally announced June 2021.
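The statement in the J2051$-$0827 abstract that flux density variations follow from the derivatives of DM can be illustrated with the thin-screen, geometric-optics magnification $\mu = \left[1 - \frac{\lambda^2 r_e d_{\rm eff}}{2\pi v_{\rm eff}^2}\,\frac{d^2 {\rm DM}}{dt^2}\right]^{-1}$. The sketch below assumes a one-dimensional lens, a single image, this particular sign convention, and positions mapped to time via $x = v_{\rm eff} t$; the function name and the effective distance `d_eff` are illustrative assumptions rather than the paper's exact model.

```python
import math
from typing import List, Sequence

R_E = 2.8179403262e-15                # classical electron radius, m
C = 299_792_458.0                     # speed of light, m/s
PC_CM3_TO_M2 = 3.0856775814913673e22  # 1 pc cm^-3 expressed in m^-2


def magnification_from_dm(dm: Sequence[float], dt: float, freq_hz: float,
                          v_eff: float, d_eff: float) -> List[float]:
    """Geometric-optics magnification at the interior samples of a DM series.

    dm is in pc cm^-3, dt in s, v_eff in m/s, d_eff in m.
    Valid only while the bracketed term stays positive (no caustics).
    """
    lam = C / freq_hz
    coeff = lam ** 2 * R_E * d_eff / (2.0 * math.pi * v_eff ** 2)
    mu = []
    for i in range(1, len(dm) - 1):
        # central finite difference for d^2 DM / dt^2, converted to SI units
        d2 = (dm[i + 1] - 2.0 * dm[i] + dm[i - 1]) * PC_CM3_TO_M2 / dt ** 2
        mu.append(1.0 / (1.0 - coeff * d2))
    return mu
```

With this convention a local DM minimum (positive second derivative) focuses and a DM maximum defocuses, consistent with an overdense plasma clump acting as a diverging lens.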
-
Profile changes associated with DM events in PSR J1713+0747
Authors:
Fang Xi Lin,
Hsiu-Hsien Lin,
Jing Luo,
Robert Main,
James McKee,
Ue-Li Pen,
Dana Simard,
Marten H. van Kerkwijk
Abstract:
Propagation effects in the interstellar medium and intrinsic profile changes can cause variability in the timing of pulsars, which limits the accuracy of fundamental science done via pulsar timing. One of the best timing pulsars, PSR J1713+0747, has gone through two `dip' events in its dispersion measure (DM) time series. If these events reflect real changes in electron column density, they should lead to multiple imaging. We show that the events are well fitted by an underdense corrugated sheet model, and look for associated variability in the pulse profile using principal component analysis. We find that there are transient pulse profile variations, but they vary in concert with the dispersion measure, unlike what is expected from lensing due to a corrugated sheet. The change is consistent in shape across profiles from both the Green Bank and Arecibo radio observatories, and its amplitude appears to be achromatic across the 820-MHz, 1.4-GHz, and 2.3-GHz bands, again unlike what is expected from interference between lensed images. This result is puzzling. We note that some of the predicted lensing effects would need higher time and frequency resolution data than used in this analysis. Future events appear likely, and storing baseband data or keeping multiple time-frequency resolutions will allow more in-depth study of propagation effects and hence improvements to pulsar timing accuracy.
Submitted 6 October, 2021; v1 submitted 17 June, 2021;
originally announced June 2021.
-
GPUReplay: A 50-KB GPU Stack for Client ML
Authors:
Heejin Park,
Felix Xiaozhu Lin
Abstract:
GPUReplay (GR) is a novel way for deploying GPU-accelerated computation on mobile and embedded devices. It addresses high complexity of a modern GPU stack for deployment ease and security. The idea is to record GPU executions on the full GPU stack ahead of time and replay the executions on new input at run time. We address key challenges towards making GR feasible, sound, and practical to use. The resultant replayer is a drop-in replacement of the original GPU stack. It is tiny (50 KB of executable), robust (replaying long executions without divergence), portable (running in a commodity OS, in TEE, and baremetal), and quick to launch (speeding up startup by up to two orders of magnitude). We show that GPUReplay works with a variety of integrated GPU hardware, GPU APIs, ML frameworks, and 33 neural network (NN) implementations for inference or training. The code is available at https://github.com/bakhi/GPUReplay.
Submitted 3 April, 2022; v1 submitted 4 May, 2021;
originally announced May 2021.
-
Imaginary images and Stokes phenomena in the weak plasma lensing of coherent sources
Authors:
Dylan L. Jow,
Fang Xi Lin,
Emily Tyhurst,
Ue-Li Pen
Abstract:
The study of astrophysical plasma lensing, such as in the case of extreme scattering events, has typically been conducted using the geometric limit of optics, neglecting wave effects. However, for the lensing of coherent sources such as pulsars and fast radio bursts (FRBs), wave effects can play an important role. Asymptotic methods, such as the so-called Eikonal limit, also known as the stationary phase approximation, have been used to include first-order wave effects; however, these methods fail at Stokes lines. Stokes lines are generic features of a variety of lens models, and are regions in parameter space where imaginary images begin to contribute to the overall intensity modulation of lensed sources. Using the mathematical framework of Picard-Lefschetz theory to compute diffraction integrals, we argue that these imaginary images contain as much information as their geometric counterparts, and may potentially be observable in data. Thus, weak-lensing events where these imaginary images are present can be as useful for inferring lens parameters as strong-lensing events in which multiple geometric images are present.
Submitted 13 October, 2021; v1 submitted 15 March, 2021;
originally announced March 2021.
-
Enabling Large Neural Networks on Tiny Microcontrollers with Swapping
Authors:
Hongyu Miao,
Felix Xiaozhu Lin
Abstract:
Running neural networks (NNs) on microcontroller units (MCUs) is becoming increasingly important, but is very difficult due to the tiny SRAM size of MCUs. Prior work proposes many algorithm-level techniques to reduce NN memory footprints, but all at the cost of sacrificing accuracy and generality, which disqualifies MCUs for many important use cases. We investigate a system solution for MCUs to execute NNs out of core: dynamically swapping NN data chunks between an MCU's tiny SRAM and its large, low-cost external flash. Out-of-core NNs on MCUs raise multiple concerns: execution slowdown, storage wear out, energy consumption, and data security. We present a study showing that none is a showstopper; the key benefit -- MCUs being able to run large NNs with full accuracy and generality -- outweighs the overheads. Our findings suggest that MCUs can play a much greater role in edge intelligence.
Submitted 1 September, 2021; v1 submitted 14 January, 2021;
originally announced January 2021.
-
Clique: Spatiotemporal Object Re-identification at the City Scale
Authors:
Tiantu Xu,
Kaiwen Shen,
Yang Fu,
Humphrey Shi,
Felix Xiaozhu Lin
Abstract:
Object re-identification (ReID) is a key application of city-scale cameras. While classic ReID tasks are often considered as image retrieval, we treat them as spatiotemporal queries for locations and times in which the target object appeared. Spatiotemporal reID is challenged by the accuracy limitation in computer vision algorithms and the colossal videos from city cameras. We present Clique, a practical ReID engine that builds upon two new techniques: (1) Clique assesses target occurrences by clustering fuzzy object features extracted by ReID algorithms, with each cluster representing the general impression of a distinct object to be matched against the input; (2) to search in videos, Clique samples cameras to maximize the spatiotemporal coverage and incrementally adds cameras for processing on demand. Through evaluation on 25 hours of videos from 25 cameras, Clique reached a high accuracy of 0.87 (recall at 5) across 70 queries and runs at 830x of video realtime in achieving high accuracy.
Submitted 16 December, 2020;
originally announced December 2020.
-
Grand Challenges in Resilience: Autonomous System Resilience through Design and Runtime Measures
Authors:
Saurabh Bagchi,
Vaneet Aggarwal,
Somali Chaterji,
Fred Douglis,
Aly El Gamal,
Jiawei Han,
Brian J. Henz,
Hank Hoffmann,
Suman Jana,
Milind Kulkarni,
Felix Xiaozhu Lin,
Karen Marais,
Prateek Mittal,
Shaoshuai Mou,
Xiaokang Qiu,
Gesualdo Scutari
Abstract:
A set of about 80 researchers, practitioners, and federal agency program managers participated in the NSF-sponsored Grand Challenges in Resilience Workshop held on Purdue campus on March 19-21, 2019. The workshop was divided into three themes: resilience in cyber, cyber-physical, and socio-technical systems. About 30 attendees in all participated in the discussions of cyber resilience. This article brings out the substantive parts of the challenges and solution approaches that were identified in the cyber resilience theme. In this article, we put forward the substantial challenges in cyber resilience in a few representative application domains and outline foundational solutions to address these challenges. These solutions fall into two broad themes: resilience-by-design and resilience-by-reaction. We use examples of autonomous systems as the application drivers motivating cyber resilience. We focus on some autonomous systems in the near horizon (autonomous ground and aerial vehicles) and also a little more distant (autonomous rescue and relief).
For resilience-by-design, we focus on design methods in software that are needed for our cyber systems to be resilient. In contrast, for resilience-by-reaction, we discuss how to make systems resilient by responding, reconfiguring, or recovering at runtime when failures happen. We also discuss the notion of adaptive execution to improve resilience, i.e., executing transparently and adaptively among available execution platforms (mobile/embedded, edge, and cloud). For each of the two themes, we survey the current state, the desired state, and ways to get there. We conclude the paper by looking at the research challenges we will have to solve in the short and the mid-term to make the vision of resilient autonomous systems a reality.
Submitted 9 May, 2020; v1 submitted 25 December, 2019;
originally announced December 2019.
-
Approximate Query Service on Autonomous IoT Cameras
Authors:
Mengwei Xu,
Xiwen Zhang,
Yunxin Liu,
Gang Huang,
Xuanzhe Liu,
Felix Xiaozhu Lin
Abstract:
Elf is a runtime for an energy-constrained camera to continuously summarize video scenes as approximate object counts. Elf's novelty centers on planning the camera's count actions under energy constraint. (1) Elf explores the rich action space spanned by the number of sample image frames and the choice of per-frame object counters; it unifies errors from both sources into one single bounded error. (2) To decide count actions at run time, Elf employs a learning-based planner, jointly optimizing for past and future videos without delaying result materialization. Tested with more than 1,000 hours of videos and under realistic energy constraints, Elf continuously generates object counts within only 11% of the true counts on average. Alongside the counts, Elf presents narrow errors shown to be bounded and up to 3.4x smaller than competitive baselines. At a higher level, Elf makes a case for advancing the geographic frontier of video analytics.
Submitted 5 May, 2020; v1 submitted 2 September, 2019;
originally announced September 2019.
-
Video Analytics with Zero-streaming Cameras
Authors:
Mengwei Xu,
Tiantu Xu,
Yunxin Liu,
Felix Xiaozhu Lin
Abstract:
Low-cost cameras enable powerful analytics. An unexploited opportunity is that most captured videos remain "cold" without being queried. For efficiency, we advocate for these cameras to be zero streaming: capturing videos to local storage and communicating with the cloud only when analytics is requested. How to query zero-streaming cameras efficiently? Our response is a camera/cloud runtime system called DIVA. It addresses two key challenges: to best use limited camera resource during video capture; to rapidly explore massive videos during query execution. DIVA contributes two unconventional techniques. (1) When capturing videos, a camera builds sparse yet accurate landmark frames, from which it learns reliable knowledge for accelerating future queries. (2) When executing a query, a camera processes frames in multiple passes with increasingly more expensive operators. As such, DIVA presents and keeps refining inexact query results throughout the query's execution. On diverse queries over 15 videos lasting 720 hours in total, DIVA runs at more than 100x video realtime and outperforms competitive alternative designs. To our knowledge, DIVA is the first system for querying large videos stored on low-cost remote cameras.
Submitted 17 June, 2021; v1 submitted 28 April, 2019;
originally announced April 2019.
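The multi-pass idea in the abstract above (cheap operators first, more expensive ones later, with inexact results available after every pass) can be illustrated with a minimal sketch. This is not DIVA's implementation: the function `multipass_query`, the operator names, and the toy frame fields `motion` and `score` are all hypothetical, and a plain filter cascade is assumed as the refinement step.

```python
def multipass_query(frames, operators):
    """Run operators from cheapest to most expensive over a shrinking candidate set.

    `operators` is a list of (name, predicate) pairs, ordered by assumed cost.
    After each pass the current (still inexact) result is yielded, so a caller
    can show early answers and keep refining them.
    """
    candidates = list(frames)
    for name, predicate in operators:
        # Each pass only touches frames that survived the cheaper passes.
        candidates = [frame for frame in candidates if predicate(frame)]
        yield name, list(candidates)


# Toy data: hypothetical per-frame features standing in for real video frames.
frames = [{"id": i, "motion": i % 2 == 0, "score": i / 10} for i in range(10)]
operators = [
    ("cheap_motion_filter", lambda f: f["motion"]),       # assumed cheap
    ("expensive_detector", lambda f: f["score"] >= 0.4),  # assumed expensive
]
results = list(multipass_query(frames, operators))
```

Because the function is a generator, a caller can consume the result of the first pass before the expensive pass has run, which mirrors the "present and keep refining" behavior the abstract describes.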
-
Let the Cloud Watch Over Your IoT File Systems
Authors:
Liwei Guo,
Yiying Zhang,
Felix Xiaozhu Lin
Abstract:
Smart devices produce security-sensitive data and keep them in on-device storage for persistence. The current storage stack on smart devices, however, offers weak security guarantees: not only because the stack depends on a vulnerable commodity OS, but also because smart device deployments are known to be weak on security measures.
To safeguard such data on smart devices, we present a novel storage stack architecture that i) protects file data in a trusted execution environment (TEE); ii) outsources file system logic and metadata out of the TEE; iii) runs a metadata-only file system replica in the cloud for continuously verifying the on-device file system behaviors. To realize the architecture, we build Overwatch, a TrustZone-based storage stack. Overwatch addresses unique challenges including discerning metadata at fine grains, hiding network delays, and coping with cloud disconnection. On a suite of three real-world applications, Overwatch shows moderate security overheads.
Submitted 17 February, 2019;
originally announced February 2019.
-
Precipitation during high temperature aging of Al-Cu alloys: a multiscale analysis based on first principles calculations
Authors:
H. Liu,
I. Papadimitriou,
F. X. Lin,
J. LLorca
Abstract:
Precipitation during high temperature aging of Al-Cu alloys is analyzed by means of the integration of classical nucleation theory and phase-field simulations into a multiscale modelling approach based on well-established thermodynamics principles. In particular, thermal stability of $θ''$, $θ'$ and $θ$ precipitates was assessed from first principles calculations of the Helmholtz free energy while homogeneous and heterogeneous nucleation of $θ''$ and $θ'$ was analysed using classical nucleation theory. Precipitate growth was finally computed by means of a mesoscopic phase-field model. The model parameters that determine quantitatively the driving forces for each transformation were obtained by means of first principles calculations and computational thermodynamics. The predictions of the models were in good agreement with experimental results and provided a comprehensive understanding of the precipitation pathway in Al-Cu alloys. It is envisaged that the strategy presented in this investigation can be used in the future to design optimum microstructures based on the information of the different energy contributions obtained from first principles calculations.
Submitted 17 January, 2019;
originally announced January 2019.
-
StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory
Authors:
Hongyu Miao,
Myeongjae Jeon,
Gennady Pekhimenko,
Kathryn S. McKinley,
Felix Xiaozhu Lin
Abstract:
Stream analytics have an insatiable demand for memory and performance. Emerging hybrid memories combine commodity DDR4 DRAM with 3D-stacked High Bandwidth Memory (HBM) DRAM to meet such demands. However, achieving this promise is challenging because (1) HBM is capacity-limited and (2) HBM boosts performance best for sequential access and high parallelism workloads. At first glance, stream analytics appear a particularly poor match for HBM because they have high capacity demands and data grouping operations, their most demanding computations, use random access. This paper presents the design and implementation of StreamBox-HBM, a stream analytics engine that exploits hybrid memories to achieve scalable high performance. StreamBox-HBM performs data grouping with sequential access sorting algorithms in HBM, in contrast to random access hashing algorithms commonly used in DRAM. StreamBox-HBM solely uses HBM to store Key Pointer Array (KPA) data structures that contain only partial records (keys and pointers to full records) for grouping operations. It dynamically creates and manages prodigious data and pipeline parallelism, choosing when to allocate KPAs in HBM. It dynamically optimizes for both the high bandwidth and limited capacity of HBM, and the limited bandwidth and high capacity of standard DRAM. StreamBox-HBM achieves 110 million records per second and 238 GB/s memory bandwidth while effectively utilizing all 64 cores of Intel's Knights Landing, a commercial server with hybrid memory. It outperforms stream engines with sequential access algorithms without KPAs by 7x and stream engines with random access algorithms by an order of magnitude in throughput. To the best of our knowledge, StreamBox-HBM is the first stream engine optimized for hybrid memories.
Submitted 28 January, 2019; v1 submitted 4 January, 2019;
originally announced January 2019.
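The grouping approach in the StreamBox-HBM abstract above (sort compact key/pointer pairs sequentially instead of hashing full records) can be sketched as follows. This is an illustration under stated assumptions, not the engine's code: the names `build_kpa` and `group_by_sorted_kpa` are hypothetical, list indices stand in for pointers, and ordinary Python lists stand in for HBM and DRAM.

```python
from itertools import groupby
from operator import itemgetter


def build_kpa(records, key_fn):
    """Build a key-pointer array: (key, index) pairs, not full records.

    The index plays the role of a pointer into the full-record store.
    """
    return [(key_fn(record), index) for index, record in enumerate(records)]


def group_by_sorted_kpa(records, key_fn):
    """Group records by sorting the compact KPA rather than hashing records."""
    kpa = build_kpa(records, key_fn)
    kpa.sort(key=itemgetter(0))  # stable sort over the small (key, pointer) pairs
    groups = {}
    for key, run in groupby(kpa, key=itemgetter(0)):
        # Full records are dereferenced only when a group is materialized.
        groups[key] = [records[pointer] for _, pointer in run]
    return groups


# Toy stream of hypothetical sensor records.
records = [
    {"sensor": "b", "value": 3},
    {"sensor": "a", "value": 1},
    {"sensor": "b", "value": 5},
    {"sensor": "a", "value": 2},
]
grouped = group_by_sorted_kpa(records, key_fn=lambda r: r["sensor"])
```

Only the small pairs are moved during the sort, which is the property that lets a capacity-limited, bandwidth-rich memory hold the working set while full records stay in the larger memory.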
-
A First Look at Deep Learning Apps on Smartphones
Authors:
Mengwei Xu,
Jiawei Liu,
Yuanqiang Liu,
Felix Xiaozhu Lin,
Yunxin Liu,
Xuanzhe Liu
Abstract:
We are at the dawn of the deep learning explosion for smartphones. To bridge the gap between research and practice, we present the first empirical study on 16,500 of the most popular Android apps, demystifying how smartphone apps exploit deep learning in the wild. To this end, we build a new static tool that dissects apps and analyzes their deep learning functions. Our study answers three questions: what are the early adopter apps of deep learning, what do they use deep learning for, and what do their deep learning models look like. Our study has strong implications for app developers, smartphone vendors, and deep learning R&D. On one hand, our findings paint a promising picture of deep learning for smartphones, showing the prosperity of mobile deep learning frameworks as well as the prosperity of apps building their cores atop deep learning. On the other hand, our findings urge optimizations on deep learning models deployed on smartphones, the protection of these models, and validation of research ideas on these models.
Submitted 12 January, 2021; v1 submitted 8 November, 2018;
originally announced December 2018.
-
Transkernel: Bridging Monolithic Kernels to Peripheral Cores
Authors:
Liwei Guo,
Shuang Zhai,
Yi Qiao,
Felix Xiaozhu Lin
Abstract:
Smart devices see a large number of ephemeral tasks driven by background activities. In order to execute such a task, the OS kernel wakes up the platform beforehand and puts it back to sleep afterwards. In doing so, the kernel operates various IO devices and orchestrates their power state transitions. Such kernel executions are inefficient as they mismatch typical CPU hardware. They are better off running on a low-power, microcontroller-like core, i.e., peripheral core, relieving CPU from the inefficiency.
We therefore present a new OS structure, in which a lightweight virtual executor called transkernel offloads specific phases from a monolithic kernel. The transkernel translates stateful kernel execution through cross-ISA, dynamic binary translation (DBT); it emulates a small set of stateless kernel services behind a narrow, stable binary interface; it specializes for hot paths; it exploits ISA similarities for lowering DBT cost.
Through an ARM-based prototype, we demonstrate transkernel's feasibility and benefit. We show that while cross-ISA DBT is typically used under the assumption of efficiency loss, it can enable efficiency gain, even on off-the-shelf hardware.
Submitted 5 June, 2019; v1 submitted 12 November, 2018;
originally announced November 2018.
-
VStore: A Data Store for Analytics on Large Videos
Authors:
Tiantu Xu,
Luis Materon Botelho,
Felix Xiaozhu Lin
Abstract:
We present VStore, a data store for supporting fast, resource-efficient analytics over large archival videos. VStore manages video ingestion, storage, retrieval, and consumption. It controls video formats along the video data path. It is challenged by i) the huge combinatorial space of video format knobs; ii) the complex impacts of these knobs and their high profiling cost; iii) optimizing for multiple resource types. It explores an idea called backward derivation of configuration: in the opposite direction along the video data path, VStore passes the video quantity and quality expected by analytics backward to retrieval, to storage, and to ingestion. In this process, VStore derives an optimal set of video formats, optimizing for different resources in a progressive manner. VStore automatically derives large, complex configurations consisting of more than one hundred knobs over tens of video formats. In response to queries, VStore selects video formats catering to the executed operators and the target accuracy. It streams video data from disks through decoder to operators. It runs queries as fast as 362x of video realtime.
Submitted 17 February, 2019; v1 submitted 3 October, 2018;
originally announced October 2018.
-
Constraining small scale magnetic fields through plasma lensing: Application to the Black widow eclipsing pulsar binary
Authors:
Dongzi Li,
Fang Xi Lin,
Robert Main,
Ue-Li Pen,
Marten H. van Kerkwijk,
I-Sheng Yang
Abstract:
In regions with strongly varying electron density, radio emission can be magnified significantly by plasma lensing. In the presence of magnetic fields, magnification in time and frequency will be different for two circular polarizations. We show how these effects can be used to measure or constrain the magnetic field parallel to the line of sight, $B_\parallel$, as well as its spatial structure, $σ_{B_\parallel}$, in the lensing region. In addition, we discuss how generalized Faraday rotation can constrain the strength of the perpendicular field, $B_\perp$. We attempt to make such measurements for the Black Widow pulsar, PSR B1957+20, in which plasma lensing was recently discovered. For this system, pressure equilibrium suggests $B\gtrsim 20\,$G at the interface between the pulsar and companion winds, where the radio eclipse starts and ends, and where most lensing occurs. We find no evidence for large-scale magnetic fields, with, on average, $B_\parallel=0.02\pm0.09\,$G over the egress lensing region. From individual lensing events, we strongly constrain small scale magnetic structure to $σ_B<10\,$mG, thus excluding scenarios with a strong but rapidly varying field. Finally, from the lack of reduction of average circular polarization in the same region, we rule out a strong, quasi-transverse field. We cannot identify any plausible scenario in which a large magnetic field in this system is concealed, leaving the nature of the interface between the pulsar and companion winds an enigma. Our method can be applied to other sources showing plasma lensing, including other eclipsing pulsars and fast radio bursts, to study the local properties of the magnetic field.
Submitted 27 September, 2018;
originally announced September 2018.
-
StreamBox-TZ: Secure Stream Analytics at the Edge with TrustZone
Authors:
Heejin Park,
Shuang Zhai,
Long Lu,
Felix Xiaozhu Lin
Abstract:
While it is compelling to process large streams of IoT data on the cloud edge, doing so exposes the data to a sophisticated, vulnerable software stack on the edge and hence security threats. To this end, we advocate isolating the data and its computations in a trusted execution environment (TEE) on the edge, shielding them from the remaining edge software stack which we deem untrusted. This approach faces two major challenges: (1) executing high-throughput, low-delay stream analytics in a single TEE, which is constrained by a low trusted computing base (TCB) and limited physical memory; (2) verifying execution of stream analytics as the execution involves untrusted software components on the edge. In response, we present StreamBox-TZ (SBT), a stream analytics engine for an edge platform that offers strong data security, verifiable results, and good performance. SBT contributes a data plane designed and optimized for a TEE based on ARM TrustZone. It supports continuous remote attestation for analytics correctness and result freshness while incurring low overhead. SBT only adds 42.5 KB executable to the TCB (16% of the entire TCB). On an octa core ARMv8 platform, it delivers the state-of-the-art performance by processing input events up to 140 MB/sec (12M events/sec) with sub-second delay. The overhead incurred by SBT's security mechanism is less than 25%.
Submitted 5 June, 2019; v1 submitted 2 August, 2018;
originally announced August 2018.
-
Pulsar emission amplified and resolved by plasma lensing in an eclipsing binary
Authors:
Robert Main,
I-Sheng Yang,
Victor Chan,
Dongzi Li,
Fang Xi Lin,
Nikhil Mahajan,
Ue-Li Pen,
Keith Vanderlinde,
Marten H. van Kerkwijk
Abstract:
Radio pulsars scintillate because their emission travels through the ionized interstellar medium via multiple paths, which interfere with each other. It has long been realized that the scattering screens responsible for the scintillation could be used as 'interstellar lenses' to localize pulsar emission regions. Most scattering screens, however, only marginally resolve emission components, limiting results to statistical inferences and detections of small positional shifts. Since screens situated close to the source have better resolution, it should be easier to resolve emission regions of pulsars located in high density environments such as supernova remnants or binaries in which the pulsar's companion has an ionized outflow. Here, we report events of extreme plasma lensing in the 'Black Widow' pulsar, PSR B1957+20, near the phase in its 9.2 hour orbit in which its emission is eclipsed by its companion's outflow. During the lensing events, the flux is enhanced by factors of up to 70--80 at specific frequencies. The strongest events clearly resolve the emission regions: they affect the narrow main pulse and parts of the wider interpulse differently. We show that the events arise naturally from density fluctuations in the outer regions of the outflow, and infer a resolution of our lenses comparable to the pulsar's radius, about 10 km. Furthermore, the distinct frequency structures imparted by the lensing are reminiscent of what is observed for the repeating fast radio burst FRB 121102, providing observational support for the idea that this source is observed through, and thus at times strongly magnified by, plasma lenses.
Submitted 23 May, 2018;
originally announced May 2018.
-
DeepCache: Principled Cache for Mobile Deep Vision
Authors:
Mengwei Xu,
Mengze Zhu,
Yunxin Liu,
Felix Xiaozhu Lin,
Xuanzhe Liu
Abstract:
We present DeepCache, a principled cache design for deep learning inference in continuous mobile vision. DeepCache benefits model execution efficiency by exploiting temporal locality in input video streams. It addresses a key challenge raised by mobile vision: the cache must operate under video scene variation, while trading off among cacheability, overhead, and loss in model accuracy. At the input of a model, DeepCache discovers video temporal locality by exploiting the video's internal structure, for which it borrows proven heuristics from video compression; into the model, DeepCache propagates regions of reusable results by exploiting the model's internal structure. Notably, DeepCache eschews applying video heuristics to model internals which are not pixels but high-dimensional, difficult-to-interpret data. Our implementation of DeepCache works with unmodified deep learning models, requires zero developer's manual effort, and is therefore immediately deployable on off-the-shelf mobile devices. Our experiments show that DeepCache saves inference execution time by 18% on average and up to 47%. DeepCache reduces system energy consumption by 20% on average.
Submitted 30 March, 2020; v1 submitted 1 December, 2017;
originally announced December 2017.
-
Draining our Glass: An Energy and Heat Characterization of Google Glass
Authors:
Robert LiKamWa,
Zhen Wang,
Aaron Carroll,
Felix Xiaozhu Lin,
Lin Zhong
Abstract:
The Google Glass is a mobile device designed to be worn as eyeglasses. This form factor enables new usage possibilities, such as hands-free video chats and instant web search. However, its shape also hampers its potential: (1) battery size, and therefore lifetime, is limited by a need for the device to be lightweight, and (2) high-power processing leads to significant heat, which should be limited, due to the Glass' compact form factor and close proximity to the user's skin. We use the Glass in a case study of the power and thermal characteristics of optical head-mounted display devices. We share insights and implications to limit power consumption to increase the safety and utility of head-mounted devices.
Submitted 26 March, 2014;
originally announced April 2014.
-
Guadalupe: a browser design for heterogeneous hardware
Authors:
Zhen Wang,
Felix Xiaozhu Lin,
Lin Zhong,
Mansoor Chishtie
Abstract:
Mobile systems are embracing heterogeneous architectures by getting more types of cores and more specialized cores, which allows applications to be faster and more efficient. We aim at exploiting the hardware heterogeneity from the browser without requiring any changes to either the OS or the web applications. Our design, Guadalupe, can use hardware processing units with different degrees of capability for matched browser services. It starts with a weak hardware unit, determines if and when a strong unit is needed, and seamlessly migrates to the strong one when necessary. Guadalupe not only makes more computing resources available to mobile web browsing but also improves its energy proportionality. Based on Chrome for Android and TI OMAP4, we provide a prototype browser implementation for resource loading and rendering. Compared to Chrome for Android, we show that the Guadalupe browser for rendering can increase other 3D applications' frame rate by up to 767% and save 4.7% of the entire system's energy consumption. More importantly, by using the two cases, we demonstrate that Guadalupe creates a great opportunity for many browser services to get better resource utilization and energy proportionality by exploiting hardware heterogeneity.
Submitted 19 December, 2012;
originally announced December 2012.
-
How Far Can Client-Only Solutions Go for Mobile Browser Speed?
Authors:
Zhen Wang,
Felix Xiaozhu Lin,
Lin Zhong,
Mansoor Chishtie
Abstract:
Mobile browsers are known to be slow because of the bottleneck in resource loading. Client-only solutions to improve resource loading are attractive because they are immediately deployable, scalable, and secure. We present the first publicly known treatment of client-only solutions to understand how much they can improve mobile browser speed without infrastructure support. Leveraging an unprecedented set of web usage data collected from 24 iPhone users continuously over one year, we examine the three fundamental, orthogonal approaches a client-only solution can take: caching, prefetching, and speculative loading, which is first proposed and studied in this work. Speculative loading predicts and speculatively loads the subresources needed to open a web page once its URL is given. We show that while caching and prefetching are highly limited for mobile browsing, speculative loading can be significantly more effective. Empirically, we show that client-only solutions can improve the browser speed by about 1.4 seconds on average for web sites visited by the 24 iPhone users. We also report the design, realization, and evaluation of speculative loading in a WebKit-based browser called Tempo. On average, Tempo can reduce browser delay by 1 second (~20%).
Submitted 15 December, 2011;
originally announced December 2011.
-
Transparent Programming of Heterogeneous Smartphones for Sensing
Authors:
Felix Xiaozhu Lin,
Zhen Wang,
Robert LiKamWa,
Lin Zhong
Abstract:
Sensing on smartphones is known to be power-hungry. It has been shown that this problem can be solved by adding an ultra low-power processor to execute simple, frequent sensor data processing. While very effective in saving energy, this resulting heterogeneous, distributed architecture poses a significant challenge to application development.
We present Reflex, a suite of runtime and compilation techniques to conceal the heterogeneous, distributed nature from developers. Reflex automatically transforms the developer's code for distributed execution with the help of the Reflex runtime. To create a unified system illusion, Reflex features a novel software distributed shared memory (DSM) design that leverages the extreme architectural asymmetry between the low-power processor and the powerful central processor to achieve both energy efficiency and performance.
We report a complete realization of Reflex for heterogeneous smartphones with Maemo/Linux as the central kernel. Using a tri-processor hardware prototype and sensing applications reported in recent literature, we evaluate the Reflex realization for programming transparency, energy efficiency, and performance. We show that Reflex supports a programming style that is very close to contemporary smartphone programming. It allows existing sensing applications to be ported with minor source code changes. Reflex reduces the system power in sensing by up to 83%, and its runtime system consumes only 10% of the local memory on a typical ultra-low-power processor.
Submitted 11 March, 2011;
originally announced March 2011.