Operating Systems (cs.OS)

Design and demonstration of an operating system for executing applications on quantum network nodes
Carlo Delle Donne, Mariagrazia Iuliano, Bart van der Vecht, Guilherme Maciel Ferreira, Hana Jirovská, Thom van der Steenhoven, Axel Dahlberg, Matt Skrzypczyk, Dario Fioretto, Markus Teller, Pavel Filippov, Alejandro Rodríguez-Pardo Montblanch, Julius Fischer, Benjamin van Ommen, Nicolas Demetriou, Dominik Leichtle, Luka Music, Harold Ollivier, Ingmar te Raa, Wojciech Kozlowski, Tim Taminiau, Przemysław Pawełczak, Tracy Northup, Ronald Hanson, Stephanie Wehner
Jul 29 2024 quant-ph cs.NI cs.OS arXiv:2407.18306v1

@misc{2407.18306, author = {Carlo Delle Donne and Mariagrazia Iuliano and Bart van der Vecht and Guilherme Maciel Ferreira and Hana Jirovská and Thom van der Steenhoven and Axel Dahlberg and Matt Skrzypczyk and Dario Fioretto and Markus Teller and Pavel Filippov and Alejandro Rodríguez-Pardo Montblanch and Julius Fischer and Benjamin van Ommen and Nicolas Demetriou and Dominik Leichtle and Luka Music and Harold Ollivier and Ingmar te Raa and Wojciech Kozlowski and Tim Taminiau and Przemysław Pawełczak and Tracy Northup and Ronald Hanson and Stephanie Wehner}, title = {{D}esign and demonstration of an operating system for executing applications on quantum network nodes}, year = {2024}, eprint = {2407.18306}, note = {arXiv:2407.18306v1} }
The goal of future quantum networks is to enable new internet applications that are impossible to achieve using solely classical communication. Up to now, demonstrations of quantum network applications and functionalities on quantum processors have been performed in ad-hoc software that was specific to the experimental setup, programmed to perform one single task (the application experiment) directly into low-level control devices using expertise in experimental physics. Here, we report on the design and implementation of the first architecture capable of executing quantum network applications on quantum processors in platform-independent high-level software. We demonstrate the architecture's capability to execute applications in high-level software, by implementing it as a quantum network operating system -- QNodeOS -- and executing test programs including a delegated computation from a client to a server on two quantum network nodes based on nitrogen-vacancy (NV) centers in diamond. We show how our architecture allows us to maximize the use of quantum network hardware, by multitasking different applications on a quantum network for the first time. Our architecture can be used to execute programs on any quantum processor platform corresponding to our system model, which we illustrate by demonstrating an additional driver for QNodeOS for a trapped-ion quantum network node based on a single $^{40}\text{Ca}^+$ atom. Our architecture lays the groundwork for computer science research in the domain of quantum network programming, and paves the way for the development of software that can bring quantum network technology to society.
Transparent and Efficient Live Migration across Heterogeneous Hosts with Wharf
Yiwei Yang, Aibo Hu, Yusheng Zheng, Brian Zhao, Xinqi Zhang, Andrew Quinn
Oct 22 2024 cs.OS arXiv:2410.15894v1

@misc{2410.15894, author = {Yiwei Yang and Aibo Hu and Yusheng Zheng and Brian Zhao and Xinqi Zhang and Andrew Quinn}, title = {{T}ransparent and {E}fficient {L}ive {M}igration across {H}eterogeneous {H}osts with {W}harf}, year = {2024}, eprint = {2410.15894}, note = {arXiv:2410.15894v1} }
Live migration allows a user to move a running application from one machine (a source) to another (a destination) without restarting it. The technique has proven useful for diverse tasks including load balancing, managing system updates, improving data locality, and improving system resilience. Unfortunately, current live migration solutions fail to meet today's computing needs. First, most techniques do not support heterogeneous source and destination hosts, as they require the two machines to have the same instruction set architecture (ISA) or use the same operating system (OS), which hampers numerous live migration use cases. Second, many techniques are not transparent, as they require that applications be written in a specific high-level language or call specific library functions, which imposes barriers to entry for many users. We present a new lightweight abstraction, called a vessel, that supports transparent heterogeneous live migration. A vessel maintains a machine-independent encoding of a process's state, using WebAssembly abstractions, allowing it to be executed on nearly arbitrary ISAs. A vessel virtualizes all of its OS state, using the WebAssembly System Interface (WASI), allowing it to execute on nearly arbitrary OSes. We introduce docks and software systems that execute and migrate vessels. Docks face two key challenges: First, maintaining a machine-independent encoding at all points in a process is extremely expensive. So, docks instead ensure that a vessel is guaranteed to eventually reach a machine-independent point and delay the initiation of vessel migration until the vessel reaches such a point. Second, a dock may receive a vessel migration that originates from a dock executing on a different OS.
Reinforcement Learning for Dynamic Memory Allocation
Arisrei Lim, Abhiram Maddukuri
Oct 22 2024 cs.LG cs.OS arXiv:2410.15492v1

@misc{2410.15492, author = {Arisrei Lim and Abhiram Maddukuri}, title = {{R}einforcement {L}earning for {D}ynamic {M}emory {A}llocation}, year = {2024}, eprint = {2410.15492}, note = {arXiv:2410.15492v1} }
In recent years, reinforcement learning (RL) has gained popularity and has been applied to a wide range of tasks. One such popular domain where RL has been effective is resource management problems in systems. We look to extend work on RL for resource management problems by considering the novel domain of dynamic memory allocation management. We consider dynamic memory allocation to be a suitable domain for RL since current algorithms like first-fit, best-fit, and worst-fit can fail to adapt to changing conditions and can lead to fragmentation and suboptimal efficiency. In this paper, we present a framework in which an RL agent continuously learns from interactions with the system to improve memory management tactics. We evaluate our approach through various experiments using high-level and low-level action spaces and examine different memory allocation patterns. Our results show that RL can successfully train agents that can match and surpass traditional allocation strategies, particularly in environments characterized by adversarial request patterns. We also explore the potential of history-aware policies that leverage previous allocation requests to enhance the allocator's ability to handle complex request patterns. Overall, we find that RL offers a promising avenue for developing more adaptive and efficient memory allocation strategies, potentially overcoming limitations of hardcoded allocation algorithms.
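For readers unfamiliar with the baseline allocators named in this abstract, the following is a minimal sketch of first-fit, best-fit, and worst-fit placement over a free list. It is not taken from the paper: the free-list representation and the names `choose_block` and `allocate` are assumptions made purely for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: a free block is (start_address, size).
FreeBlock = Tuple[int, int]


def choose_block(free_list: List[FreeBlock], size: int, strategy: str) -> Optional[int]:
    """Return the index of the free block picked for a request, or None if none fits."""
    candidates = [(i, blk) for i, blk in enumerate(free_list) if blk[1] >= size]
    if not candidates:
        return None
    if strategy == "first-fit":
        # First block (in list order) that is large enough.
        return candidates[0][0]
    if strategy == "best-fit":
        # Smallest block that is large enough.
        return min(candidates, key=lambda c: c[1][1])[0]
    if strategy == "worst-fit":
        # Largest block available.
        return max(candidates, key=lambda c: c[1][1])[0]
    raise ValueError(f"unknown strategy: {strategy}")


def allocate(free_list: List[FreeBlock], size: int, strategy: str) -> Optional[int]:
    """Carve `size` units out of the chosen block; return the start address or None."""
    idx = choose_block(free_list, size, strategy)
    if idx is None:
        return None
    start, blk_size = free_list[idx]
    if blk_size == size:
        del free_list[idx]  # block fully consumed
    else:
        free_list[idx] = (start + size, blk_size - size)  # keep the remainder
    return start


if __name__ == "__main__":
    for strategy in ("first-fit", "best-fit", "worst-fit"):
        blocks = [(0, 8), (16, 4), (32, 16)]
        print(strategy, allocate(blocks, 4, strategy), blocks)
```

Each strategy is a fixed rule over the current free list, which is the sense in which such allocators cannot adapt to changing request patterns.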
Optimizing over FP/EDF Execution Times: Known Results and Open Problems
Enrico Bini
Oct 21 2024 cs.OS arXiv:2410.14381v2

@misc{2410.14381, author = {Enrico Bini}, title = {{O}ptimizing over {FP}/{EDF} {E}xecution {T}imes: {K}nown {R}esults and {O}pen {P}roblems}, year = {2024}, eprint = {2410.14381}, note = {arXiv:2410.14381v2} }
In many use cases the execution time of tasks is unknown and can be chosen by the designer to increase or decrease the application features depending on the availability of processing capacity. If the application has real-time constraints, such as deadlines, then the necessary and sufficient schedulability test must allow the execution times to be left unspecified. By doing so, the designer can then perform optimization of the execution times by picking the schedulable values that minimize any given cost. In this paper, we review existing results on the formulation of both the Fixed Priority and Earliest Deadline First exact schedulability constraints. The reviewed formulations are expressed by a combination of linear constraints, which in turn enables the use of optimization routines.
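To make the phrase "a combination of linear constraints" concrete, here is a hedged sketch of one classical exact test for Fixed Priority scheduling, time-demand analysis over a set of scheduling points. It is a textbook formulation and not necessarily the one reviewed in the paper. The sketch assumes integer parameters, tasks listed in decreasing priority order, and deadlines no larger than periods; the function names are illustrative.

```python
from typing import List, Sequence, Set


def scheduling_points(hp_periods: Sequence[int], deadline: int) -> List[int]:
    """Multiples of higher-priority periods up to the deadline, plus the deadline itself."""
    points: Set[int] = {deadline}
    for period in hp_periods:
        points.update(range(period, deadline + 1, period))
    return sorted(points)


def fp_schedulable(C: Sequence[int], T: Sequence[int], D: Sequence[int]) -> bool:
    """Time-demand test: task i passes if, for SOME point t,
    C[i] + sum_j ceil(t / T[j]) * C[j] <= t  over higher-priority tasks j.
    For a fixed t the ceilings are constants, so each inequality is linear in C."""
    for i in range(len(C)):
        feasible = False
        for t in scheduling_points(T[:i], D[i]):
            # (t + T[j] - 1) // T[j] is an exact integer ceiling of t / T[j].
            demand = C[i] + sum(((t + T[j] - 1) // T[j]) * C[j] for j in range(i))
            if demand <= t:
                feasible = True
                break
        if not feasible:
            return False
    return True


if __name__ == "__main__":
    periods = [4, 6, 12]
    print(fp_schedulable([1, 2, 3], periods, periods))
    print(fp_schedulable([1, 2, 6], periods, periods))
```

Because the execution times `C` enter only through linear inequalities, the feasible region is a union of polyhedra, and that structure is what lets a designer optimize a cost over the execution times.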
FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training
Tianyuan Wu, Wei Wang, Yinghao Yu, Siran Yang, Wenchao Wu, Qinkai Duan, Guodong Yang, Jiamang Wang, Lin Qu, Liping Zhang
Oct 17 2024 cs.DC cs.OS arXiv:2410.12588v1

@misc{2410.12588, author = {Tianyuan Wu and Wei Wang and Yinghao Yu and Siran Yang and Wenchao Wu and Qinkai Duan and Guodong Yang and Jiamang Wang and Lin Qu and Liping Zhang}, title = {{FALCON}: {P}inpointing and {M}itigating {S}tragglers for {L}arge-{S}cale {H}ybrid-{P}arallel {T}raining}, year = {2024}, eprint = {2410.12588}, note = {arXiv:2410.12588v1} }
Fail-slows, or stragglers, are common but largely unheeded problems in large-scale hybrid-parallel training that spans thousands of GPU servers and runs for weeks to months. Yet, these problems are not well studied, nor can they be quickly detected and effectively mitigated. In this paper, we first present a characterization study on a shared production cluster with over 10,000 GPUs. We find that fail-slows are caused by various CPU/GPU computation and cross-node networking issues, lasting from tens of seconds to nearly ten hours, and collectively delaying the average job completion time by 1.34%. The current practice is to manually detect these fail-slows and simply treat them as fail-stops using a checkpoint-and-restart failover approach, which is labor-intensive and time-consuming. In this paper, we propose FALCON, a framework that rapidly identifies fail-slowed GPUs and/or communication links, and effectively tackles them with a novel multi-level mitigation mechanism, all without human intervention. We have applied FALCON to detect human-labeled fail-slows in a production cluster with over 99% accuracy. Cluster deployment further demonstrates that FALCON effectively handles manually injected fail-slows, mitigating the training slowdown by 60.1%.
AsyncFS: Metadata Updates Made Asynchronous for Distributed Filesystems with In-Network Coordination
Jingwei Xu, Mingkai Dong, Qiulin Tian, Ziyi Tian, Tong Xin, Haibo Chen
Oct 14 2024 cs.DC cs.OS cs.PF arXiv:2410.08618v1

@misc{2410.08618, author = {Jingwei Xu and Mingkai Dong and Qiulin Tian and Ziyi Tian and Tong Xin and Haibo Chen}, title = {{A}sync{FS}: {M}etadata {U}pdates {M}ade {A}synchronous for {D}istributed {F}ilesystems with {I}n-{N}etwork {C}oordination}, year = {2024}, eprint = {2410.08618}, note = {arXiv:2410.08618v1} }
Distributed filesystems typically employ synchronous metadata updates, facing inherent challenges for access efficiency, load balancing, and directory contention, especially under dynamic and skewed workloads. This paper argues that synchronous updates are overly conservative for distributed filesystems. We propose AsyncFS with asynchronous metadata updates, allowing operations to return early and defer directory updates until respective read to enable latency hiding and conflict resolution. The key challenge is efficiently maintaining the synchronous semantics of metadata updates. To address this, AsyncFS is co-designed with a programmable switch, leveraging the constrained on-switch resources to holistically track directory states in the network with negligible cost. This allows AsyncFS to timely aggregate and efficiently apply delayed updates using batching and consolidation before directory reads. Evaluation shows that AsyncFS achieves up to 13.34$\times$ and 3.85$\times$ higher throughput, and 61.6% and 57.3% lower latency than two state-of-the-art distributed filesystems, InfiniFS and CFS-KV, respectively, on skewed workloads. For real-world workloads, AsyncFS improves end-to-end throughput by 21.1$\times$, 1.1$\times$ and 30.1% over Ceph, IndexFS and CFS-KV, respectively.
SoK: Software Compartmentalization
Hugo Lefeuvre, Nathan Dautenhahn, David Chisnall, Pierre Olivier
Oct 14 2024 cs.CR cs.OS arXiv:2410.08434v1

@misc{2410.08434, author = {Hugo Lefeuvre and Nathan Dautenhahn and David Chisnall and Pierre Olivier}, title = {{S}o{K}: {S}oftware {C}ompartmentalization}, year = {2024}, eprint = {2410.08434}, note = {arXiv:2410.08434v1} }
Decomposing large systems into smaller components with limited privileges has long been recognized as an effective means to minimize the impact of exploits. Despite historical roots, demonstrated benefits, and a plethora of research efforts in academia and industry, the compartmentalization of software is still not a mainstream practice. This paper investigates why, and how this status quo can be improved. Noting that existing approaches are fraught with inconsistencies in terminology and analytical methods, we propose a unified model for the systematic analysis, comparison, and directing of compartmentalization approaches. We use this model to review 211 research efforts and analyze 61 mainstream compartmentalized systems, confronting them to understand the limitations of both research and production works. Among others, our findings reveal that mainstream efforts largely rely on manual methods, custom abstractions, and legacy mechanisms, poles apart from recent research. We conclude with recommendations: compartmentalization should be solved holistically; progress is needed towards simplifying the definition of compartmentalization policies; towards better challenging our threat models in the light of confused deputies and hardware limitations; as well as towards bridging the gaps we pinpoint between research and mainstream needs. This paper not only maps the historical and current landscape of compartmentalization, but also sets forth a framework to foster their evolution and adoption.
Overcoming Autoware-Ubuntu Incompatibility in Autonomous Driving Systems-Equipped Vehicles: Lessons Learned
Dada Zhang, Md Ruman Islam, Pei-Chi Huang, Chun-Hsing Ho
Oct 10 2024 cs.RO cs.OS cs.SE arXiv:2410.06492v1

@misc{2410.06492, author = {Dada Zhang and Md Ruman Islam and Pei-Chi Huang and Chun-Hsing Ho}, title = {{O}vercoming {A}utoware-{U}buntu {I}ncompatibility in {A}utonomous {D}riving {S}ystems-{E}quipped {V}ehicles: {L}essons {L}earned}, year = {2024}, eprint = {2410.06492}, note = {arXiv:2410.06492v1} }
Autonomous vehicles have been rapidly developed in response to demand for safety and efficiency in transportation systems. As autonomous vehicles are designed based on open-source operating and computing systems, there are numerous resources aimed at building an operating platform composed of Ubuntu, Autoware, and Robot Operating System (ROS). However, no explicit guidelines exist to help scholars perform trouble-shooting due to incompatibility between the Autoware platform and Ubuntu operating systems installed in autonomous driving systems-equipped vehicles (i.e., Chrysler Pacifica). The paper presents an overview of integrating the Autoware platform into the autonomous vehicle's interface based on lessons learned from trouble-shooting processes for resolving incompatibility issues. The trouble-shooting processes are presented based on resolving the incompatibility and integration issues of Ubuntu 20.04, Autoware.AI, and ROS Noetic software installed in an autonomous driving systems-equipped vehicle. Specifically, the paper focuses on common incompatibility issues and code-solving protocols involving Python compatibility, Compute Unified Device Architecture (CUDA) installation, Autoware installation, and simulation in Autoware.AI. The objective of the paper is to provide an explicit and detail-oriented presentation to showcase how to address incompatibility issues in an autonomous vehicle's operating interface. The lessons and experience presented in the paper will be useful for researchers who encounter similar issues and can follow up by performing trouble-shooting activities and implementing ADS-related projects in the Ubuntu, Autoware, and ROS operating systems.
Serverless Cold Starts and Where to Find Them
Artjom Joosen, Ahmed Hassan, Martin Asenov, Rajkarn Singh, Luke Darlow, Jianfeng Wang, Qiwen Deng, Adam Barker
Oct 10 2024 cs.DC cs.OS cs.PF arXiv:2410.06145v1

@misc{2410.06145, author = {Artjom Joosen and Ahmed Hassan and Martin Asenov and Rajkarn Singh and Luke Darlow and Jianfeng Wang and Qiwen Deng and Adam Barker}, title = {{S}erverless {C}old {S}tarts and {W}here to {F}ind {T}hem}, year = {2024}, eprint = {2410.06145}, note = {arXiv:2410.06145v1} }
This paper releases and analyzes a month-long trace of 85 billion user requests and 11.9 million cold starts from Huawei's serverless cloud platform. Our analysis spans workloads from five data centers. We focus on cold starts and provide a comprehensive examination of the underlying factors influencing the number and duration of cold starts. These factors include trigger types, request synchronicity, runtime languages, and function resource allocations. We investigate components of cold starts, including pod allocation time, code and dependency deployment time, and scheduling delays, and examine their relationships with runtime languages, trigger types, and resource allocation. We introduce pod utility ratio to measure the pod's useful lifetime relative to its cold start time, giving a more complete picture of cold starts, and see that some pods with long cold start times have longer useful lifetimes. Our findings reveal the complexity and multifaceted origins of the number, duration, and characteristics of cold starts, driven by differences in trigger types, runtime languages, and function resource allocations. For example, cold starts in Region 1 take up to 7 seconds, dominated by dependency deployment time and scheduling. In Region 2, cold starts take up to 3 seconds and are dominated by pod allocation time. Based on this, we identify opportunities to reduce the number and duration of cold starts using strategies for multi-region scheduling. Finally, we suggest directions for future research to address these challenges and enhance the performance of serverless cloud platforms. Our datasets and code are available here https://github.com/sir-lab/data-release
Global Scheduling of Weakly-Hard Real-Time Tasks using Job-Level Priority Classes
V. Gabriel Moyano, Zain A. H. Hammadeh, Selma Saidi, Daniel Lüdtke
Oct 03 2024 cs.OS arXiv:2410.01528v1

@misc{2410.01528, author = {V.~Gabriel Moyano and Zain A.~H.~Hammadeh and Selma Saidi and Daniel Lüdtke}, title = {{G}lobal {S}cheduling of {W}eakly-{H}ard {R}eal-{T}ime {T}asks using {J}ob-{L}evel {P}riority {C}lasses}, year = {2024}, eprint = {2410.01528}, note = {arXiv:2410.01528v1} }
Real-time systems are intrinsic components of many pivotal applications, such as self-driving vehicles, aerospace and defense systems. The trend in these applications is to incorporate multiple tasks onto fewer, more powerful hardware platforms, e.g., multi-core systems, mainly for reducing cost and power consumption. Many real-time tasks, like control tasks, can tolerate occasional deadline misses due to robust algorithms. These tasks can be modeled using the weakly-hard model. Literature shows that leveraging the weakly-hard model can relax the over-provisioning associated with designed real-time systems. However, a wide range of the research focuses on single-core platforms. Therefore, we strive to extend the state-of-the-art of scheduling weakly-hard real-time tasks to multi-core platforms. We present a global job-level fixed priority scheduling algorithm together with its schedulability analysis. The scheduling algorithm leverages the tolerable continuous deadline misses to assign priorities to jobs. The proposed analysis extends the Response Time Analysis (RTA) for global scheduling to test the schedulability of tasks. Hence, our analysis scales with the number of tasks and number of cores because, unlike literature, it depends neither on Integer Linear Programming nor reachability trees. Schedulability analyses show that the schedulability ratio is improved by 40% compared to the global Rate Monotonic (RM) scheduling and up to 60% more than the global EDF scheduling, which are the state-of-the-art schedulers on the RTEMS real-time operating system. Our evaluation on an industrial embedded multi-core platform running RTEMS shows that the scheduling overhead of our proposal does not exceed 60 nanoseconds.
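As background for the weakly-hard model mentioned in this abstract, the sketch below checks a recorded sequence of job outcomes against a common formulation of a weakly-hard constraint: at most `m` deadline misses in any window of `k` consecutive jobs. The abstract does not state which constraint form the paper uses, so this form and the helper names are assumptions for illustration only.

```python
from typing import Sequence


def satisfies_m_k(misses: Sequence[bool], m: int, k: int) -> bool:
    """True if every window of k consecutive jobs contains at most m deadline misses.
    misses[i] is True when job i missed its deadline."""
    if len(misses) < k:
        return sum(misses) <= m
    window = sum(misses[:k])  # misses in the first full window
    if window > m:
        return False
    for i in range(k, len(misses)):
        # Slide the window by one job: add the newest outcome, drop the oldest.
        window += int(misses[i]) - int(misses[i - k])
        if window > m:
            return False
    return True


def longest_miss_run(misses: Sequence[bool]) -> int:
    """Length of the longest run of consecutive deadline misses."""
    best = current = 0
    for missed in misses:
        current = current + 1 if missed else 0
        best = max(best, current)
    return best


if __name__ == "__main__":
    trace = [False, True, False, False, True, True, False]
    print(satisfies_m_k(trace, m=2, k=3), longest_miss_run(trace))
```

The second helper reports the longest run of consecutive misses, the quantity a scheduler would compare against a task's tolerable number of consecutive misses.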
The eBPF Runtime in the Linux Kernel
Bolaji Gbadamosi, Luigi Leonardi, Tobias Pulls, Toke Høiland-Jørgensen, Simone Ferlin-Reiter, Simo Sorce, Anna Brunström
Oct 02 2024 cs.OS cs.CE arXiv:2410.00026v2

@misc{2410.00026, author = {Bolaji Gbadamosi and Luigi Leonardi and Tobias Pulls and Toke Høiland-Jørgensen and Simone Ferlin-Reiter and Simo Sorce and Anna Brunström}, title = {{T}he e{BPF} {R}untime in the {L}inux {K}ernel}, year = {2024}, eprint = {2410.00026}, note = {arXiv:2410.00026v2} }
Extended Berkeley Packet Filter (eBPF) is a runtime that enables users to load programs into the operating system (OS) kernel, like Linux or Windows, and execute them safely and efficiently at designated kernel hooks. Each program passes through a verifier that reasons about the safety guarantees for execution. Hosting a safe virtual machine runtime within the kernel makes it dynamically programmable. Unlike the popular approach of bypassing or completely replacing the kernel, eBPF gives users the flexibility to modify the kernel on the fly, rapidly experiment and iterate, and deploy solutions to achieve their workload-specific needs, while working in concert with the kernel. In this paper, we present the first comprehensive description of the design and implementation of the eBPF runtime in the Linux kernel. We argue that eBPF today provides a mature and safe programming environment for the kernel. It has seen wide adoption since its inception and is increasingly being used not just to extend, but program entire components of the kernel, while preserving its runtime integrity. We outline the compelling advantages it offers for real-world production usage, and illustrate current use cases. Finally, we identify its key challenges, and discuss possible future directions.
Energy-Efficient Computation with DVFS using Deep Reinforcement Learning for Multi-Task Systems in Edge Computing
Xinyi Li, Ti Zhou, Haoyu Wang, Man Lin
Oct 01 2024 cs.OS cs.LG arXiv:2409.19434v2

@misc{2409.19434, author = {Xinyi Li and Ti Zhou and Haoyu Wang and Man Lin}, title = {{E}nergy-{E}fficient {C}omputation with {DVFS} using {D}eep {R}einforcement {L}earning for {M}ulti-{T}ask {S}ystems in {E}dge {C}omputing}, year = {2024}, eprint = {2409.19434}, note = {arXiv:2409.19434v2} }
Periodic soft real-time systems have broad applications in many areas, such as IoT. Finding an optimal energy-efficient policy that is adaptable to underlying edge devices while meeting deadlines for tasks has always been challenging. This research studies generalized systems with multi-task, multi-deadline scenarios with reinforcement learning-based DVFS for energy saving. This work addresses the limitation of previous work that models a periodic system as a single-task, single-deadline scenario, which is too simplified to cope with complex situations. The method encodes time series information in the Linux kernel into information that is easy to use for reinforcement learning, allowing the system to generate DVFS policies that adapt to system patterns based on the general workload. For encoding, we present two different methods for comparison. Both methods use only one performance counter, system utilization, and the kernel needs only minimal information from the userspace. Our method is implemented on a Jetson Nano Board (2GB) and is tested with three fixed multitask workloads, consisting of three, five, and eight tasks, respectively. For randomness and generalization, we also designed a random workload generator to build different multitask workloads for testing. Based on the test results, our method could save 3%-10% power compared to Linux built-in governors.
Exploring Time-Space trade-offs for synchronized in Lilliput
Dave Dice, Alex Kogan
Sep 30 2024 cs.OS arXiv:2409.18342v1

@misc{2409.18342, author = {Dave Dice and Alex Kogan}, title = {{E}xploring {T}ime-{S}pace trade-offs for synchronized in {L}illiput}, year = {2024}, eprint = {2409.18342}, note = {arXiv:2409.18342v1} }
In the context of Project Lilliput, which attempts to reduce the size of the object header in the HotSpot Java Virtual Machine (JVM), we explore a curated set of synchronization algorithms. Each of the algorithms could serve as a potential replacement implementation for the "synchronized" construct in HotSpot. Collectively, the algorithms illuminate trade-offs in space-time properties. The key design decisions are where to locate synchronization metadata (monitor fields), how to map from an object to those fields, and the lifecycle of the monitor information. The reader is assumed to be familiar with the current HotSpot implementation of "synchronized" as well as the Compact Java Monitors (CJM) design and Project Lilliput.
FusionANNS: An Efficient CPU/GPU Cooperative Processing Architecture for Billion-scale Approximate Nearest Neighbor Search
Bing Tian, Haikun Liu, Yuhang Tang, Shihai Xiao, Zhuohui Duan, Xiaofei Liao, Xuecang Zhang, Junhua Zhu, Yu Zhang
Sep 26 2024 cs.IR cs.DB cs.OS arXiv:2409.16576v1

@misc{2409.16576, author = {Bing Tian and Haikun Liu and Yuhang Tang and Shihai Xiao and Zhuohui Duan and Xiaofei Liao and Xuecang Zhang and Junhua Zhu and Yu Zhang}, title = {{F}usion{ANNS}: {A}n {E}fficient {CPU}/{GPU} {C}ooperative {P}rocessing {A}rchitecture for {B}illion-scale {A}pproximate {N}earest {N}eighbor {S}earch}, year = {2024}, eprint = {2409.16576}, note = {arXiv:2409.16576v1} }
Approximate nearest neighbor search (ANNS) has emerged as a crucial component of database and AI infrastructure. Ever-increasing vector datasets pose significant challenges in terms of performance, cost, and accuracy for ANNS services. No modern ANNS system can address these issues simultaneously. We present FusionANNS, a high-throughput, low-latency, cost-efficient, and high-accuracy ANNS system for billion-scale datasets using SSDs and only one entry-level GPU. The key idea of FusionANNS lies in CPU/GPU collaborative filtering and re-ranking mechanisms, which significantly reduce I/O operations across CPUs, GPU, and SSDs to break through the I/O performance bottleneck. Specifically, we propose three novel designs: (1) multi-tiered indexing to avoid data swapping between CPUs and GPU, (2) heuristic re-ranking to eliminate unnecessary I/Os and computations while guaranteeing high accuracy, and (3) redundant-aware I/O deduplication to further improve I/O efficiency. We implement FusionANNS and compare it with the state-of-the-art SSD-based ANNS system--SPANN and GPU-accelerated in-memory ANNS system--RUMMY. Experimental results show that FusionANNS achieves 1) 9.4-13.1X higher queries per second (QPS) and 5.7-8.8X higher cost efficiency compared with SPANN; and 2) 2-4.9X higher QPS and 2.3-6.8X higher cost efficiency compared with RUMMY, while guaranteeing low latency and high accuracy.
Assessing FIFO and Round Robin Scheduling:Effects on Data Pipeline Performance and Energy Usage
Malobika Roy Choudhury, Akshat Mehrotra
Sep 25 2024 cs.OS arXiv:2409.15704v1

@misc{2409.15704, author = {Malobika Roy Choudhury and Akshat Mehrotra}, title = {{A}ssessing {FIFO} and {R}ound {R}obin {S}cheduling:{E}ffects on {D}ata {P}ipeline {P}erformance and {E}nergy {U}sage}, year = {2024}, eprint = {2409.15704}, note = {arXiv:2409.15704v1} }
In the case of compute-intensive machine learning, efficient operating system scheduling is crucial for performance and energy efficiency. This paper conducts a comparative study of the FIFO (First-In-First-Out) and RR (Round-Robin) scheduling policies, applied to real-time machine learning training processes and data pipelines on Ubuntu-based systems. Based on observed patterns of CPU usage and energy consumption, we identify which policy (the exclusive or the shared) provides higher performance and/or lower energy consumption for typical modern workloads. The results of this study can help in designing better operating system schedulers for modern systems like Ubuntu, improving performance and reducing energy consumption in compute-intensive workloads.
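The difference between the two policies can be illustrated with a toy single-CPU model in which all jobs arrive at time 0. This sketch is not the paper's experimental setup and does not model the Linux `SCHED_FIFO` or `SCHED_RR` implementations. The burst lengths, the quantum, and the function names are hypothetical.

```python
from collections import deque
from typing import List


def fifo_completion(bursts: List[int]) -> List[int]:
    """Each job runs to completion in arrival order (exclusive use of the CPU)."""
    t = 0
    done = []
    for burst in bursts:
        t += burst
        done.append(t)
    return done


def rr_completion(bursts: List[int], quantum: int) -> List[int]:
    """Jobs share the CPU in fixed time slices of length `quantum`."""
    if quantum <= 0:
        raise ValueError("quantum must be positive")
    remaining = list(bursts)
    done = [0] * len(bursts)
    queue = deque(range(len(bursts)))
    t = 0
    while queue:
        i = queue.popleft()
        run = min(quantum, remaining[i])
        t += run
        remaining[i] -= run
        if remaining[i] > 0:
            queue.append(i)  # not finished: back to the tail of the queue
        else:
            done[i] = t  # record completion time
    return done


if __name__ == "__main__":
    bursts = [5, 3, 1]  # hypothetical CPU bursts, in arbitrary time units
    print("FIFO:", fifo_completion(bursts))
    print("RR:  ", rr_completion(bursts, quantum=2))
```

In this model the total CPU work is identical under both policies, and only the order in which jobs finish changes. Context-switch overhead and energy effects, which the paper measures, are outside the sketch.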
Dissecting CXL Memory Performance at Scale: Analysis, Modeling, and Optimization
Jinshu Liu, Hamid Hadian, Hanchen Xu, Daniel S. Berger, Huaicheng Li
Sep 24 2024 cs.OS arXiv:2409.14317v1

@misc{2409.14317, author = {Jinshu Liu and Hamid Hadian and Hanchen Xu and Daniel S.~Berger and Huaicheng Li}, title = {{D}issecting {CXL} {M}emory {P}erformance at {S}cale: {A}nalysis, {M}odeling, and {O}ptimization}, year = {2024}, eprint = {2409.14317}, note = {arXiv:2409.14317v1} }
We present SupMario, a characterization framework designed to thoroughly analyze, model, and optimize CXL memory performance. SupMario is based on extensive evaluation of 265 workloads spanning 4 real CXL devices within 7 memory latency configurations across 4 processor platforms. SupMario uncovers many key insights, including detailed workload performance at sub-us memory latencies (140-410 ns), CXL tail latencies, CPU tolerance to CXL latencies, CXL performance root-cause analysis and precise performance prediction models. In particular, SupMario performance models rely solely on 12 CPU performance counters and accurately fit over 99% and 91%-94% of workloads with a 10% misprediction target for NUMA and CXL memory, respectively. We demonstrate the practical utility of SupMario characterization findings, models, and insights by applying them to popular CXL memory management schemes, such as page interleaving and tiering policies, to identify system inefficiencies during runtime. We introduce a novel "best-shot" page interleaving policy and a regulated page tiering policy (Alto) tailored for memory bandwidth- and latency-sensitive workloads. In bandwidth-bound scenarios, our "best-shot" interleaving, guided by our novel performance prediction model, achieves close-to-optimal scenarios by exploiting the aggregate system and CXL/NUMA memory bandwidth. For latency-sensitive workloads, Alto, driven by our key insight of utilizing "amortized" memory latency to regulate unnecessary page migrations, achieves up to 177% improvement over state-of-the-art memory tiering systems like TPP, as demonstrated through extensive evaluation with 8 real-world applications.
Flexible Swapping for the Cloud
Milan Pandurov, Lukas Humbel, Dmitry Sepp, Adamos Ttofari, Leon Thomm, Do Le Quoc, Siddharth Chandrasekaran, Sharan Santhanam, Chuan Ye, Shai Bergman, Wei Wang, Sven Lundgren, Konstantinos Sagonas, Alberto Ros
Sep 23 2024 cs.DC cs.OS arXiv:2409.13327v1

@misc{2409.13327, author = {Milan Pandurov and Lukas Humbel and Dmitry Sepp and Adamos Ttofari and Leon Thomm and Do Le Quoc and Siddharth Chandrasekaran and Sharan Santhanam and Chuan Ye and Shai Bergman and Wei Wang and Sven Lundgren and Konstantinos Sagonas and Alberto Ros}, title = {{F}lexible {S}wapping for the {C}loud}, year = {2024}, eprint = {2409.13327}, note = {arXiv:2409.13327v1} }
Memory has become the primary cost driver in cloud data centers. Yet, a significant portion of memory allocated to VMs in public clouds remains unused. To optimize this resource, "cold" memory can be reclaimed from VMs and stored on slower storage or compressed, enabling memory overcommit. Current overcommit systems rely on general-purpose OS swap mechanisms, which are not optimized for virtualized workloads, leading to missed memory-saving opportunities and ineffective use of optimizations like prefetchers. This paper introduces a userspace memory management framework designed for VMs. It enables custom policies that have full control over the virtual machines' memory using a simple userspace API, supports huge page-based swapping to satisfy VM performance requirements, is easy to deploy by leveraging Linux/KVM, and supports zero-copy I/O virtualization with shared VM memory. Our evaluation demonstrates that an overcommit system based on our framework outperforms the state-of-the-art solutions on both micro-benchmarks and commonly used cloud workloads. Specifically, our implementation outperforms the Linux Kernel baseline implementation by up to 25% while saving a similar amount of memory. We also demonstrate the benefits of custom policies by implementing workload-specific reclaimers and prefetchers that save 10% additional memory, improve performance in a limited memory scenario by 30% over the Linux baseline, and recover faster from hard limit releases.
Analysis of Synchronization Mechanisms in Operating Systems
Oluwatoyin Kode, Temitope Oyemade
Sep 18 2024 cs.OS arXiv:2409.11271v1

@misc{2409.11271, author = {Oluwatoyin Kode and Temitope Oyemade}, title = {{A}nalysis of {S}ynchronization {M}echanisms in {O}perating {S}ystems}, year = {2024}, eprint = {2409.11271}, note = {arXiv:2409.11271v1} }
This research analyzed the performance and consistency of four synchronization mechanisms (reentrant locks, semaphores, synchronized methods, and synchronized blocks) across three operating systems: macOS, Windows, and Linux. Synchronization ensures that concurrent processes or threads access shared resources safely, and efficient synchronization is vital for maintaining system performance and reliability. The study aimed to identify the synchronization mechanism that balances efficiency, measured by execution time, and consistency, assessed by variance and standard deviation, across platforms. The initial hypothesis proposed that mutex-based mechanisms, specifically synchronized methods and blocks, would be the most efficient due to their simplicity. However, empirical results showed that reentrant locks had the lowest average execution time (14.67 ms), making them the most efficient mechanism, but with the highest variability (standard deviation of 1.15). In contrast, synchronized methods, blocks, and semaphores exhibited higher average execution times (16.33 ms for methods and 16.67 ms for blocks) but with greater consistency (variance of 0.33). The findings indicated that while reentrant locks were faster, they were more platform-dependent, whereas mutex-based mechanisms provided more predictable performance across all operating systems. The use of virtual machines for Windows and Linux was a limitation, potentially affecting the results. Future research should include native testing and explore additional synchronization mechanisms and higher concurrency levels. These insights help developers and system designers optimize synchronization strategies for either performance or stability, depending on the application's requirements.
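The abstract above quotes two different measures of spread: a standard deviation (1.15) for reentrant locks and a variance (0.33) for the other mechanisms. The short Python sketch below puts both on the same scale. The sample values are hypothetical, chosen only so that their summary statistics match the quoted figures; they are not the paper's measurements.

```python
import statistics

# Hypothetical per-platform timings in milliseconds (NOT the paper's data),
# picked only because they reproduce the summary figures quoted above.
reentrant_lock_ms = [14, 14, 16]
synchronized_method_ms = [16, 16, 17]

lock_mean = round(statistics.mean(reentrant_lock_ms), 2)
lock_stdev = round(statistics.stdev(reentrant_lock_ms), 2)

method_mean = round(statistics.mean(synchronized_method_ms), 2)
method_variance = round(statistics.variance(synchronized_method_ms), 2)
# The standard deviation is the square root of the variance, so a variance
# of about 0.33 corresponds to a standard deviation of about 0.58.
method_stdev = round(statistics.stdev(synchronized_method_ms), 2)

print(lock_mean, lock_stdev)
print(method_mean, method_variance, method_stdev)
```

Expressed on one scale, the comparison is a standard deviation of about 1.15 against about 0.58, which still supports the abstract's ordering.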
eBPF-mm: Userspace-guided memory management in Linux with eBPF
Konstantinos Mores, Stratos Psomadakis, Georgios Goumas
Sep 18 2024 cs.OS cs.AR arXiv:2409.11220v1

@misc{2409.11220, author = {Konstantinos Mores and Stratos Psomadakis and Georgios Goumas}, title = {e{BPF}-mm: {U}serspace-guided memory management in {L}inux with e{BPF}}, year = {2024}, eprint = {2409.11220}, note = {arXiv:2409.11220v1} }
We leverage eBPF to implement custom policies in the Linux memory subsystem. Inspired by CBMM, we create a mechanism that provides the kernel with hints regarding the benefit of promoting a page to a specific size. We introduce a new hook point in the Linux page fault handling path for eBPF programs, providing them with the context necessary to determine the page size to be used. We then develop a framework that allows users to define profiles for their applications and load them into the kernel. A profile consists of memory regions of interest and their expected benefit from being backed by 4KB, 64KB and 2MB pages. In our evaluation, we profiled our workloads to identify hot memory regions using DAMON.
Skip TLB flushes for reused pages within mmap's
Frederic Schimmelpfennig, André Brinkmann, Hossein Asadi, Reza Salkhordeh
Sep 18 2024 cs.OS cs.DC arXiv:2409.10946v1

@misc{2409.10946, author = {Frederic Schimmelpfennig and André Brinkmann and Hossein Asadi and Reza Salkhordeh}, title = {{S}kip {TLB} flushes for reused pages within mmap's}, year = {2024}, eprint = {2409.10946}, note = {arXiv:2409.10946v1} }
Memory access efficiency is significantly enhanced by caching recent address translations in the CPUs' Translation Lookaside Buffers (TLBs). However, since the operating system is not aware of which core is using a particular mapping, it flushes TLB entries across all cores where the application runs whenever addresses are unmapped, ensuring security and consistency. These TLB flushes, known as TLB shootdowns, are costly and create a performance and scalability bottleneck. A key contributor to TLB shootdowns is memory-mapped I/O, particularly during mmap-munmap cycles and page cache evictions. Often, the same physical pages are reassigned to the same process post-eviction, presenting an opportunity for the operating system to reduce the frequency of TLB shootdowns. We demonstrate that, by slightly extending the mmap function, TLB shootdowns for these "recycled pages" can be avoided. Therefore, we introduce and implement the "fast page recycling" (FPR) feature within the mmap system call. FPR-mmaps maintain security by only triggering TLB shootdowns when a page exits its recycling cycle and is allocated to a different process. To ensure consistency when FPR-mmap pointers are used, we made minor adjustments to virtual memory management to avoid the ABA problem. Unlike previous methods to mitigate shootdown effects, our approach does not require any hardware modifications and operates transparently within the existing Linux virtual memory framework. Our evaluations across a variety of CPU, memory, and storage setups, including persistent memory and Optane SSDs, demonstrate that FPR delivers notable performance gains, with improvements of up to 28% in real-world applications and 92% in micro-benchmarks. Additionally, we show that TLB shootdowns are a significant source of bottlenecks, previously misattributed to other components of the Linux kernel.
BULKHEAD: Secure, Scalable, and Efficient Kernel Compartmentalization with PKS
Yinggang Guo, Zicheng Wang, Weiheng Bai, Qingkai Zeng, Kangjie Lu
Sep 17 2024 cs.CR cs.OS arXiv:2409.09606v1

@misc{2409.09606, author = {Yinggang Guo and Zicheng Wang and Weiheng Bai and Qingkai Zeng and Kangjie Lu}, title = {{BULKHEAD}: {S}ecure, {S}calable, and {E}fficient {K}ernel {C}ompartmentalization with {PKS}}, year = {2024}, eprint = {2409.09606}, doi = {10.14722/ndss.2025.23328}, note = {arXiv:2409.09606v1} }
The endless stream of vulnerabilities urgently calls for principled mitigation to confine the effect of exploitation. However, the monolithic architecture of commodity OS kernels, like the Linux kernel, allows an attacker to compromise the entire system by exploiting a vulnerability in any kernel component. Kernel compartmentalization is a promising approach that follows the least-privilege principle. However, existing mechanisms struggle with the trade-off among security, scalability, and performance, given the challenges stemming from mutual untrustworthiness among numerous and complex components. In this paper, we present BULKHEAD, a secure, scalable, and efficient kernel compartmentalization technique that offers bi-directional isolation for unlimited compartments. It leverages Intel's new hardware feature PKS to isolate data and code into mutually untrusted compartments and benefits from its fast compartment switching. With untrust in mind, BULKHEAD introduces a lightweight in-kernel monitor that enforces multiple important security invariants, including data integrity, execute-only memory, and compartment interface integrity. In addition, it provides a locality-aware two-level scheme that scales to unlimited compartments. We implement a prototype system on Linux v6.1 to compartmentalize loadable kernel modules (LKMs). Extensive evaluation confirms the effectiveness of our approach. In terms of system-wide impact, BULKHEAD incurs an average performance overhead of 2.44% for real-world applications with 160 compartmentalized LKMs. When focusing on a specific compartment, ApacheBench tests on ipv6 show an overhead of less than 2%. Moreover, the performance is almost unaffected by the number of compartments, which makes it highly scalable.
Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent Interconnects
Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe
Sep 13 2024 cs.AR cs.OS arXiv:2409.08141v1

@misc{2409.08141, author = {Anastasiia Ruzhanskaia and Pengcheng Xu and David Cock and Timothy Roscoe}, title = {{R}ethinking {P}rogrammed {I}/{O} for {F}ast {D}evices, {C}heap {C}ores, and {C}oherent {I}nterconnects}, year = {2024}, eprint = {2409.08141}, note = {arXiv:2409.08141v1} }
Conventional wisdom holds that an efficient interface between an OS running on a CPU and a high-bandwidth I/O device should be based on Direct Memory Access (DMA), descriptor rings, and interrupts: DMA offloads transfers from the CPU, descriptor rings provide buffering and queuing, and interrupts facilitate asynchronous interaction between cores and device with a lightweight notification mechanism. In this paper we question this wisdom in the light of modern hardware and workloads, particularly in cloud servers. We argue that the assumptions that led to this model are obsolete, and that in many use-cases programmed I/O, where the CPU explicitly transfers data and control information to and from a device via loads and stores, actually results in a more efficient system. We quantitatively demonstrate these advantages using three use-cases: fine-grained RPC-style invocation of functions on an accelerator, offloading of operators in a streaming dataflow engine, and a network interface targeting serverless functions. Moreover, we show that while these advantages are significant over a modern PCIe peripheral bus, a truly cache-coherent interconnect offers significant additional efficiency gains.
SafeBPF: Hardware-assisted Defense-in-depth for eBPF Kernel Extensions
Soo Yee Lim, Tanya Prasad, Xueyuan Han, Thomas Pasquier
Sep 13 2024 cs.CR cs.OS arXiv:2409.07508v1

@misc{2409.07508, author = {Soo Yee Lim and Tanya Prasad and Xueyuan Han and Thomas Pasquier}, title = {{S}afe{BPF}: {H}ardware-assisted {D}efense-in-depth for e{BPF} {K}ernel {E}xtensions}, year = {2024}, eprint = {2409.07508}, note = {arXiv:2409.07508v1} }
The eBPF framework enables execution of user-provided code in the Linux kernel. In the last few years, a large ecosystem of cloud services has leveraged eBPF to enhance container security, system observability, and network management. Meanwhile, incessant discoveries of memory safety vulnerabilities have left the systems community with no choice but to disallow unprivileged eBPF programs, which unfortunately limits eBPF use to only privileged users. To improve run-time safety of the framework, we introduce SafeBPF, a general design that isolates eBPF programs from the rest of the kernel to prevent memory safety vulnerabilities from being exploited. We present a pure software implementation using a Software-based Fault Isolation (SFI) approach and a hardware-assisted implementation that leverages ARM's Memory Tagging Extension (MTE). We show that SafeBPF incurs up to 4% overhead on macrobenchmarks while achieving desired security properties.
The HitchHiker's Guide to High-Assurance System Observability Protection with Efficient Permission Switches
Chuqi Zhang, Jun Zeng, Yiming Zhang, Adil Ahmad, Fengwei Zhang, Hai Jin, Zhenkai Liang
Sep 10 2024 cs.CR cs.OS arXiv:2409.04484v1

@misc{2409.04484, author = {Chuqi Zhang and Jun Zeng and Yiming Zhang and Adil Ahmad and Fengwei Zhang and Hai Jin and Zhenkai Liang}, title = {{T}he {H}itch{H}iker's {G}uide to {H}igh-{A}ssurance {S}ystem {O}bservability {P}rotection with {E}fficient {P}ermission {S}witches}, year = {2024}, eprint = {2409.04484}, doi = {10.1145/3658644.3690188}, note = {arXiv:2409.04484v1} }
Protecting system observability records (logs) from compromised OSs has gained significant traction in recent times, with several noteworthy approaches proposed. Unfortunately, none of the proposed approaches achieve high performance with tiny log protection delays. They also leverage risky environments for protection (e.g., many use general-purpose hypervisors or TrustZone, which have large TCB and attack surfaces). HitchHiker is an attempt to rectify this problem. The system is designed to ensure (a) in-memory protection of batched logs within a short and configurable real-time deadline by efficient hardware permission switching, and (b) an end-to-end high-assurance environment built upon hardware protection primitives with debloating strategies for secure log protection, persistence, and management. Security evaluations and validations show that HitchHiker reduces log protection delay by 93.3--99.3% compared to the state-of-the-art, while reducing TCB by 9.4--26.9X. Performance evaluations show HitchHiker incurs a geometric mean of less than 6% overhead on diverse real-world programs, improving on the state-of-the-art approach by 61.9--77.5%.
Head-First Memory Allocation on Best-Fit with Space-Fitting
Adam Noto Hakarsa
Sep 06 2024 cs.OS arXiv:2409.03488v1

@misc{2409.03488, author = {Adam Noto Hakarsa}, title = {{H}ead-{F}irst {M}emory {A}llocation on {B}est-{F}it with {S}pace-{F}itting}, year = {2024}, eprint = {2409.03488}, note = {arXiv:2409.03488v1} }
Although best-fit is known to be slow, it excels at optimizing memory space utilization. Interestingly, by keeping the free memory region at the top of the memory, the process of memory allocation and deallocation becomes approximately 34.86% faster while also keeping external fragmentation to a minimum.
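For readers unfamiliar with the baseline this abstract builds on, the following Python sketch shows plain best-fit allocation over a free list. It is only an illustration of ordinary best-fit; the abstract does not describe the head-first or space-fitting details, so none of them are modeled here, and the function name and the `(offset, length)` representation are assumptions made for this example.

```python
def best_fit_allocate(free_blocks, size):
    """Allocate `size` units from the smallest free block that fits.

    free_blocks is a list of (offset, length) tuples and is modified in place.
    Returns the offset of the allocation, or None if no block is large enough.
    """
    best_index = None
    for index, (_, length) in enumerate(free_blocks):
        fits = length >= size
        if fits and (best_index is None or length < free_blocks[best_index][1]):
            best_index = index
    if best_index is None:
        return None

    offset, length = free_blocks[best_index]
    if length == size:
        # Exact fit: the free block disappears entirely.
        free_blocks.pop(best_index)
    else:
        # Partial fit: keep the unused tail of the block on the free list.
        free_blocks[best_index] = (offset + size, length - size)
    return offset


free_blocks = [(0, 64), (100, 16), (200, 32)]
first = best_fit_allocate(free_blocks, 16)    # exact fit in the 16-unit block
second = best_fit_allocate(free_blocks, 20)   # smallest block that fits has 32 units
third = best_fit_allocate(free_blocks, 128)   # nothing is large enough
print(first, second, third, free_blocks)
```

The linear scan over every free block is the reason best-fit is usually described as slow, which is the cost the abstract says it reduces.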
FlexBSO: Flexible Block Storage Offload for Datacenters
Vojtech Aschenbrenner, John Shawger, Sadman Sakib
Sep 05 2024 cs.NI cs.OS arXiv:2409.02381v1

@misc{2409.02381, author = {Vojtech Aschenbrenner and John Shawger and Sadman Sakib}, title = {{F}lex{BSO}: {F}lexible {B}lock {S}torage {O}ffload for {D}atacenters}, year = {2024}, eprint = {2409.02381}, note = {arXiv:2409.02381v1} }
Efficient virtualization of CPU and memory is standardized and mature. Capabilities such as Intel VT-x [3] have been added by manufacturers for efficient hypervisor support. In contrast, virtualization of a block device and its presentation to the virtual machines on the host can be done in multiple ways. Indeed, hyperscalers develop in-house solutions to improve performance and cost-efficiency of their storage solutions for datacenters. Unfortunately, these storage solutions are based on specialized hardware and software which are not publicly available. The traditional solution is to expose a virtual block device to the VM through a paravirtualized driver like virtio [2]. virtio provides significantly better performance than real block device driver emulation because of host OS and guest OS cooperation. The IO requests are then fulfilled by the host OS either with a local block device such as an SSD drive or with some form of disaggregated storage over the network like NVMe-oF or iSCSI. There are three main problems with the traditional solution. 1) Cost. IO operations consume host CPU cycles due to host OS involvement. These CPU cycles are doing useless work from the application point of view. 2) Inflexibility. Any change of the virtualized storage stack requires host OS and/or guest OS cooperation and cannot be done silently in production. 3) Performance. IO operations cause recurring VM EXITs to transition from non-root mode to root mode on the host CPU. This results in an excessive IO performance impact. We propose FlexBSO, a hardware-assisted solution, which solves all the mentioned issues. Our prototype is based on the publicly available Bluefield-2 SmartNIC with NVIDIA SNAP support, hence can be deployed without any obstacles.
Foreactor: Exploiting Storage I/O Parallelism with Explicit Speculation
Guanzhou Hu, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau
Sep 04 2024 cs.OS arXiv:2409.01580v1

@misc{2409.01580, author = {Guanzhou Hu and Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau}, title = {{F}oreactor: {E}xploiting {S}torage {I}/{O} {P}arallelism with {E}xplicit {S}peculation}, year = {2024}, eprint = {2409.01580}, note = {arXiv:2409.01580v1} }
We introduce explicit speculation, a variant of the I/O speculation technique where I/O system calls can be parallelized under the guidance of explicit application code knowledge. We propose a formal abstraction -- the foreaction graph -- which describes the exact pattern of I/O system calls in an application function as well as any computation needed to produce their argument values. I/O system calls can be issued ahead of time if the graph says it is safe and beneficial to do so. With explicit speculation, serial applications can exploit storage I/O parallelism without involving expensive prediction or checkpointing mechanisms. Based on explicit speculation, we implement Foreactor, a library framework that allows application developers to concretize foreaction graphs and enable concurrent I/O with little or no modification to application source code. Experimental results show that Foreactor is able to improve the performance of both synthetic benchmarks and real applications by significant amounts (29%-50%).
CyberCortex.AI: An AI-based Operating System for Autonomous Robotics and Complex Automation
Sorin Grigorescu, Mihai Zaha
Sep 04 2024 cs.RO cs.AI cs.OS arXiv:2409.01241v3

@misc{2409.01241, author = {Sorin Grigorescu and Mihai Zaha}, title = {{C}yber{C}ortex.{AI}: {A}n {AI}-based {O}perating {S}ystem for {A}utonomous {R}obotics and {C}omplex {A}utomation}, year = {2024}, eprint = {2409.01241}, howpublished = {Journal of Field Robotics, August 2024, pp. 1-19}, doi = {10.1002/rob.22426}, note = {arXiv:2409.01241v3} }
The underlying frameworks for controlling autonomous robots and complex automation applications are Operating Systems (OS) capable of scheduling perception-and-control tasks, as well as providing real-time data communication to other robotic peers and remote cloud computers. In this paper, we introduce CyberCortex AI, a robotics OS designed to enable heterogeneous AI-based robotics and complex automation applications. CyberCortex AI is a decentralized distributed OS which enables robots to talk to each other, as well as to High Performance Computers (HPC) in the cloud. Sensory and control data from the robots is streamed towards HPC systems with the purpose of training AI algorithms, which are afterwards deployed on the robots. Each functionality of a robot (e.g. sensory data acquisition, path planning, motion control, etc.) is executed within a so-called DataBlock of Filters shared through the internet, where each filter is computed either locally on the robot itself, or remotely on a different robotic system. The data is stored and accessed via a so-called Temporal Addressable Memory (TAM), which acts as a gateway between each filter's input and output. CyberCortex AI has two main components: i) the CyberCortex AI inference system, which is a real-time implementation of the DataBlock running on the robots' embedded hardware, and ii) the CyberCortex AI dojo, which runs on an HPC computer in the cloud and is used to design, train and deploy AI algorithms. We present a quantitative and qualitative performance analysis of the proposed approach using two collaborative robotics applications: i) a forest fire prevention system based on a Unitree A1 legged robot and an Anafi Parrot 4K drone, as well as ii) an autonomous driving system which uses CyberCortex AI for collaborative perception and motion control.
Tide: A Split OS Architecture for Control Plane Offloading
Jack Tigar Humphries, Neel Natu, Kostis Kaffes, Stanko Novaković, Paul Turner, Hank Levy, David Culler, Christos Kozyrakis
Sep 02 2024 cs.OS arXiv:2408.17351v2

@misc{2408.17351, author = {Jack Tigar Humphries and Neel Natu and Kostis Kaffes and Stanko Novaković and Paul Turner and Hank Levy and David Culler and Christos Kozyrakis}, title = {{T}ide: {A} {S}plit {OS} {A}rchitecture for {C}ontrol {P}lane {O}ffloading}, year = {2024}, eprint = {2408.17351}, note = {arXiv:2408.17351v2} }
The end of Moore's Law is driving cloud providers to offload virtualization and the network data plane to SmartNICs to improve compute efficiency. Even though individual OS control plane tasks consume up to 5% of cycles across the fleet, they remain on the host CPU because they are tightly intertwined with OS mechanisms. Moreover, offloading puts the slow PCIe interconnect in the critical path of OS decisions. We propose Tide, a new split OS architecture that separates OS control plane policies from mechanisms and offloads the control plane policies onto a SmartNIC. Tide has a new host-SmartNIC communication API, state synchronization mechanism, and communication mechanisms that overcome the PCIe bottleneck, even for $\mu$s-scale workloads. Tide frees up host compute for applications and unlocks new optimization opportunities, including machine learning-driven policies, scheduling on the network I/O path, and reducing on-host interference. We demonstrate that Tide enables OS control planes that are competitive with on-host performance for the most difficult $\mu$s-scale workloads. Tide outperforms on-host control planes for memory management (saving 16 host cores), Stubby network RPCs (saving 8 cores), and GCE virtual machine management (11.2% performance improvement).
FRAP: A Flexible Resource Accessing Protocol for Multiprocessor Real-Time Systems
Shuai Zhao, Hanzhi Xu, Nan Chen, Ruoxian Su, Wanli Chang
Aug 27 2024 cs.OS arXiv:2408.13772v2

@misc{2408.13772, author = {Shuai Zhao and Hanzhi Xu and Nan Chen and Ruoxian Su and Wanli Chang}, title = {{FRAP}: {A} {F}lexible {R}esource {A}ccessing {P}rotocol for {M}ultiprocessor {R}eal-{T}ime {S}ystems}, year = {2024}, eprint = {2408.13772}, note = {arXiv:2408.13772v2} }
Fully-partitioned fixed-priority scheduling (FP-FPS) multiprocessor systems are widely found in real-time applications, where spin-based protocols are often deployed to manage the mutually exclusive access of shared resources. Unfortunately, existing approaches either enforce rigid spin priority rules for resource accessing or carry significant pessimism in the schedulability analysis, imposing substantial blocking time regardless of task execution urgency or resource over-provisioning. This paper proposes FRAP, a spin-based flexible resource accessing protocol for FP-FPS systems. A task under FRAP can spin at any priority within a range for accessing a resource, allowing flexible and fine-grained resource control with predictable worst-case behaviour. Under flexible spinning, we demonstrate that the existing analysis techniques can lead to incorrect timing bounds and present a novel MCMF (minimum cost maximum flow)-based blocking analysis, providing predictability guarantee for FRAP. A spin priority assignment is reported that fully exploits flexible spinning to reduce the blocking time of tasks with high urgency, enhancing the performance of FRAP. Experimental results show that FRAP outperforms the existing spin-based protocols in schedulability by 15.20%-32.73% on average, up to 65.85%.
Telepathic Datacenters: Fast RPCs using Shared CXL Memory
Suyash Mahar, Ehsan Hajyjasini, Seungjin Lee, Zifeng Zhang, Mingyao Shen, Steven Swanson
Aug 22 2024 cs.DC cs.OS arXiv:2408.11325v1

@misc{2408.11325, author = {Suyash Mahar and Ehsan Hajyjasini and Seungjin Lee and Zifeng Zhang and Mingyao Shen and Steven Swanson}, title = {{T}elepathic {D}atacenters: {F}ast {RPC}s using {S}hared {CXL} {M}emory}, year = {2024}, eprint = {2408.11325}, note = {arXiv:2408.11325v1} }
Datacenter applications often rely on remote procedure calls (RPCs) for fast, efficient, and secure communication. However, RPCs are slow, inefficient, and hard to use as they require expensive serialization and compression to communicate over a packetized serial network link. Compute Express Link 3.0 (CXL) offers an alternative solution, allowing applications to share data using a cache-coherent, shared-memory interface across clusters of machines. RPCool is a new framework that exploits CXL's shared memory capabilities. RPCool avoids serialization by passing pointers to data structures in shared memory. While avoiding serialization is useful, directly sharing pointer-rich data eliminates the isolation that copying data over traditional networks provides, leaving the receiver vulnerable to invalid pointers and concurrent updates to shared data by the sender. RPCool restores this safety with careful and efficient management of memory permissions. Another significant challenge with CXL shared memory capabilities is that they are unlikely to scale to an entire datacenter. RPCool addresses this by falling back to RDMA-based communication. Overall, RPCool reduces the round-trip latency by 1.93$\times$ and 7.2$\times$ compared to state-of-the-art RDMA and CXL-based RPC mechanisms, respectively. Moreover, RPCool performs either comparably or better than other RPC mechanisms across a range of workloads.
Delegation with Trust<T>: A Scalable, Type- and Memory-Safe Alternative to Locks
Noaman Ahmad, Ben Baenen, Chen Chen, Jakob Eriksson
Aug 22 2024 cs.PF cs.OS arXiv:2408.11173v1

@misc{2408.11173, author = {Noaman Ahmad and Ben Baenen and Chen Chen and Jakob Eriksson}, title = {{D}elegation with {T}rust<{T}>: {A} {S}calable, {T}ype- and {M}emory-{S}afe {A}lternative to {L}ocks}, year = {2024}, eprint = {2408.11173}, note = {arXiv:2408.11173v1} }
We present Trust<T>, a general, type- and memory-safe alternative to locking in concurrent programs. Instead of synchronizing multi-threaded access to an object of type T with a lock, the programmer may place the object in a Trust<T>. The object is then no longer directly accessible. Instead a designated thread, the object's trustee, is responsible for applying any requested operations to the object, as requested via the Trust<T> API. Locking is often said to offer a limited throughput per lock. Trust<T> is based on delegation, a message-passing technique which does not suffer this per-lock limitation. Instead, per-object throughput is limited by the capacity of the object's trustee, which is typically considerably higher. Our evaluation shows Trust<T> consistently and considerably outperforming locking where lock contention exists, with up to 22x higher throughput in microbenchmarks, and 5-9x for a home grown key-value store, as well as memcached, in situations with high lock contention. Moreover, Trust<T> is competitive with locks even in the absence of lock contention.
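The delegation idea described above can be sketched in a few lines of Python: callers never touch the shared object directly, but send operations to a single trustee thread that applies them one at a time. This is a minimal illustration of delegation in general, not the paper's API; the class name `Trust`, its `apply` and `close` methods, and the blocking request queue are all assumptions made for this example.

```python
import queue
import threading
from concurrent.futures import Future


class Trust:
    """Holds an object; one trustee thread applies every operation to it."""

    def __init__(self, obj):
        self._obj = obj
        self._requests = queue.Queue()
        self._trustee = threading.Thread(target=self._serve, daemon=True)
        self._trustee.start()

    def _serve(self):
        # Only this thread ever touches self._obj, so no lock is needed.
        while True:
            item = self._requests.get()
            if item is None:  # shutdown marker sent by close()
                break
            func, future = item
            try:
                future.set_result(func(self._obj))
            except Exception as exc:
                future.set_exception(exc)

    def apply(self, func):
        """Delegate func(obj) to the trustee and wait for its result."""
        future = Future()
        self._requests.put((func, future))
        return future.result()

    def close(self):
        self._requests.put(None)
        self._trustee.join()


def increment(state):
    state["value"] += 1
    return state["value"]


def worker(trust, repetitions):
    for _ in range(repetitions):
        trust.apply(increment)


counter = Trust({"value": 0})
threads = [threading.Thread(target=worker, args=(counter, 100)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

total = counter.apply(lambda state: state["value"])
counter.close()
print(total)
```

Because every increment runs on the trustee thread, the four workers cannot interleave their read-modify-write steps. This sketch makes each caller block on its own request, so it shows the safety property only and says nothing about the throughput figures reported in the abstract.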
Timing Analysis and Priority-driven Enhancements of ROS 2 Multi-threaded Executors
Hoora Sobhani, Hyunjong Choi, Hyoseung Kim
Aug 19 2024 eess.SY cs.OS cs.RO cs.SY arXiv:2408.08440v1

@misc{2408.08440, author = {Hoora Sobhani and Hyunjong Choi and Hyoseung Kim}, title = {{T}iming {A}nalysis and {P}riority-driven {E}nhancements of {ROS} 2 {M}ulti-threaded {E}xecutors}, year = {2024}, eprint = {2408.08440}, note = {arXiv:2408.08440v1} }
The second generation of Robotic Operating System, ROS 2, has gained much attention for its potential to be used for safety-critical robotic applications. The need to provide a solid foundation for timing correctness and scheduling mechanisms is therefore growing rapidly. Although there are some pioneering studies conducted on formally analyzing the response time of processing chains in ROS 2, the focus has been limited to single-threaded executors, and multi-threaded executors, despite their advantages, have not been studied well. To fill this knowledge gap, in this paper, we propose a comprehensive response-time analysis framework for chains running on ROS 2 multi-threaded executors. We first analyze the timing behavior of the default scheduling scheme in ROS 2 multi-threaded executors, and then present priority-driven scheduling enhancements to address the limitations of the default scheme. Our framework can analyze chains with both arbitrary and constrained deadlines and also the effect of mutually-exclusive callback groups. Evaluation is conducted by a case study on NVIDIA Jetson AGX Xavier and schedulability experiments using randomly-generated chains. The results demonstrate that our analysis framework can safely upper-bound response times under various conditions and the priority-driven scheduling enhancements not only reduce the response time of critical chains but also improve analytical bounds.
Inspection of I/O Operations from System Call Traces using Directly-Follows-Graph
Aravind Sankaran, Ilya Zhukov, Wolfgang Frings, Paolo Bientinesi
Aug 15 2024 cs.PF cs.OS arXiv:2408.07378v2

@misc{2408.07378, author = {Aravind Sankaran and Ilya Zhukov and Wolfgang Frings and Paolo Bientinesi}, title = {{I}nspection of {I}/{O} {O}perations from {S}ystem {C}all {T}races using {D}irectly-{F}ollows-{G}raph}, year = {2024}, eprint = {2408.07378}, note = {arXiv:2408.07378v2} }
We aim to identify the differences in Input/Output (I/O) behavior between multiple user programs through the inspection of system calls (i.e., requests made to the operating system). A typical program issues a large number of I/O requests to the operating system, thereby making the process of inspection challenging. In this paper, we address this challenge by presenting a methodology to synthesize I/O system call traces into a specific type of directed graph, known as the Directly-Follows-Graph (DFG). Based on the DFG, we present a technique to compare the traces from multiple programs or different configurations of the same program, such that it is possible to identify the differences in the I/O behavior. We apply our methodology to the IOR benchmark, and compare the contentions for file accesses when the benchmark is run with different options for file output and software interface.
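A directly-follows graph has one node per event type and one weighted edge for each pair of events that occur back to back in a trace. The Python sketch below builds such a graph from a list of system call names and lists the edges on which two traces differ. The traces and function names are hypothetical, and the abstract does not specify the paper's exact construction or comparison technique, so this shows only the general idea.

```python
from collections import Counter


def directly_follows_graph(trace):
    """Count how often each event is immediately followed by another."""
    edges = Counter()
    for current, following in zip(trace, trace[1:]):
        edges[(current, following)] += 1
    return edges


def compare_graphs(dfg_a, dfg_b):
    """Return the edges whose counts differ, mapped to (count_a, count_b)."""
    return {
        edge: (dfg_a[edge], dfg_b[edge])
        for edge in sorted(set(dfg_a) | set(dfg_b))
        if dfg_a[edge] != dfg_b[edge]
    }


# Hypothetical traces: one run issues a single write, the other issues two.
trace_a = ["openat", "write", "close"]
trace_b = ["openat", "write", "write", "close"]

dfg_a = directly_follows_graph(trace_a)
dfg_b = directly_follows_graph(trace_b)
diff = compare_graphs(dfg_a, dfg_b)
print(diff)
```

Here the only differing edge is `("write", "write")`, which appears in the second trace but not the first, so the comparison points directly at the repeated write.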
CRISP: Confidentiality, Rollback, and Integrity Storage Protection for Confidential Cloud-Native Computing
Ardhi Putra Pratama Hartono, Andrey Brito, Christof Fetzer
Aug 14 2024 cs.CR cs.OS arXiv:2408.06822v2

@misc{2408.06822, author = {Ardhi Putra Pratama Hartono and Andrey Brito and Christof Fetzer}, title = {{CRISP}: {C}onfidentiality, {R}ollback, and {I}ntegrity {S}torage {P}rotection for {C}onfidential {C}loud-{N}ative {C}omputing}, year = {2024}, eprint = {2408.06822}, doi = {10.1109/CLOUD62652.2024.00026}, note = {arXiv:2408.06822v2} }
Trusted execution environments (TEEs) protect the integrity and confidentiality of running code and its associated data. Nevertheless, TEEs' integrity protection does not extend to the state saved on disk. Furthermore, modern cloud-native applications heavily rely on orchestration (e.g., through systems such as Kubernetes) and, thus, have their services frequently restarted. During restarts, attackers can revert the state of confidential services to a previous version that may aid their malicious intent. This paper presents CRISP, a rollback protection mechanism that uses an existing runtime for Intel SGX and transparently prevents rollback. Our approach can constrain the attack window to a fixed and short period or give developers the tools to avoid the vulnerability window altogether. Finally, experiments show that applying CRISP in a critical stateful cloud-native application may incur a resource increase but only a minor performance penalty.
Object as a Service: Simplifying Cloud-Native Development through Serverless Object Abstraction
Pawissanutt Lertpongrujikorn, Mohsen Amini Salehi
Aug 12 2024 cs.DC cs.OS cs.SE arXiv:2408.04898v1

@misc{2408.04898, author = {Pawissanutt Lertpongrujikorn and Mohsen Amini Salehi}, title = {{O}bject as a {S}ervice: {S}implifying {C}loud-{N}ative {D}evelopment through {S}erverless {O}bject {A}bstraction}, year = {2024}, eprint = {2408.04898}, note = {arXiv:2408.04898v1} }
The function-as-a-service (FaaS) paradigm is envisioned as the next generation of cloud computing systems that mitigate the burden for cloud-native application developers by abstracting them from cloud resource management. However, it does not deal with the application data aspects. As such, developers have to intervene and undergo the burden of managing the application data, often via separate cloud storage services. To further streamline cloud-native application development, in this work, we propose a new paradigm, known as Object as a Service (OaaS) that encapsulates application data and functions into the cloud object abstraction. OaaS relieves developers from resource and data management burden while offering built-in optimization features. Inspired by OOP, OaaS incorporates access modifiers and inheritance into the serverless paradigm that: (a) prevents developers from compromising the system via accidentally accessing underlying data; and (b) enables software reuse in cloud-native application development. Furthermore, OaaS natively supports dataflow semantics. It enables developers to define function workflows while transparently handling data navigation, synchronization, and parallelism issues. To establish the OaaS paradigm, we develop a platform named Oparaca that offers state abstraction for structured and unstructured data with consistency and fault-tolerant guarantees. We evaluated Oparaca under real-world settings against state-of-the-art platforms with respect to the imposed overhead, scalability, and ease of use. The results demonstrate that the object abstraction provided by OaaS can streamline flexible and scalable cloud-native application development with an insignificant overhead on the underlying serverless system.
Wasm-bpf: Streamlining eBPF Deployment in Cloud Environments with WebAssembly
Yusheng Zheng, Tong Yu, Yiwei Yang, Andrew Quinn
Aug 12 2024 cs.OS arXiv:2408.04856v1

@misc{2408.04856, author = {Yusheng Zheng and Tong Yu and Yiwei Yang and Andrew Quinn}, title = {{W}asm-bpf: {S}treamlining e{BPF} {D}eployment in {C}loud {E}nvironments with {W}eb{A}ssembly}, year = {2024}, eprint = {2408.04856}, note = {arXiv:2408.04856v1} }
The extended Berkeley Packet Filter (eBPF) is extensively utilized for observability and performance analysis in cloud-native environments. However, deploying eBPF programs across a heterogeneous cloud environment presents challenges, including compatibility issues across different kernel versions, operating systems, runtimes, and architectures. Traditional deployment methods, such as standalone containers or tightly integrated core applications, are cumbersome and inefficient, particularly when dynamic plugin management is required. To address these challenges, we introduce Wasm-bpf, a lightweight runtime on WebAssembly and the WebAssembly System Interface (WASI). Leveraging Wasm platform independence and WASI standardized system interface, with enhanced relocation for different architectures, Wasm-bpf ensures cross-platform compatibility for eBPF programs. It simplifies deployment by integrating with container toolchains, allowing eBPF programs to be packaged as Wasm modules that can be easily managed within cloud environments. Additionally, Wasm-bpf supports dynamic plugin management in WebAssembly. Our implementation and evaluation demonstrate that Wasm-bpf introduces minimal overhead compared to native eBPF implementations while simplifying the deployment process.
Crash Consistency in DRAM-NVM-Disk Hybrid Storage System
Guoyu Wang, Xilong Che, Haoyang Wei, Chenju Pei, Juncheng Hu
Aug 09 2024 cs.OS arXiv:2408.04238v1

@misc{2408.04238, author = {Guoyu Wang and Xilong Che and Haoyang Wei and Chenju Pei and Juncheng Hu}, title = {{C}rash {C}onsistency in {DRAM}-{NVM}-{D}isk {H}ybrid {S}torage {S}ystem}, year = {2024}, eprint = {2408.04238}, note = {arXiv:2408.04238v1} }
NVM is used as a new tier in the storage hierarchy because its speed and capacity fall between those of DRAM and disk, and because of its byte granularity. However, consistency problems emerge when we attempt to combine DRAM, NVM, and disk into an efficient whole. In this paper, we discuss the challenging consistency problems faced by heterogeneous storage systems and propose our solution to them. The discussion uses NVPC as a case study, but it can be instructive for, and adapted to, all similar heterogeneous storage systems.
Hardware-Assisted Virtualization of Neural Processing Units for Cloud Platforms
Yuqi Xue, Yiqi Liu, Lifeng Nai, Jian Huang
Aug 09 2024 cs.AR cs.AI cs.LG cs.OS arXiv:2408.04104v3

@misc{2408.04104, author = {Yuqi Xue and Yiqi Liu and Lifeng Nai and Jian Huang}, title = {{H}ardware-{A}ssisted {V}irtualization of {N}eural {P}rocessing {U}nits for {C}loud {P}latforms}, year = {2024}, eprint = {2408.04104}, note = {arXiv:2408.04104v3} }
Cloud platforms today have been deploying hardware accelerators like neural processing units (NPUs) for powering machine learning (ML) inference services. To maximize the resource utilization while ensuring reasonable quality of service, a natural approach is to virtualize NPUs for efficient resource sharing for multi-tenant ML services. However, virtualizing NPUs for modern cloud platforms is not easy. This is not only due to the lack of system abstraction support for NPU hardware, but also due to the lack of architectural and ISA support for enabling fine-grained dynamic operator scheduling for virtualized NPUs. We present Neu10, a holistic NPU virtualization framework. We investigate virtualization techniques for NPUs across the entire software and hardware stack. Neu10 consists of (1) a flexible NPU abstraction called vNPU, which enables fine-grained virtualization of the heterogeneous compute units in a physical NPU (pNPU); (2) a vNPU resource allocator that enables pay-as-you-go computing model and flexible vNPU-to-pNPU mappings for improved resource utilization and cost-effectiveness; (3) an ISA extension of modern NPU architecture for facilitating fine-grained tensor operator scheduling for multiple vNPUs. We implement Neu10 based on a production-level NPU simulator. Our experiments show that Neu10 improves the throughput of ML inference services by up to 1.4$\times$ and reduces the tail latency by up to 4.6$\times$, while improving the NPU utilization by 1.2$\times$ on average, compared to state-of-the-art NPU sharing approaches.
NVPC: A Transparent NVM Page Cache
Guoyu Wang, Xilong Che, Haoyang Wei, Shuo Chen, Puyi He, Juncheng Hu
Aug 07 2024 cs.OS arXiv:2408.02911v1

@misc{2408.02911, author = {Guoyu Wang and Xilong Che and Haoyang Wei and Shuo Chen and Puyi He and Juncheng Hu}, title = {{NVPC}: {A} {T}ransparent {NVM} {P}age {C}ache}, year = {2024}, eprint = {2408.02911}, note = {arXiv:2408.02911v1} }
Towards a compatible utilization of NVM, NVM-specialized kernel file systems and NVM-based disk file system accelerators have been proposed. However, these studies focus on only one or a few characteristics of NVM and fail to make the best use of it by placing NVM in the proper position within the whole storage stack. In this paper, we present NVPC, a transparent accelerator for existing kernel file systems based on an NVM-enhanced page cache. The acceleration targets two pressing needs of existing disk file systems: sync writes and cache-missed operations. The fast DRAM page cache is preserved for cache-hit operations. For sync writes, a high-performance log-based sync absorbing area redirects data from the slow disk to the fast NVM. Meanwhile, the byte-addressable feature of NVM is used to prevent write amplification. For cache-missed operations, NVPC uses the idle space on NVM to extend the DRAM page cache, so that more and larger workloads can fit into the cache. NVPC is implemented entirely as a page cache, and can thus provide efficient speed-up to disk file systems with full transparency to users and full compatibility with lower file systems. In Filebench macro-benchmarks, NVPC is up to 3.55x, 2.84x, and 2.64x faster than NOVA, Ext-4, and SPFS, respectively. In RocksDB workloads with a working set larger than DRAM, NVPC is 1.12x, 2.59x, and 2.11x faster than NOVA, Ext-4, and SPFS, respectively. Meanwhile, NVPC achieves a net gain over NOVA, Ext-4, and SPFS in 62.5% of the tested cases in our read/write/sync mixed evaluation, demonstrating that NVPC is more balanced and adaptive to complex real-world workloads. Experimental results also show that NVPC is the only method that accelerates Ext-4 in particular cases by up to 15.19x, with no slow-down in any other use case.
Understanding and Enhancing Linux Kernel-based Packet Switching on WiFi Access Points
Shiqi Zhang, Mridul Gupta, Behnam Dezfouli
Aug 05 2024 cs.NI cs.AR cs.OS cs.PF arXiv:2408.01013v1

@misc{2408.01013, author = {Shiqi Zhang and Mridul Gupta and Behnam Dezfouli}, title = {{U}nderstanding and {E}nhancing {L}inux {K}ernel-based {P}acket {S}witching on {W}i{F}i {A}ccess {P}oints}, year = {2024}, eprint = {2408.01013}, note = {arXiv:2408.01013v1} }
As the number of WiFi devices and their traffic demands continue to rise, the need for a scalable and high-performance wireless infrastructure becomes increasingly essential. Central to this infrastructure are WiFi Access Points (APs), which facilitate packet switching between Ethernet and WiFi interfaces. Despite APs' reliance on the Linux kernel's data plane for packet switching, the detailed operations and complexities of switching packets between Ethernet and WiFi interfaces have not been investigated in existing works. This paper makes the following contributions towards filling this research gap. Through macro and micro-analysis of empirical experiments, our study reveals insights in two distinct categories. Firstly, while the kernel's statistics offer valuable insights into system operations, we identify and discuss potential pitfalls that can severely affect system analysis. For instance, we reveal the implications of device drivers on the meaning and accuracy of the statistics related to packet-switching tasks and processor utilization. Secondly, we analyze the impact of the packet switching path and core configuration on performance and power consumption. Specifically, we identify the differences in Ethernet-to-WiFi and WiFi-to-Ethernet data paths regarding processing components, multi-core utilization, and energy efficiency. We show that the WiFi-to-Ethernet data path leverages better multi-core processing and exhibits lower power consumption.
MAARS: Multi-Rate Attack-Aware Randomized Scheduling for Securing Real-time Systems
Arkaprava Sain, Sunandan Adhikary, Ipsita Koley, Soumyajit Dey
Aug 02 2024 eess.SY cs.CR cs.OS cs.SY arXiv:2408.00341v1

@misc{2408.00341, author = {Arkaprava Sain and Sunandan Adhikary and Ipsita Koley and Soumyajit Dey}, title = {{MAARS}: {M}ulti-{R}ate {A}ttack-{A}ware {R}andomized {S}cheduling for {S}ecuring {R}eal-time {S}ystems}, year = {2024}, eprint = {2408.00341}, note = {arXiv:2408.00341v1} }
Modern Cyber-Physical Systems (CPSs) consist of numerous control units interconnected by communication networks. Each control unit executes multiple safety-critical and non-critical tasks in real-time. Most of the safety-critical tasks are executed with a fixed sampling period to ensure deterministic timing behaviour, which helps in their safety and performance analysis. However, adversaries can exploit this deterministic behaviour of safety-critical tasks to launch inference-based attacks on them. This paper aims to prevent, or at least minimize the possibility of, such timing-inference or schedule-based attacks compromising the control units. This is done by switching between strategically chosen execution rates of the safety-critical control tasks such that their performance remains unhampered. Thereafter, we present a novel schedule vulnerability analysis methodology to switch between valid schedules generated for these multiple periodicities of the control tasks at run time. Utilizing these strategies, we introduce a novel Multi-Rate Attack-Aware Randomized Scheduling (MAARS) framework for preemptive fixed-priority schedulers that minimizes the success rate of timing-inference-based attacks on safety-critical real-time systems. To our knowledge, this is the first work to propose a schedule randomization method with attack awareness that preserves both the control and scheduling aspects. The efficacy of the framework in terms of attack prevention is finally evaluated on several automotive benchmarks in a Hardware-in-loop (HiL) environment.
Rusty Linux: Advances in Rust for Linux Kernel Development
Shane K. Panter, Nasir U. Eisty
Jul 29 2024 cs.SE cs.OS arXiv:2407.18431v2

@misc{2407.18431, author = {Shane K.~Panter and Nasir U.~Eisty}, title = {{R}usty {L}inux: {A}dvances in {R}ust for {L}inux {K}ernel {D}evelopment}, year = {2024}, eprint = {2407.18431}, doi = {10.1145/3674805.3690756}, note = {arXiv:2407.18431v2} }
Context: The integration of Rust into kernel development is a transformative endeavor aimed at enhancing system security and reliability by leveraging Rust's strong memory safety guarantees. Objective: We aim to find the current advances in using Rust in kernel development to reduce the number of memory safety vulnerabilities in one of the most critical pieces of software that underpins all modern applications. Method: By analyzing a broad spectrum of studies, we identify the advantages Rust offers, highlight the challenges faced, and emphasize the need for community consensus on Rust's adoption. Results: Our findings suggest that while the initial implementations of Rust in the kernel show promising results in terms of safety and stability, significant challenges remain. These challenges include achieving seamless interoperability with existing kernel components, maintaining performance, and ensuring adequate support and tooling for developers. Conclusions: This study underscores the need for continued research and practical implementation efforts to fully realize the benefits of Rust. By addressing these challenges, the integration of Rust could mark a significant step forward in the evolution of operating system development towards safer and more reliable systems.
Operating System And Artificial Intelligence: A Systematic Review
Yifan Zhang, Xinkui Zhao, Jianwei Yin, Lufei Zhang, Zuoning Chen
Jul 23 2024 cs.OS cs.AI arXiv:2407.14567v1

@misc{2407.14567, author = {Yifan Zhang and Xinkui Zhao and Jianwei Yin and Lufei Zhang and Zuoning Chen}, title = {{O}perating {S}ystem {A}nd {A}rtificial {I}ntelligence: {A} {S}ystematic {R}eview}, year = {2024}, eprint = {2407.14567}, note = {arXiv:2407.14567v1} }
In the dynamic landscape of technology, the convergence of Artificial Intelligence (AI) and Operating Systems (OS) has emerged as a pivotal arena for innovation. Our exploration focuses on the symbiotic relationship between AI and OS, emphasizing how AI-driven tools enhance OS performance, security, and efficiency, while OS advancements facilitate more sophisticated AI applications. We delve into various AI techniques employed to optimize OS functionalities, including memory management, process scheduling, and intrusion detection. Simultaneously, we analyze the role of OS in providing essential services and infrastructure that enable effective AI application execution, from resource allocation to data processing. The article also addresses challenges and future directions in this domain, emphasizing the imperative of secure and efficient AI integration within OS frameworks. By examining case studies and recent developments, our review provides a comprehensive overview of the current state of AI-OS integration, underscoring its significance in shaping the next generation of computing technologies. Finally, we explore the promising prospects of Intelligent OSes, considering not only how innovative OS architectures will pave the way for groundbreaking opportunities but also how AI will significantly contribute to advancing these next-generation OSs.
Accelerator-as-a-Service in Public Clouds: An Intra-Host Traffic Management View for Performance Isolation in the Wild
Jiechen Zhao, Ran Shu, Katie Lim, Zewen Fan, Thomas Anderson, Mingyu Gao, Natalie Enright Jerger
Jul 16 2024 cs.OS cs.AR cs.DC cs.NI cs.PF arXiv:2407.10098v1

@misc{2407.10098, author = {Jiechen Zhao and Ran Shu and Katie Lim and Zewen Fan and Thomas Anderson and Mingyu Gao and Natalie Enright Jerger}, title = {{A}ccelerator-as-a-{S}ervice in {P}ublic {C}louds: {A}n {I}ntra-{H}ost {T}raffic {M}anagement {V}iew for {P}erformance {I}solation in the {W}ild}, year = {2024}, eprint = {2407.10098}, note = {arXiv:2407.10098v1} }
I/O devices in public clouds have integrated increasing numbers of hardware accelerators, e.g., AWS Nitro, Azure FPGA and Nvidia BlueField. However, such specialized compute (1) is not explicitly accessible to cloud users with performance guarantees, and (2) cannot be leveraged simultaneously by both providers and users, unlike general-purpose compute (e.g., CPUs). Through ten observations, we show that the fundamental difficulty of democratizing accelerators is insufficient performance isolation support. The key obstacles to enforcing accelerator isolation are (1) too many unknown traffic patterns in public clouds and (2) too many possible contention sources in the datapath. In this work, instead of scheduling such complex traffic on-the-fly and augmenting isolation support on each system component, we propose to model traffic as network flows and proactively re-shape the traffic to avoid unpredictable contention. We discuss the implications of our findings on the design of future I/O management stacks and device interfaces.
A parallel evolutionary algorithm to optimize dynamic memory managers in embedded systems
José L. Risco-Martín, David Atienza, J. Manuel Colmenar, Oscar Garnica
Jul 16 2024 cs.NE cs.OS arXiv:2407.09555v1

@misc{2407.09555, author = {José L.~Risco-Martín and David Atienza and J.~Manuel Colmenar and Oscar Garnica}, title = {{A} parallel evolutionary algorithm to optimize dynamic memory managers in embedded systems}, year = {2024}, eprint = {2407.09555}, howpublished = {Parallel Computing, 36(10-11), pp. 572-590, 2010}, doi = {10.1016/j.parco.2010.07.001}, note = {arXiv:2407.09555v1} }
Over the last thirty years, several Dynamic Memory Managers (DMMs) have been proposed. Such DMMs include first fit, best fit, segregated fit and buddy systems. Since the performance, memory usage and energy consumption of each DMM differ, software engineers often face difficult choices in selecting the most suitable approach for their applications. This issue has a special impact in the field of portable consumer embedded systems, which must execute a limited number of multimedia applications (e.g., 3D games, video players and signal processing software), demanding high performance and extensive memory usage at low energy consumption. Recently, we have developed a novel methodology based on genetic programming to automatically design custom DMMs, optimizing performance, memory usage and energy consumption. However, although this process is automatic and faster than state-of-the-art optimizations, it demands intensive computation, resulting in a time-consuming process. Thus, parallel processing can be very useful for exploring more solutions in the same time, as well as for implementing new algorithms. In this paper we present a novel parallel evolutionary algorithm for DMM optimization in embedded systems, based on the Discrete Event Specification (DEVS) formalism over a Service Oriented Architecture (SOA) framework. Parallelism significantly improves the performance of the sequential exploration algorithm. On the one hand, when the number of generations is the same in both approaches, our parallel optimization framework is able to reach a speed-up of 86.40x when compared with other state-of-the-art approaches. On the other hand, it improves the global quality (i.e., level of performance, low memory usage and low energy consumption) of the final DMM obtained by 36.36% with respect to two well-known general-purpose DMMs and two state-of-the-art optimization methodologies.
Data-driven Software-based Power Estimation for Embedded Devices
Haoyu Wang, Xinyi Li, Ti Zhou, Man Lin
Jul 04 2024 cs.OS arXiv:2407.02764v1

@misc{2407.02764, author = {Haoyu Wang and Xinyi Li and Ti Zhou and Man Lin}, title = {{D}ata-driven {S}oftware-based {P}ower {E}stimation for {E}mbedded {D}evices}, year = {2024}, eprint = {2407.02764}, note = {arXiv:2407.02764v1} }
Energy measurement of computer devices, which are widely used in the Internet of Things (IoT), is an important yet challenging task. Most of these IoT devices lack ready-to-use hardware or software for power measurement. A cost-effective solution is to use low-end consumer-grade power meters. However, these low-end power meters cannot provide accurate instantaneous power measurements. In this paper, we propose an easy-to-use approach to derive an instantaneous software-based energy estimation model with only low-end power meters based on data-driven analysis through machine learning. Our solution is demonstrated with a Jetson Nano board and Ruideng UM25C USB power meter. Various machine learning methods combined with our smart data collection method and physical measurement are explored. Benchmarks were used to evaluate the derived software-power model for the Jetson Nano board and Raspberry Pi. The results show that 92% accuracy can be achieved compared to the long-duration measurement. A kernel module that can collect running traces of utilization and frequencies needed is developed, together with the power model derived, for power prediction for programs running in real environment.
Imaginary Machines: A Serverless Model for Cloud Applications
Michael Wawrzoniak, Rodrigo Bruno, Ana Klimovic, Gustavo Alonso
Jul 02 2024 cs.DC cs.NI cs.OS arXiv:2407.00839v1

@misc{2407.00839, author = {Michael Wawrzoniak and Rodrigo Bruno and Ana Klimovic and Gustavo Alonso}, title = {{I}maginary {M}achines: {A} {S}erverless {M}odel for {C}loud {A}pplications}, year = {2024}, eprint = {2407.00839}, note = {arXiv:2407.00839v1} }
Serverless Function-as-a-Service (FaaS) platforms provide applications with resources that are highly elastic, quick to instantiate, accounted at fine granularity, and without the need for explicit runtime resource orchestration. This combination of the core properties underpins the success and popularity of the serverless FaaS paradigm. However, these benefits are not available to most cloud applications because they are designed for networked virtual machines/containers environments. Since such cloud applications cannot take advantage of the highly elastic resources of serverless and require run-time orchestration systems to operate, they suffer from lower resource utilization, additional management complexity, and costs relative to their FaaS serverless counterparts. We propose Imaginary Machines, a new serverless model for cloud applications. This model (1.) exposes the highly elastic resources of serverless platforms as the traditional network-of-hosts model that cloud applications expect, and (2.) it eliminates the need for explicit run-time orchestration by transparently managing application resources based on signals generated during cloud application executions. With the Imaginary Machines model, unmodified cloud applications become serverless applications. While still based on the network-of-host model, they benefit from the highly elastic resources and do not require runtime orchestration, just like their specialized serverless FaaS counterparts, promising increased resource utilization while reducing management costs.
Boxer: FaaSt Ephemeral Elasticity for Off-the-Shelf Cloud Applications
Michael Wawrzoniak, Rodrigo Bruno, Ana Klimovic, Gustavo Alonso
Jul 02 2024 cs.DC cs.NI cs.OS arXiv:2407.00832v1

@misc{2407.00832, author = {Michael Wawrzoniak and Rodrigo Bruno and Ana Klimovic and Gustavo Alonso}, title = {{B}oxer: {F}aa{S}t {E}phemeral {E}lasticity for {O}ff-the-{S}helf {C}loud {A}pplications}, year = {2024}, eprint = {2407.00832}, note = {arXiv:2407.00832v1} }
Elasticity is a key property of cloud computing. However, elasticity is offered today at the granularity of virtual machines, which take tens of seconds to start. This is insufficient to react to load spikes and sudden failures in latency sensitive applications, leading users to resort to expensive overprovisioning. Function-as-a-Service (FaaS) provides significantly higher elasticity than VMs, but comes coupled with an event-triggered programming model and a constrained execution environment that makes them unsuitable for off-the-shelf applications. Previous work tries to overcome these obstacles but often requires re-architecting the applications. In this paper, we show how off-the-shelf applications can transparently benefit from ephemeral elasticity with FaaS. We built Boxer, an interposition layer spanning VMs and AWS Lambda, that intercepts application execution and emulates the network-of-hosts environment that applications expect when deployed in a conventional VM/container environment. The ephemeral elasticity of Boxer enables significant performance and cost savings for off-the-shelf applications with, e.g., recovery times over 5x faster than EC2 instances and absorbing load spikes comparable to overprovisioned EC2 VM instances.
FastMig: Leveraging FastFreeze to Establish Robust Service Liquidity in Cloud 2.0
Sorawit Manatura, Thanawat Chanikaphon, Chantana Chantrapornchai, Mohsen Amini Salehi
Jul 02 2024 cs.DC cs.OS arXiv:2407.00313v1

@misc{2407.00313, author = {Sorawit Manatura and Thanawat Chanikaphon and Chantana Chantrapornchai and Mohsen Amini Salehi}, title = {{F}ast{M}ig: {L}everaging {F}ast{F}reeze to {E}stablish {R}obust {S}ervice {L}iquidity in {C}loud 2.0}, year = {2024}, eprint = {2407.00313}, note = {arXiv:2407.00313v1} }
Service liquidity across edge-to-cloud or multi-cloud will serve as the cornerstone of the next generation of cloud computing systems (Cloud 2.0). Provided that cloud-based services are predominantly containerized, an efficient and robust live container migration solution is required to accomplish service liquidity. In a nod to this growing requirement, in this research, we leverage FastFreeze, a popular platform for process checkpoint/restore within a container, and promote it to be a robust solution for end-to-end live migration of containerized services. In particular, we develop a new platform, called FastMig, that proactively controls the checkpoint/restore operations of FastFreeze, thereby allowing for robust live migration of containerized services via standard HTTP interfaces. The proposed platform introduces post-checkpointing and pre-restoration operations to enhance migration robustness. Notably, the pre-restoration operation includes containerized service startup options, enabling warm restoration and reducing the migration downtime. In addition, we develop a method to make FastFreeze robust against failures that commonly happen during the migration and even during the normal operation of a containerized service. Experimental results under real-world settings show that the migration downtime of a containerized service can be reduced by 30X compared to the situation where the original FastFreeze was deployed for the migration. Moreover, we demonstrate that FastMig and the warm restoration method together can significantly mitigate the container startup overhead. Importantly, these improvements are achieved without any significant performance reduction and incur only a small resource usage overhead, compared to the bare (i.e., non-FastFreeze) containerized services.