Operating Systems (cs.OS)

Securing Monolithic Kernels using Compartmentalization
Soo Yee Lim, Sidhartha Agrawal, Xueyuan Han, David Eyers, Dan O'Keeffe, Thomas Pasquier
Apr 16 2024 cs.CR cs.OS arXiv:2404.08716v1

@misc{2404.08716, author = {Soo Yee Lim and Sidhartha Agrawal and Xueyuan Han and David Eyers and Dan O'Keeffe and Thomas Pasquier}, title = {{S}ecuring {M}onolithic {K}ernels using {C}ompartmentalization}, year = {2024}, eprint = {2404.08716}, note = {arXiv:2404.08716v1} }
Monolithic operating systems, where all kernel functionality resides in a single, shared address space, are the foundation of most mainstream computer systems. However, a single flaw, even in a non-essential part of the kernel (e.g., device drivers), can cause the entire operating system to fall under an attacker's control. Kernel hardening techniques might prevent certain types of vulnerabilities, but they fail to address a fundamental weakness: the lack of intra-kernel security that safely isolates different parts of the kernel. We survey kernel compartmentalization techniques that define and enforce intra-kernel boundaries and propose a taxonomy that allows the community to compare and discuss future work. We also identify factors that complicate comparisons among compartmentalized systems, suggest new ways to compare future approaches with existing work meaningfully, and discuss emerging research directions.
Memory Sharing with CXL: Hardware and Software Design Approaches
Sunita Jain, Nagaradhesh Yeleswarapu, Hasan Al Maruf, Rita Gupta
Apr 05 2024 cs.ET cs.OS arXiv:2404.03245v1

@misc{2404.03245, author = {Sunita Jain and Nagaradhesh Yeleswarapu and Hasan Al Maruf and Rita Gupta}, title = {{M}emory {S}haring with {CXL}: {H}ardware and {S}oftware {D}esign {A}pproaches}, year = {2024}, eprint = {2404.03245}, note = {arXiv:2404.03245v1} }
Compute Express Link (CXL) is a rapidly emerging coherent interconnect standard that provides opportunities for memory pooling and sharing. Memory sharing is a well-established software feature that improves memory utilization by avoiding unnecessary data movement. In this paper, we discuss multiple approaches to enable memory sharing with different generations of CXL protocol (i.e., CXL 2.0 and CXL 3.0) considering the challenges with each of the architectures from the device hardware and software viewpoint.
AIOps Solutions for Incident Management: Technical Guidelines and A Comprehensive Literature Review
Youcef Remil, Anes Bendimerad, Romain Mathonat, Mehdi Kaytoue
Apr 03 2024 cs.OS cs.AI cs.SE arXiv:2404.01363v1

@misc{2404.01363, author = {Youcef Remil and Anes Bendimerad and Romain Mathonat and Mehdi Kaytoue}, title = {{AIO}ps {S}olutions for {I}ncident {M}anagement: {T}echnical {G}uidelines and {A} {C}omprehensive {L}iterature {R}eview}, year = {2024}, eprint = {2404.01363}, note = {arXiv:2404.01363v1} }
The management of modern IT systems poses unique challenges, necessitating scalability, reliability, and efficiency in handling extensive data streams. Traditional methods, reliant on manual tasks and rule-based approaches, prove inefficient for the substantial data volumes and alerts generated by IT systems. Artificial Intelligence for Operating Systems (AIOps) has emerged as a solution, leveraging advanced analytics like machine learning and big data to enhance incident management. AIOps detects and predicts incidents, identifies root causes, and automates healing actions, improving quality and reducing operational costs. However, despite its potential, the AIOps domain is still in its early stages, decentralized across multiple sectors, and lacking standardized conventions. Research and industrial contributions are distributed without consistent frameworks for data management, target problems, implementation details, requirements, and capabilities. This study proposes an AIOps terminology and taxonomy, establishing a structured incident management procedure and providing guidelines for constructing an AIOps framework. The research also categorizes contributions based on criteria such as incident management tasks, application areas, data sources, and technical approaches. The goal is to provide a comprehensive review of technical and research aspects in AIOps for incident management, aiming to structure knowledge, identify gaps, and establish a foundation for future developments in the field.
THEMIS: Time, Heterogeneity, and Energy Minded Scheduling for Fair Multi-Tenant Use in FPGAs
Emre Karabulut, Arsalan Ali Malik, Amro Awad, Aydin Aysu
Apr 02 2024 cs.OS cs.DC arXiv:2404.00507v1

@misc{2404.00507, author = {Emre Karabulut and Arsalan Ali Malik and Amro Awad and Aydin Aysu}, title = {{THEMIS}: {T}ime, {H}eterogeneity, and {E}nergy {M}inded {S}cheduling for {F}air {M}ulti-{T}enant {U}se in {FPGA}s}, year = {2024}, eprint = {2404.00507}, note = {arXiv:2404.00507v1} }
Using correct design metrics and understanding the limitations of the underlying technology is critical to developing effective scheduling algorithms. Unfortunately, existing scheduling techniques used \emph{incorrect} metrics and had \emph{unrealistic} assumptions for fair scheduling of multi-tenant FPGAs where each tenant is aimed to share approximately the same number of resources both spatially and temporally. This paper introduces an enhanced fair scheduling algorithm for multi-tenant FPGA use, addressing previous metric and assumption issues, with three specific improvements claimed. First, our method ensures spatiotemporal fairness by considering both spatial and temporal aspects, addressing the limitation of prior work that assumed uniform task latency. Second, we incorporate energy considerations into fairness by adjusting scheduling intervals and accounting for energy overhead, thereby balancing energy efficiency with fairness. Third, we acknowledge overlooked aspects of FPGA multi-tenancy, including heterogeneous regions and the constraints on dynamically merging/splitting partially reconfigurable regions. We develop and evaluate our improved fair scheduling algorithm with these three enhancements. Inspired by the Greek goddess of law and personification of justice, we name our fair scheduling solution THEMIS: Time, Heterogeneity, and Energy Minded Scheduling. We used the Xilinx Zedboard XC7Z020 to quantify our approach's savings. Compared to previous algorithms, our improved scheduling algorithm enhances fairness between 24.2--98.4\% and allows a trade-off between 55.3$\times$ in energy vs. 69.3$\times$ in fairness. The paper thus informs cloud providers about future scheduling optimizations for fairness with related challenges and opportunities.
PerOS: Personalized Self-Adapting Operating Systems in the Cloud
Hongyu Hè
Apr 02 2024 cs.HC cs.AI cs.CR cs.OS arXiv:2404.00057v1

@misc{2404.00057, author = {Hongyu Hè}, title = {{P}er{OS}: {P}ersonalized {S}elf-{A}dapting {O}perating {S}ystems in the {C}loud}, year = {2024}, eprint = {2404.00057}, note = {arXiv:2404.00057v1} }
Operating systems (OSes) are foundational to computer systems, managing hardware resources and ensuring secure environments for diverse applications. However, despite their enduring importance, the fundamental design objectives of OSes have seen minimal evolution over decades. Traditionally prioritizing aspects like speed, memory efficiency, security, and scalability, these objectives often overlook the crucial aspect of intelligence as well as personalized user experience. The lack of intelligence becomes increasingly critical amid technological revolutions, such as the remarkable advancements in machine learning (ML). Today's personal devices, evolving into intimate companions for users, pose unique challenges for traditional OSes like Linux and iOS, especially with the emergence of specialized hardware featuring heterogeneous components. Furthermore, the rise of large language models (LLMs) in ML has introduced transformative capabilities, reshaping user interactions and software development paradigms. While existing literature predominantly focuses on leveraging ML methods for system optimization or accelerating ML workloads, there is a significant gap in addressing personalized user experiences at the OS level. To tackle this challenge, this work proposes PerOS, a personalized OS ingrained with LLM capabilities. PerOS aims to provide tailored user experiences while safeguarding privacy and personal data through declarative interfaces, self-adaptive kernels, and secure data management in a scalable cloud-centric architecture; therein lies the main research question of this work: How can we develop intelligent, secure, and scalable OSes that deliver personalized experiences to thousands of users?
AIOS: LLM Agent Operating System
Kai Mei, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, Yongfeng Zhang
Mar 26 2024 cs.OS cs.AI cs.CL arXiv:2403.16971v2

@misc{2403.16971, author = {Kai Mei and Zelong Li and Shuyuan Xu and Ruosong Ye and Yingqiang Ge and Yongfeng Zhang}, title = {{AIOS}: {LLM} {A}gent {O}perating {S}ystem}, year = {2024}, eprint = {2403.16971}, note = {arXiv:2403.16971v2} }
The integration and deployment of large language model (LLM)-based intelligent agents have been fraught with challenges that compromise their efficiency and efficacy. Among these issues are sub-optimal scheduling and resource allocation of agent requests over the LLM, the difficulties in maintaining context during interactions between agent and LLM, and the complexities inherent in integrating heterogeneous agents with different capabilities and specializations. The rapid increase of agent quantity and complexity further exacerbates these issues, often leading to bottlenecks and sub-optimal utilization of resources. Inspired by these challenges, this paper presents AIOS, an LLM agent operating system, which embeds large language model into operating systems (OS) as the brain of the OS, enabling an operating system "with soul" -- an important step towards AGI. Specifically, AIOS is designed to optimize resource allocation, facilitate context switch across agents, enable concurrent execution of agents, provide tool service for agents, and maintain access control for agents. We present the architecture of such an operating system, outline the core challenges it aims to resolve, and provide the basic design and implementation of the AIOS. Our experiments on concurrent execution of multiple agents demonstrate the reliability and efficiency of our AIOS modules. Through this, we aim to not only improve the performance and efficiency of LLM agents but also to pioneer for better development and deployment of the AIOS ecosystem in the future. The project is open-source at https://github.com/agiresearch/AIOS.
UPSS: a User-centric Private Storage System with its applications
Arastoo Bozorgi, Mahya Soleimani Jadidi, Jonathan Anderson
Mar 26 2024 cs.CR cs.OS arXiv:2403.15884v1

@misc{2403.15884, author = {Arastoo Bozorgi and Mahya Soleimani Jadidi and Jonathan Anderson}, title = {{UPSS}: a {U}ser-centric {P}rivate {S}torage {S}ystem with its applications}, year = {2024}, eprint = {2403.15884}, doi = {10.5220/0012306600003648}, note = {arXiv:2403.15884v1} }
Strong confidentiality, integrity, user control, reliability and performance are critical requirements in privacy-sensitive applications. Such applications would benefit from a data storage and sharing infrastructure that provides these properties even in decentralized topologies with untrusted storage backends, but users today are forced to choose between systemic security properties and system reliability or performance. As an alternative to this status quo we present UPSS: the user-centric private sharing system, a cryptographic storage system that can be used as a conventional filesystem or as the foundation for security-sensitive applications such as redaction with integrity and private revision control. We demonstrate that both the security and performance properties of UPSS exceed those of existing cryptographic filesystems and that its performance is comparable to mature conventional filesystems - in some cases, even superior. Whether used directly via its Rust API or as a conventional filesystem, UPSS provides strong security and practical performance on untrusted storage.
LLM as a System Service on Mobile Devices
Wangsong Yin, Mengwei Xu, Yuanchun Li, Xuanzhe Liu
Mar 20 2024 cs.OS arXiv:2403.11805v1

@misc{2403.11805, author = {Wangsong Yin and Mengwei Xu and Yuanchun Li and Xuanzhe Liu}, title = {{LLM} as a {S}ystem {S}ervice on {M}obile {D}evices}, year = {2024}, eprint = {2403.11805}, note = {arXiv:2403.11805v1} }
Being more powerful and intrusive into user-device interactions, LLMs are eager for on-device execution to better preserve user privacy. In this work, we propose a new paradigm of mobile AI: LLM as a system service on mobile devices (LLMaaS). Unlike traditional DNNs that execute in a stateless manner, such a system service is stateful: LLM execution often needs to maintain persistent states (mainly KV cache) across multiple invocations. To minimize the LLM context switching overhead under a tight device memory budget, this work presents LLMS, which decouples the memory management of app and LLM contexts with a key idea of fine-grained, chunk-wise, globally-optimized KV cache compression and swapping. By fully leveraging KV cache's unique characteristics, it proposes three novel techniques: (1) Tolerance-Aware Compression: it compresses chunks based on their measured accuracy tolerance to compression. (2) IO-Recompute Pipelined Loading: it introduces recompute to swapping-in for acceleration. (3) Chunk Lifecycle Management: it optimizes the memory activities of chunks with an ahead-of-time swapping-out and an LCTRU (Least Compression-Tolerable and Recently-Used) queue based eviction. In evaluations conducted on well-established traces and various edge devices, LLMS reduces context switching latency by up to 2 orders of magnitude when compared to competitive baseline solutions.
Physical Memory Attacks and a Memory Safe Management System for Memory Defense
Alon Hillel-Tuch, Aspen Olmstead
Mar 14 2024 cs.CR cs.OS arXiv:2403.08656v1

@misc{2403.08656, author = {Alon Hillel-Tuch and Aspen Olmstead}, title = {{P}hysical {M}emory {A}ttacks and a {M}emory {S}afe {M}anagement {S}ystem for {M}emory {D}efense}, year = {2024}, eprint = {2403.08656}, howpublished = {Springer Nature - Book Series: Transactions on Computational Science \& Computational Intelligence 2022}, note = {arXiv:2403.08656v1} }
Programming errors, defective hardware components (such as hard disk spindle defects), and environmental hazards can lead to invalid memory operations. In addition, less predictable forms of environmental stress, such as radiation, thermal influence, and energy fluctuations, can induce hardware faults. Sometimes, a soft error can occur instead of a complete failure, such as a bit-flip. The 'natural' factors that can cause bit-flips are replicable through targeted attacks that result in significant compromises, including full privileged system access. Existing physical defense solutions have consistently been circumvented shortly after deployment. We will explore the concept of a novel software-based low-level layer that can protect vulnerable memory targeted by physical attack vectors related to bit-flip vulnerabilities.
Next4: Snapshots in Ext4 File System
Aditya Dani, Shardul Mangade, Piyush Nimbalkar, Harshad Shirwadkar
Mar 12 2024 cs.OS arXiv:2403.06790v1

@misc{2403.06790, author = {Aditya Dani and Shardul Mangade and Piyush Nimbalkar and Harshad Shirwadkar}, title = {{N}ext4: {S}napshots in {E}xt4 {F}ile {S}ystem}, year = {2024}, eprint = {2403.06790}, note = {arXiv:2403.06790v1} }
The growing value of data as a strategic asset has given rise to the necessity of implementing reliable backup and recovery solutions in the most efficient and cost-effective manner. The data backup methods available today on Linux are not effective enough because, while running, most of them block I/Os to guarantee data integrity. We propose and implement Next4, a file-system-based snapshot feature in Ext4 that creates an instant image of the file system to provide incremental versions of data, enabling reliable backup and data recovery. In our design, the snapshot feature is implemented by efficiently infusing the copy-on-write strategy in the write-in-place, extent-based Ext4 file system, without affecting its basic structure. Each snapshot is an incremental backup of the data within the system. What distinguishes Next4 is the way that the data is backed up, improving both space utilization and performance.
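The copy-on-write idea mentioned in this abstract can be illustrated with a minimal in-memory sketch. It assumes a toy block store in which a block's old contents are preserved in the most recent snapshot the first time the block is overwritten after that snapshot. The class `BlockStore` and its methods are hypothetical names chosen for illustration; this is not Next4's on-disk design.

```python
class BlockStore:
    """Toy write-in-place block store with incremental copy-on-write snapshots."""

    def __init__(self, n_blocks):
        self.blocks = [b""] * n_blocks
        # One dict per snapshot: block number -> contents preserved for it.
        self.snapshots = []

    def snapshot(self):
        # Taking a snapshot copies no data; it only opens an empty record.
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, block_no, data):
        # Copy-on-write: preserve the old contents in the latest snapshot,
        # but only on the first overwrite after that snapshot was taken.
        if self.snapshots and block_no not in self.snapshots[-1]:
            self.snapshots[-1][block_no] = self.blocks[block_no]
        self.blocks[block_no] = data  # the live data is still updated in place

    def read_snapshot(self, snap_id, block_no):
        # The block's state at snapshot time is the first copy preserved in
        # this snapshot or a later one; if none exists, the block is unchanged.
        for saved in self.snapshots[snap_id:]:
            if block_no in saved:
                return saved[block_no]
        return self.blocks[block_no]


store = BlockStore(4)
store.write(0, b"a")
first = store.snapshot()
store.write(0, b"b")
second = store.snapshot()
store.write(0, b"c")
store.write(1, b"x")
```

Because each snapshot stores only the blocks that changed after it was taken, snapshots are incremental in this sketch: `store.read_snapshot(first, 0)` falls back to later snapshots or the live block when a block was never overwritten.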
I/O Transit Caching for PMem-based Block Device
Qing Xu, Qisheng Jiang, Chundong Wang
Mar 12 2024 cs.AR cs.ET cs.OS arXiv:2403.06120v1

@misc{2403.06120, author = {Qing Xu and Qisheng Jiang and Chundong Wang}, title = {{I}/{O} {T}ransit {C}aching for {PM}em-based {B}lock {D}evice}, year = {2024}, eprint = {2403.06120}, note = {arXiv:2403.06120v1} }
Byte-addressable non-volatile memory (NVM) sitting on the memory bus is employed to make persistent memory (PMem) in general-purpose computing systems and embedded systems for data storage. Researchers develop software drivers such as the block translation table (BTT) to build block devices on PMem, so programmers can keep using the mature and reliable conventional storage stack while expecting high performance by exploiting fast PMem. However, our quantitative study shows that BTT underutilizes PMem and yields inferior performance, due to the absence of the imperative in-device cache. We add a conventional I/O staging cache made of DRAM space to BTT. As DRAM and PMem have comparable access latency, the I/O staging cache is likely to be fully filled over time. Continual cache evictions and fsyncs thus cause on-demand flushes with severe stalls, such that the I/O staging cache is concretely unappealing for PMem-based block devices. We accordingly propose an algorithm named Caiti with novel I/O transit caching. Caiti eagerly evicts buffered data to PMem using the CPU's multiple cores. It also conditionally bypasses a full cache and directly writes data into PMem to further alleviate I/O stalls. Experiments confirm that Caiti significantly boosts the performance with BTT by up to 3.6x, without loss of block-level write atomicity.
Virtuoso: An Open-Source, Comprehensive and Modular Simulation Framework for Virtual Memory Research
Konstantinos Kanellopoulos, Konstantinos Sgouras, Onur Mutlu
Mar 08 2024 cs.AR cs.OS arXiv:2403.04635v1

@misc{2403.04635, author = {Konstantinos Kanellopoulos and Konstantinos Sgouras and Onur Mutlu}, title = {{V}irtuoso: {A}n {O}pen-{S}ource, {C}omprehensive and {M}odular {S}imulation {F}ramework for {V}irtual {M}emory {R}esearch}, year = {2024}, eprint = {2403.04635}, note = {arXiv:2403.04635v1} }
Virtual memory is a cornerstone of modern computing systems. Introduced as one of the earliest instances of hardware-software co-design, VM facilitates programmer-transparent memory management, data sharing, process isolation and memory protection. Evaluating the efficiency of various virtual memory (VM) designs is crucial (i) given their significant impact on the system, including the CPU caches, the main memory, and the storage device and (ii) given that different system architectures might benefit from various VM techniques. Such an evaluation is not straightforward, as it heavily hinges on modeling the interplay between different VM techniques and the interactions of VM with the system architecture. Modern simulators, however, struggle to keep up with the rapid VM research developments, lacking the capability to model a wide range of contemporary VM techniques and their interactions. To this end, we present Virtuoso, an open-source, comprehensive and modular simulation framework that models various VM designs to establish a common ground for virtual memory research. We demonstrate the versatility and the potential of Virtuoso with four new case studies. Virtuoso is freely open-source and can be found at https://github.com/CMU-SAFARI/Virtuoso.
Formal Definitions and Performance Comparison of Consistency Models for Parallel File Systems
Chen Wang, Kathryn Mohror, Marc Snir
Feb 23 2024 cs.DC cs.OS arXiv:2402.14105v2

@misc{2402.14105, author = {Chen Wang and Kathryn Mohror and Marc Snir}, title = {{F}ormal {D}efinitions and {P}erformance {C}omparison of {C}onsistency {M}odels for {P}arallel {F}ile {S}ystems}, year = {2024}, eprint = {2402.14105}, note = {arXiv:2402.14105v2} }
The semantics of HPC storage systems are defined by the consistency models to which they abide. Storage consistency models have been less studied than their counterparts in memory systems, with the exception of the POSIX standard and its strict consistency model. The use of POSIX consistency imposes a performance penalty that becomes more significant as the scale of parallel file systems increases and the access time to storage devices, such as node-local solid storage devices, decreases. While some efforts have been made to adopt relaxed storage consistency models, these models are often defined informally and ambiguously as by-products of a particular implementation. In this work, we establish a connection between memory consistency models and storage consistency models and revisit the key design choices of storage consistency models from a high-level perspective. Further, we propose a formal and unified framework for defining storage consistency models and a layered implementation that can be used to easily evaluate their relative performance for different I/O workloads. Finally, we conduct a comprehensive performance comparison of two relaxed consistency models on a range of commonly-seen parallel I/O workloads, such as checkpoint/restart of scientific applications and random reads of deep learning applications. We demonstrate that for certain I/O scenarios, a weaker consistency model can significantly improve the I/O performance. For instance, in small random reads that are typically found in deep learning applications, session consistency achieved a 5x improvement in I/O bandwidth compared to commit consistency, even at small scales.
Fight Hardware with Hardware: System-wide Detection and Mitigation of Side-Channel Attacks using Performance Counters
Stefano Carnà, Serena Ferracci, Francesco Quaglia, Alessandro Pellegrini
Feb 22 2024 cs.CR cs.OS arXiv:2402.13281v1

@misc{2402.13281, author = {Stefano Carnà and Serena Ferracci and Francesco Quaglia and Alessandro Pellegrini}, title = {{F}ight {H}ardware with {H}ardware: {S}ystem-wide {D}etection and {M}itigation of {S}ide-{C}hannel {A}ttacks using {P}erformance {C}ounters}, year = {2024}, eprint = {2402.13281}, note = {arXiv:2402.13281v1} }
We present a kernel-level infrastructure that allows system-wide detection of malicious applications attempting to exploit cache-based side-channel attacks to break the process confinement enforced by standard operating systems. This infrastructure relies on hardware performance counters to collect information at runtime from all applications running on the machine. High-level detection metrics are derived from these measurements to maximize the likelihood of promptly detecting a malicious application. Our experimental assessment shows that we can catch a large family of side-channel attacks with a significantly reduced overhead. We also discuss countermeasures that can be enacted once a process is suspected of carrying out a side-channel attack to increase the overall tradeoff between the system's security level and the delivered performance under non-suspected process executions.
A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules
Jinhu Jiang, Chaoyi Liang, Rongchao Dong, Zhaohui Yang, Zhongjun Zhou, Wenwen Wang, Pen-Chung Yew, Weihua Zhang
Feb 16 2024 cs.OS cs.PF arXiv:2402.09688v1

@misc{2402.09688, author = {Jinhu Jiang and Chaoyi Liang and Rongchao Dong and Zhaohui Yang and Zhongjun Zhou and Wenwen Wang and Pen-Chung Yew and Weihua Zhang}, title = {{A} {S}ystem-{L}evel {D}ynamic {B}inary {T}ranslator using {A}utomatically-{L}earned {T}ranslation {R}ules}, year = {2024}, eprint = {2402.09688}, note = {arXiv:2402.09688v1} }
System-level emulators have been used extensively for system design, debugging and evaluation. They work by providing a system-level virtual machine to support a guest operating system (OS) running on a platform with the same or different native OS that uses the same or different instruction-set architecture. For such system-level emulation, dynamic binary translation (DBT) is one of the core technologies. A recently proposed learning-based DBT approach has shown a significantly improved performance with a higher quality of translated code using automatically learned translation rules. However, it has only been applied to user-level emulation, and not yet to system-level emulation. In this paper, we explore the feasibility of applying this approach to improve system-level emulation, and use QEMU to build a prototype. ... To achieve better performance, we leverage several optimizations that include coordination overhead reduction to reduce the overhead of each coordination, and coordination elimination and code scheduling to reduce the coordination frequency. Experimental results show that it can achieve an average of 1.36X speedup over QEMU 6.1 with negligible coordination overhead in the system emulation mode using SPEC CINT2006 as application benchmarks and 1.15X on real-world applications.
Enabling performance portability of data-parallel OpenMP applications on asymmetric multicore processors
Juan Carlos Saez, Fernando Castro, Manuel Prieto-Matias
Feb 13 2024 cs.DC cs.OS arXiv:2402.07664v1

@misc{2402.07664, author = {Juan Carlos Saez and Fernando Castro and Manuel Prieto-Matias}, title = {{E}nabling performance portability of data-parallel {O}pen{MP} applications on asymmetric multicore processors}, year = {2024}, eprint = {2402.07664}, howpublished = {Proceedings of the 49th International Conference on Parallel Processing (ICPP 2020). ACM, Article 51, 1-11}, doi = {10.1145/3404397.3404441}, note = {arXiv:2402.07664v1} }
Asymmetric multicore processors (AMPs) couple high-performance big cores and low-power small cores with the same instruction-set architecture but different features, such as clock frequency or microarchitecture. Previous work has shown that asymmetric designs may deliver higher energy efficiency than symmetric multicores for diverse workloads. Despite their benefits, AMPs pose significant challenges to runtime systems of parallel programming models. While previous work has mainly explored how to efficiently execute task-based parallel applications on AMPs, via enhancements in the runtime system, improving the performance of unmodified data-parallel applications on these architectures is still a big challenge. In this work we analyze the particular case of loop-based OpenMP applications, which are widely used today in scientific and engineering domains, and constitute the dominant application type in many parallel benchmark suites used for performance evaluation on multicore systems. We observed that conventional loop-scheduling OpenMP approaches are unable to efficiently cope with the load imbalance that naturally stems from the different performance delivered by big and small cores. To address this shortcoming, we propose \textit{Asymmetric Iteration Distribution} (AID), a set of novel loop-scheduling methods for AMPs that distribute iterations unevenly across worker threads to efficiently deal with performance asymmetry. We implemented AID in \textit{libgomp} --the GNU OpenMP runtime system--, and evaluated it on two different asymmetric multicore platforms. Our analysis reveals that the AID methods constitute effective replacements of the \texttt{static} and \texttt{dynamic} methods on AMPs, and are capable of improving performance over these conventional strategies by up to 56\% and 16.8\%, respectively.
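To make the idea of distributing iterations unevenly across worker threads concrete, here is a minimal sketch that splits a loop's iteration space into contiguous chunks sized in proportion to an assumed relative speed of the core each thread runs on. The function name `asymmetric_split` and the 2:1 big-to-small speed ratio are hypothetical; this is not the AID implementation in libgomp, which the abstract does not detail.

```python
def asymmetric_split(n_iters, speeds):
    """Split range(n_iters) into one contiguous (start, end) chunk per thread,
    with chunk sizes roughly proportional to each thread's relative speed."""
    total = sum(speeds)
    bounds = []
    start = 0
    acc = 0.0
    for i, speed in enumerate(speeds):
        acc += speed
        # Rounding the cumulative share (not each share separately) keeps the
        # chunks contiguous; the last chunk always ends at n_iters.
        if i == len(speeds) - 1:
            end = n_iters
        else:
            end = round(n_iters * acc / total)
        bounds.append((start, end))
        start = end
    return bounds


# Two threads on big cores (assumed 2x faster) and two on small cores.
chunks = asymmetric_split(1000, [2.0, 2.0, 1.0, 1.0])
sizes = [end - start for start, end in chunks]
```

With an even `static` split each thread would receive 250 iterations and the threads on small cores would finish last; in this sketch the threads on big cores receive about twice as many iterations as those on small cores, so all threads finish at roughly the same time if the assumed speed ratio holds.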
Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
Keisuke Kamahori, Yile Gu, Kan Zhu, Baris Kasikci
Feb 13 2024 cs.LG cs.AI cs.OS arXiv:2402.07033v1

@misc{2402.07033, author = {Keisuke Kamahori and Yile Gu and Kan Zhu and Baris Kasikci}, title = {{F}iddler: {CPU}-{GPU} {O}rchestration for {F}ast {I}nference of {M}ixture-of-{E}xperts {M}odels}, year = {2024}, eprint = {2402.07033}, note = {arXiv:2402.07033v1} }
Large Language Models (LLMs) based on Mixture-of-Experts (MoE) architecture are showing promising performance on various tasks. However, running them on resource-constrained settings, where GPU memory resources are not abundant, is challenging due to huge model sizes. Existing systems that offload model weights to CPU memory suffer from the significant overhead of frequently moving data between CPU and GPU. In this paper, we propose Fiddler, a resource-efficient inference engine with CPU-GPU orchestration for MoE models. The key idea of Fiddler is to use the computation ability of the CPU to minimize the data movement between the CPU and GPU. Our evaluation shows that Fiddler can run the uncompressed Mixtral-8x7B model, which exceeds 90GB in parameters, to generate over $3$ tokens per second on a single GPU with 24GB memory, showing an order of magnitude improvement over existing methods. The code of Fiddler is publicly available at https://github.com/efeslab/fiddler.
Age-Memory Trade-off in Read-Copy-Update
Vishakha Ramani, Jiachen Chen, Roy D. Yates
Feb 13 2024 cs.IT cs.OS math.IT arXiv:2402.06860v1

@misc{2402.06860, author = {Vishakha Ramani and Jiachen Chen and Roy D.~Yates}, title = {{A}ge-{M}emory {T}rade-off in {R}ead-{C}opy-{U}pdate}, year = {2024}, eprint = {2402.06860}, note = {arXiv:2402.06860v1} }
In the realm of shared memory systems, the challenge of reader-writer synchronization is closely coupled with the potential for readers to access outdated updates. Read-Copy-Update (RCU) is a synchronization primitive that allows for concurrent and non-blocking read access to fresh data. This is achieved through the creation of updated data copies, with each prior version retained until all associated read-locks are released. Given the principle that frequent updating keeps information fresh, the concern is whether we accumulate an infinite number of update copies, leading to excessively large memory usage. This paper analyzes trade-offs between memory usage and update age within real-time status updating systems, focusing specifically on RCU. The analysis demonstrates that with finite read time and read request rate, the average number of updates within the system remains bounded.
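The retention rule described in this abstract (each prior version is kept until all associated read-locks are released) can be modelled with a small single-threaded sketch. The class `RcuCell` and its methods are hypothetical and only count readers per version; this is neither a thread-safe RCU implementation nor the paper's analytical model.

```python
class RcuCell:
    """Single-threaded model of RCU version retention (not thread-safe)."""

    def __init__(self, value):
        # Each version is [value, active_readers]; the last one is current.
        self.versions = [[value, 0]]

    def read_lock(self):
        # Readers never wait: they pin whatever version is current right now.
        version = self.versions[-1]
        version[1] += 1
        return version

    def read_unlock(self, version):
        version[1] -= 1
        self._reclaim()

    def update(self, value):
        # The writer publishes a fresh copy instead of modifying in place.
        self.versions.append([value, 0])
        self._reclaim()

    def _reclaim(self):
        # Keep the current version and any old version still held by a reader.
        current = self.versions[-1]
        self.versions = [v for v in self.versions if v is current or v[1] > 0]


cell = RcuCell(1)
reader = cell.read_lock()       # a slow reader pins version 1
cell.update(2)
cell.update(3)                  # version 2 has no readers and is reclaimed
held_during_read = len(cell.versions)
cell.read_unlock(reader)        # version 1 can now be reclaimed as well
held_after_read = len(cell.versions)
```

In this model the number of retained copies grows only while readers hold stale versions, which is the memory side of the trade-off the paper analyzes.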
Towards Deterministic End-to-end Latency for Medical AI Systems in NVIDIA Holoscan
Soham Sinha, Shekhar Dwivedi, Mahdi Azizian
Feb 08 2024 cs.SE cs.AI cs.LG cs.OS arXiv:2402.04466v1

@misc{2402.04466, author = {Soham Sinha and Shekhar Dwivedi and Mahdi Azizian}, title = {{T}owards {D}eterministic {E}nd-to-end {L}atency for {M}edical {AI} {S}ystems in {NVIDIA} {H}oloscan}, year = {2024}, eprint = {2402.04466}, note = {arXiv:2402.04466v1} }
The introduction of AI and ML technologies into medical devices has revolutionized healthcare diagnostics and treatments. Medical device manufacturers are keen to maximize the advantages afforded by AI and ML by consolidating multiple applications onto a single platform. However, concurrent execution of several AI applications, each with its own visualization components, leads to unpredictable end-to-end latency, primarily due to GPU resource contentions. To mitigate this, manufacturers typically deploy separate workstations for distinct AI applications, thereby increasing financial, energy, and maintenance costs. This paper addresses these challenges within the context of NVIDIA's Holoscan platform, a real-time AI system for streaming sensor data and images. We propose a system design optimized for heterogeneous GPU workloads, encompassing both compute and graphics tasks. Our design leverages CUDA MPS for spatial partitioning of compute workloads and isolates compute and graphics processing onto separate GPUs. We demonstrate significant performance improvements across various end-to-end latency determinism metrics through empirical evaluation with real-world Holoscan medical device applications. For instance, the proposed design reduces maximum latency by 21-30% and improves latency distribution flatness by 17-25% for up to five concurrent endoscopy tool tracking AI applications, compared to a single-GPU baseline. Against a default multi-GPU setup, our optimizations decrease maximum latency by 35% for up to six concurrent applications by improving GPU utilization by 42%. This paper provides clear design insights for AI applications in the edge-computing domain including medical systems, where performance predictability of concurrent and heterogeneous GPU workloads is a critical requirement.
Design and implementation of a real-time onboard system for a stratospheric balloon mission using commercial off-the-self components and a model-based approach
Angel-Grover Perez-Munoz, Jose-Carlos Gamazo-Real, David Gonzalez-Barcena, Juan Zamorano
Feb 07 2024 eess.SY astro-ph.IM cs.AR cs.OS cs.SY arXiv:2402.04079v1

@misc{2402.04079, author = {Angel-Grover Perez-Munoz and Jose-Carlos Gamazo-Real and David Gonzalez-Barcena and Juan Zamorano}, title = {{D}esign and implementation of a real-time onboard system for a stratospheric balloon mission using commercial off-the-self components and a model-based approach}, year = {2024}, eprint = {2402.04079}, howpublished = {Computers and Electrical Engineering, vol. 111, Part B, no. 108953, pp. 1-22, 2023, ISSN 0045-7906}, doi = {10.1016/j.compeleceng.2023.108953}, note = {arXiv:2402.04079v1} }
PDF
Stratospheric balloons have emerged as an affordable and flexible alternative to traditional spacecraft, as they are implemented using commercial off-the-shelf (COTS) equipment without following strict methodologies. HERCCULES is a stratospheric balloon mission that aims to characterize the convective heat and radiative environment in the stratosphere. The purpose of this article is to present the HERCCULES onboard software (OBSW), whose design and complexity are comparable to those of satellite systems, since it must control about sixty pieces of COTS equipment using a single Raspberry Pi 4B as the onboard computer while meeting real-time requirements. Compared to similar systems, the novel contribution is that the OBSW is developed following model-based and component-based approaches, using the TASTE toolchain from the European Space Agency (ESA) for automatic code generation. In addition, the OBSW is verified and validated following the ESA standards, and the results obtained demonstrate the suitability and efficiency of the solution and the selected methodologies.
bypass4netns: Accelerating TCP/IP Communications in Rootless Containers
Naoki Matsumoto, Akihiro Suda
Feb 02 2024 cs.NI cs.OS arXiv:2402.00365v1

@misc{2402.00365, author = {Naoki Matsumoto and Akihiro Suda}, title = {bypass4netns: {A}ccelerating {TCP}/{IP} {C}ommunications in {R}ootless {C}ontainers}, year = {2024}, eprint = {2402.00365}, note = {arXiv:2402.00365v1} }
PDF
"Rootless containers" is the concept of running container runtimes and containers entirely without root privileges. It protects the host environment from attackers exploiting container runtime vulnerabilities. However, when rootless containers communicate with external endpoints, the network performance is low compared to rootful containers because of the overhead of rootless networking components. In this paper, we propose bypass4netns, which accelerates TCP/IP communications in rootless containers by bypassing slow networking components. bypass4netns uses sockets allocated on the host. It switches sockets in containers to the host's sockets by intercepting syscalls and injecting the file descriptors using Seccomp. Our method with Seccomp can handle statically linked applications that previous works could not handle. We also propose high-performance rootless multi-node communication. We confirmed that rootless containers with bypass4netns achieve more than 30x faster throughput than rootless containers without it. In addition, we evaluated performance with applications, and the results showed large improvements for some applications.
Beyond Control: Exploring Novel File System Objects for Data-Only Attacks on Linux Systems
Jinmeng Zhou, Jiayi Hu, Ziyue Pan, Jiaxun Zhu, Wenbo Shen, Guoren Li, Zhiyun Qian
Feb 01 2024 cs.CR cs.OS arXiv:2401.17618v2

@misc{2401.17618, author = {Jinmeng Zhou and Jiayi Hu and Ziyue Pan and Jiaxun Zhu and Wenbo Shen and Guoren Li and Zhiyun Qian}, title = {{B}eyond {C}ontrol: {E}xploring {N}ovel {F}ile {S}ystem {O}bjects for {D}ata-{O}nly {A}ttacks on {L}inux {S}ystems}, year = {2024}, eprint = {2401.17618}, note = {arXiv:2401.17618v2} }
PDF
The widespread deployment of control-flow integrity has propelled non-control data attacks into the mainstream. In the domain of OS kernel exploits, by corrupting critical non-control data, local attackers can directly gain root access or privilege escalation without hijacking the control flow. As a result, OS kernels have been restricting the availability of such non-control data. This forces attackers to continue to search for more exploitable non-control data in OS kernels. However, discovering unknown non-control data can be daunting because they are often tied heavily to semantics and lack universal patterns. We make two contributions in this paper: (1) discover critical non-control objects in the file subsystem and (2) analyze their exploitability. This work represents the first study, with minimal domain knowledge, to semi-automatically discover and evaluate exploitable non-control data within the file subsystem of the Linux kernel. Our solution utilizes a custom analysis and testing framework that statically and dynamically identifies promising candidate objects. Furthermore, we categorize these discovered objects into types that are suitable for various exploit strategies, including a novel strategy necessary to overcome the defense that isolates many of these objects. These objects have the advantage of being exploitable without requiring KASLR, thus making the exploits simpler and more reliable. We use 18 real-world CVEs to evaluate the exploitability of the file system objects using various exploit strategies. We develop 10 end-to-end exploits using a subset of CVEs against the kernel with all state-of-the-art mitigations enabled.
numaPTE: Managing Page-Tables and TLBs on NUMA Systems
Bin Gao, Qingxuan Kang, Hao-Wei Tee, Kyle Timothy Ng Chu, Alireza Sanaee, Djordje Jevdjic
Jan 30 2024 cs.OS arXiv:2401.15558v1

@misc{2401.15558, author = {Bin Gao and Qingxuan Kang and Hao-Wei Tee and Kyle Timothy Ng Chu and Alireza Sanaee and Djordje Jevdjic}, title = {numa{PTE}: {M}anaging {P}age-{T}ables and {TLB}s on {NUMA} {S}ystems}, year = {2024}, eprint = {2401.15558}, note = {arXiv:2401.15558v1} }
PDF
Memory management operations that modify page-tables, typically performed during memory allocation/deallocation, are infamous for their poor performance in highly threaded applications, largely due to process-wide TLB shootdowns that the OS must issue due to the lack of hardware support for TLB coherence. We study these operations in NUMA settings, where we observe up to 40x overhead for basic operations such as munmap or mprotect. The overhead further increases if page-table replication is used, where complete coherent copies of the page-tables are maintained across all NUMA nodes. While eager system-wide replication is extremely effective at localizing page-table reads during address translation, we find that it creates additional penalties upon any page-table changes due to the need to keep all replicas coherent. In this paper, we propose a novel page-table management mechanism, called numaPTE, to enable transparent, on-demand, and partial page-table replication across NUMA nodes in order to perform address translation locally, while avoiding the overheads and scalability issues of system-wide full page-table replication. We then show that numaPTE's precise knowledge of page-table sharers can be leveraged to significantly reduce the number of TLB shootdowns issued upon any memory-management operation. As a result, numaPTE not only avoids replication-related slowdowns, but also provides significant speedup over the baseline on memory allocation/deallocation and access control operations. We implement numaPTE in Linux on x86_64, evaluate it on 4- and 8-socket systems, and show that numaPTE achieves the full benefits of eager page-table replication on a wide range of applications, while also achieving a 12% and 36% runtime improvement on Webserver and Memcached respectively due to a significant reduction in TLB shootdowns.
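The idea of using sharer knowledge to narrow TLB shootdowns can be illustrated with a toy model. This is only a sketch under stated assumptions, not the paper's implementation: the class and method names (`ToyNumaPTE`, `translate`, `shootdown_targets_precise`) are hypothetical, replication is modeled as a set of node ids per page-table page, and it is assumed that only cores on nodes holding a replica can have cached translations from that page.

```python
from dataclasses import dataclass, field


@dataclass
class PageTablePage:
    """One page-table page plus the NUMA nodes that hold a replica of it."""
    sharers: set = field(default_factory=set)


class ToyNumaPTE:
    """Hypothetical model: lazy, per-node replication with sharer tracking."""

    def __init__(self, cores_per_node, num_nodes):
        self.cores_per_node = cores_per_node
        self.num_nodes = num_nodes
        self.tables = {}  # region id -> PageTablePage

    def translate(self, region, node):
        """A page walk from `node` replicates the page-table page on demand."""
        page = self.tables.setdefault(region, PageTablePage())
        page.sharers.add(node)

    def shootdown_targets_broadcast(self):
        """Baseline assumption: every core of the process is interrupted."""
        return self.cores_per_node * self.num_nodes

    def shootdown_targets_precise(self, region):
        """Only cores on nodes that replicated this page are interrupted."""
        page = self.tables.get(region)
        return 0 if page is None else len(page.sharers) * self.cores_per_node


mm = ToyNumaPTE(cores_per_node=16, num_nodes=8)
mm.translate("heap", node=0)
mm.translate("heap", node=3)
print(mm.shootdown_targets_broadcast(), mm.shootdown_targets_precise("heap"))
```

In this toy setting, a region touched from two of eight nodes needs interrupts on two nodes' cores rather than on all of them; real savings depend on the workload and on details the abstract does not give.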
Characterizing Network Requirements for GPU API Remoting in AI Applications
Tianxia Wang, Zhuofu Chen, Xingda Wei, Jinyu Gu, Rong Chen, Haibo Chen
Jan 25 2024 cs.OS cs.NI arXiv:2401.13354v1

@misc{2401.13354, author = {Tianxia Wang and Zhuofu Chen and Xingda Wei and Jinyu Gu and Rong Chen and Haibo Chen}, title = {{C}haracterizing {N}etwork {R}equirements for {GPU} {API} {R}emoting in {AI} {A}pplications}, year = {2024}, eprint = {2401.13354}, note = {arXiv:2401.13354v1} }
PDF
GPU remoting is a promising technique for supporting AI applications. Networking plays a key role in enabling remoting. However, for efficient remoting, the network requirements in terms of latency and bandwidth are unknown. In this paper, we take a GPU-centric approach to derive the minimum latency and bandwidth requirements for GPU remoting, while ensuring no (or little) performance degradation for AI applications. Our study, including a theoretical model, demonstrates that, with careful remoting design, unmodified AI applications can run on the remoting setup using commodity networking hardware without any overhead or even with better performance, with low network demands.
Nomad: Non-Exclusive Memory Tiering via Transactional Page Migration
Lingfeng Xiang, Zhen Lin, Weishu Deng, Hui Lu, Jia Rao, Yifan Yuan, Ren Wang
Jan 25 2024 cs.OS arXiv:2401.13154v2

@misc{2401.13154, author = {Lingfeng Xiang and Zhen Lin and Weishu Deng and Hui Lu and Jia Rao and Yifan Yuan and Ren Wang}, title = {{N}omad: {N}on-{E}xclusive {M}emory {T}iering via {T}ransactional {P}age {M}igration}, year = {2024}, eprint = {2401.13154}, note = {arXiv:2401.13154v2} }
PDF
With the advent of byte-addressable memory devices, such as CXL memory, persistent memory, and storage-class memory, tiered memory systems have become a reality. Page migration is the de facto method within operating systems for managing tiered memory. It aims to bring hot data whenever possible into fast memory to optimize the performance of data accesses while using slow memory to accommodate data spilled from fast memory. While the existing research has demonstrated the effectiveness of various optimizations on page migration, it falls short of addressing a fundamental question: Is exclusive memory tiering, in which a page is either present in fast memory or slow memory, but not both simultaneously, the optimal strategy for tiered memory management? We demonstrate that page migration-based exclusive memory tiering suffers significant performance degradation when fast memory is under pressure. In this paper, we propose non-exclusive memory tiering, a page management strategy that retains a copy of pages recently promoted from slow memory to fast memory to mitigate memory thrashing. To enable non-exclusive memory tiering, we develop Nomad, a new page management mechanism for Linux that features transactional page migration and page shadowing. Nomad helps remove page migration off the critical path of program execution and makes migration completely asynchronous. Evaluations with carefully crafted micro-benchmarks and real-world applications show that Nomad is able to achieve up to 6x performance improvement over the state-of-the-art transparent page placement (TPP) approach in Linux when under memory pressure. We also compare Nomad with a recently proposed hardware-assisted, access sampling-based page migration approach and demonstrate Nomad's strengths and potential weaknesses in various scenarios.
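A minimal sketch of how page shadowing and a transactional copy might fit together is given below. It is an illustration under assumptions, not Nomad's actual mechanism: all names (`Page`, `ToyTiering`, `transactional_copy`, `racing_write`) are hypothetical, a per-page version counter stands in for a dirty bit, and a "concurrent" write is simulated by a callback that runs in the middle of the copy.

```python
class Page:
    def __init__(self, data):
        self.data = data
        self.version = 0  # bumped on every write; stands in for a dirty bit

    def write(self, data):
        self.data = data
        self.version += 1


def transactional_copy(page, racing_write=None):
    """Copy without blocking writers; commit only if no write raced the copy."""
    start = page.version
    snapshot = page.data
    if racing_write is not None:
        racing_write(page)  # simulate a write that lands during the copy
    return snapshot if page.version == start else None


class ToyTiering:
    def __init__(self):
        self.fast, self.slow, self.shadow = {}, {}, {}
        self.copies_to_slow = 0  # counts demotions that really had to copy

    def promote(self, pid, racing_write=None):
        snapshot = transactional_copy(self.slow[pid], racing_write)
        if snapshot is None:
            return False  # aborted: the page stays in slow memory, retry later
        del self.slow[pid]
        self.fast[pid] = Page(snapshot)
        # Non-exclusive tiering: keep the slow-memory copy as a shadow.
        self.shadow[pid] = (snapshot, self.fast[pid].version)
        return True

    def demote(self, pid):
        page = self.fast.pop(pid)
        shadow_data, shadow_version = self.shadow.pop(pid)
        if shadow_version == page.version:
            self.slow[pid] = Page(shadow_data)  # clean page: reuse the shadow
        else:
            self.copies_to_slow += 1  # dirtied since promotion: copy back
            self.slow[pid] = Page(page.data)
```

Under these assumptions, demoting a page that was never written after promotion costs no copy, which is the thrashing case the abstract highlights; a page written while being promoted simply stays where it was.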
SyzRetrospector: A Large-Scale Retrospective Study of Syzbot
Joseph Bursey, Ardalan Amiri Sani, Zhiyun Qian
Jan 23 2024 cs.SE cs.CR cs.OS arXiv:2401.11642v1

@misc{2401.11642, author = {Joseph Bursey and Ardalan Amiri Sani and Zhiyun Qian}, title = {{S}yz{R}etrospector: {A} {L}arge-{S}cale {R}etrospective {S}tudy of {S}yzbot}, year = {2024}, eprint = {2401.11642}, note = {arXiv:2401.11642v1} }
PDF
Over the past 6 years, Syzbot has fuzzed the Linux kernel day and night to report over 5570 bugs, of which 4604 have been patched [11]. While this is impressive, we have found the average time to find a bug is over 405 days. Moreover, we have found that current metrics commonly used, such as time-to-find and number of bugs found, are inaccurate in evaluating Syzbot since bugs often spend the majority of their lives hidden from the fuzzer. In this paper, we set out to better understand and quantify Syzbot's performance and improvement in finding bugs. Our tool, SyzRetrospector, takes a different approach to evaluating Syzbot by finding the earliest that Syzbot was capable of finding a bug, and why that bug was revealed. We use SyzRetrospector on a large scale to analyze 559 bugs and find that bugs are hidden for an average of 331.17 days before Syzbot is even able to find them. We further present findings on the behaviors of revealing factors, how some bugs are harder to reveal than others, the trends in delays over the past 6 years, and how bug location relates to delays. We also provide key takeaways for improving Syzbot's delays.
Herding LLaMaS: Using LLMs as an OS Module
Aditya K Kamath, Sujay Yadalam
Jan 18 2024 cs.OS cs.LG arXiv:2401.08908v1

@misc{2401.08908, author = {Aditya K Kamath and Sujay Yadalam}, title = {{H}erding {LL}a{M}a{S}: {U}sing {LLM}s as an {OS} {M}odule}, year = {2024}, eprint = {2401.08908}, note = {arXiv:2401.08908v1} }
PDF
Computer systems are becoming increasingly heterogeneous with the emergence of new memory technologies and compute devices. GPUs alongside CPUs have become commonplace and CXL is poised to be a mainstay of cloud systems. The operating system is responsible for managing these hardware resources, requiring modification every time a new device is released. Years of research and development are sunk into tuning the OS for high performance with each new heterogeneous device. With the recent explosion in memory technologies and domain-specific accelerators, it would be beneficial to have an OS that could provide high performance for new devices without significant effort. We propose LLaMaS, which can adapt to new devices easily. LLaMaS uses Large Language Models (LLMs) to extract the useful features of new devices from their textual description and uses these features to make operating system decisions at runtime. Adding support to LLaMaS for a new device is as simple as describing the system and new device properties in plaintext. LLaMaS reduces the burden on system administrators to enable easy integration of new devices into production systems. Preliminary evaluation using ChatGPT shows that LLMs are capable of extracting device features from text and making correct OS decisions based on those features.
File System Aging
Alex Conway, Ainesh Bakshi, Arghya Bhattacharya, Rory Bennett, Yizheng Jiao, Eric Knorr, Yang Zhan, Michael A. Bender, William Jannen, Rob Johnson, Bradley C. Kuszmaul, Donald E. Porter, Jun Yuan, Martin Farach-Colton
Jan 18 2024 cs.OS arXiv:2401.08858v1

@misc{2401.08858, author = {Alex Conway and Ainesh Bakshi and Arghya Bhattacharya and Rory Bennett and Yizheng Jiao and Eric Knorr and Yang Zhan and Michael A.~Bender and William Jannen and Rob Johnson and Bradley C.~Kuszmaul and Donald E.~Porter and Jun Yuan and Martin Farach-Colton}, title = {{F}ile {S}ystem {A}ging}, year = {2024}, eprint = {2401.08858}, note = {arXiv:2401.08858v1} }
PDF
File systems must allocate space for files without knowing what will be added or removed in the future. Over the life of a file system, this may cause suboptimal file placement decisions that eventually lead to slower performance, or aging. Conventional wisdom suggests that file system aging is a solved problem in the common case; heuristics to avoid aging, such as colocating related files and data blocks, are effective until a storage device fills up, at which point space pressure exacerbates fragmentation-based aging. However, this article describes both realistic and synthetic workloads that can cause these heuristics to fail, inducing large performance declines due to aging, even when the storage device is nearly empty. We argue that these slowdowns are caused by poor layout. We demonstrate a correlation between the read performance of a directory scan and the locality within a file system's access patterns, using a dynamic layout score. We complement these results with microbenchmarks that show that space pressure can cause a substantial amount of inter-file and intra-file fragmentation. However, our results suggest that the effect of free-space fragmentation on read performance is best described as accelerating the file system aging process. The effect on write performance is non-existent in some cases, and, in most cases, an order of magnitude smaller than the read degradation from fragmentation caused by normal usage. In short, many file systems are exquisitely prone to read aging after a variety of write patterns. We show, however, that aging is not inevitable. BetrFS, a file system based on write-optimized dictionaries, exhibits almost no aging in our experiments. We present a framework for understanding and predicting aging, and identify the key features of BetrFS that avoid aging.
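The abstract does not define the dynamic layout score. One plausible reading, used here purely as an assumption, is the fraction of consecutive block requests in a trace that land on physically adjacent blocks. The function name and the trace format (a list of block numbers in request order) are hypothetical.

```python
def dynamic_layout_score(block_trace):
    """Fraction of requests that are physically adjacent to the previous one.

    Assumed definition: 1.0 means a perfectly sequential scan, 0.0 means
    every request requires a seek. Not taken from the paper's text.
    """
    if len(block_trace) < 2:
        return 1.0
    sequential = sum(
        1 for prev, cur in zip(block_trace, block_trace[1:]) if cur == prev + 1
    )
    return sequential / (len(block_trace) - 1)


print(dynamic_layout_score([10, 11, 12, 13]))   # fresh, contiguous layout
print(dynamic_layout_score([10, 50, 11, 51]))   # interleaved, aged layout
```

A score like this depends on the access pattern as well as the on-disk layout, which is why the same file system can look healthy under one workload and aged under another.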
Dynamic Voltage and Frequency Scaling for Intermittent Computing
Andrea Maioli, Kevin A. Quinones, Saad Ahmed, Muhammad H. Alizai, Luca Mottola
Jan 18 2024 cs.AR cs.OS arXiv:2401.08710v1

@misc{2401.08710, author = {Andrea Maioli and Kevin A.~Quinones and Saad Ahmed and Muhammad H.~Alizai and Luca Mottola}, title = {{D}ynamic {V}oltage and {F}requency {S}caling for {I}ntermittent {C}omputing}, year = {2024}, eprint = {2401.08710}, note = {arXiv:2401.08710v1} }
PDF
We present hardware/software techniques to intelligently regulate supply voltage and clock frequency of intermittently-computing devices. These devices rely on ambient energy harvesting to power their operation and small capacitors as energy buffers. Statically setting their clock frequency fails to capture the unique relations these devices expose between capacitor voltage, energy efficiency at a given operating frequency, and the corresponding operating range. Existing dynamic voltage and frequency scaling techniques are also largely inapplicable due to extreme energy scarcity and peculiar hardware features. We introduce two hardware/software co-designs that accommodate the distinct hardware features and function within a constrained energy envelope, offering varied trade-offs and functionalities. Our experimental evaluation combines tests on custom-manufactured hardware and detailed emulation experiments. The data gathered indicate that our approaches result in up to 3.75x reduced energy consumption and 12x swifter execution times compared to the considered baselines, all while utilizing smaller capacitors to accomplish identical workloads.
Attention, Distillation, and Tabularization: Towards Practical Neural Network-Based Prefetching
Pengmiao Zhang, Neelesh Gupta, Rajgopal Kannan, Viktor K. Prasanna
Jan 15 2024 cs.NE cs.AR cs.LG cs.OS arXiv:2401.06362v3

@misc{2401.06362, author = {Pengmiao Zhang and Neelesh Gupta and Rajgopal Kannan and Viktor K.~Prasanna}, title = {{A}ttention, {D}istillation, and {T}abularization: {T}owards {P}ractical {N}eural {N}etwork-{B}ased {P}refetching}, year = {2024}, eprint = {2401.06362}, note = {arXiv:2401.06362v3} }
PDF
Attention-based Neural Networks (NN) have demonstrated their effectiveness in accurate memory access prediction, an essential step in data prefetching. However, the substantial computational overheads associated with these models result in high inference latency, limiting their feasibility as practical prefetchers. To close the gap, we propose a new approach based on tabularization that significantly reduces model complexity and inference latency without sacrificing prediction accuracy. Our novel tabularization methodology takes as input a distilled, yet highly accurate attention-based model for memory access prediction and efficiently converts its expensive matrix multiplications into a hierarchy of fast table lookups. As an exemplar of the above approach, we develop DART, a prefetcher comprised of a simple hierarchy of tables. With a modest 0.09 drop in F1-score, DART reduces 99.99% of arithmetic operations from the large attention-based model and 91.83% from the distilled model. DART accelerates the large model inference by 170x and the distilled model by 9.4x. DART has comparable latency and storage costs as state-of-the-art rule-based prefetcher BO but surpasses it by 6.1% in IPC improvement. DART outperforms state-of-the-art NN-based prefetchers TransFetch by 33.1% and Voyager by 37.2% in terms of IPC improvement, primarily due to its low prefetching latency.
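To make "converting matrix multiplications into table lookups" concrete, the sketch below uses product quantization, a common technique for this purpose; whether DART uses exactly this scheme is not stated in the abstract, so treat it as a swapped-in illustration. The prototypes, weights, and function names are hypothetical. Each input sub-vector is snapped to its nearest prototype, and the dot products of every prototype with the weight columns are precomputed offline.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_tables(prototypes, weight_columns, sub_dim):
    """tables[s][k][j] = prototype k of subspace s . slice s of column j."""
    tables = []
    for s, protos in enumerate(prototypes):
        lo = s * sub_dim
        tables.append(
            [[dot(p, col[lo:lo + sub_dim]) for col in weight_columns] for p in protos]
        )
    return tables


def encode(x, prototypes, sub_dim):
    """Index of the nearest prototype (squared distance) for each sub-vector."""
    codes = []
    for s, protos in enumerate(prototypes):
        sub = x[s * sub_dim:(s + 1) * sub_dim]
        codes.append(
            min(
                range(len(protos)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(sub, protos[k])),
            )
        )
    return codes


def lookup_matmul(x, prototypes, tables, sub_dim):
    """Approximate x @ W with one table lookup and one add per subspace."""
    codes = encode(x, prototypes, sub_dim)
    n_out = len(tables[0][0])
    return [sum(tables[s][codes[s]][j] for s in range(len(codes))) for j in range(n_out)]


# Hypothetical 4-dimensional input split into two 2-dimensional subspaces.
prototypes = [[[0, 0], [1, 1]], [[0, 1], [1, 0]]]
weight_columns = [[1, 2, 3, 4], [0, 1, 0, 1]]
tables = build_tables(prototypes, weight_columns, sub_dim=2)
print(lookup_matmul([1, 1, 0, 1], prototypes, tables, sub_dim=2))
```

The result is exact only when the input lies on the prototypes, as in this example; otherwise the quantization error is the price paid for removing the multiplications at inference time.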
When eBPF Meets Machine Learning: On-the-fly OS Kernel Compartmentalization
Zicheng Wang, Tiejin Chen, Qinrun Dai, Yueqi Chen, Hua Wei, Qingkai Zeng
Jan 12 2024 cs.OS cs.CR cs.LG arXiv:2401.05641v1

@misc{2401.05641, author = {Zicheng Wang and Tiejin Chen and Qinrun Dai and Yueqi Chen and Hua Wei and Qingkai Zeng}, title = {{W}hen e{BPF} {M}eets {M}achine {L}earning: {O}n-the-fly {OS} {K}ernel {C}ompartmentalization}, year = {2024}, eprint = {2401.05641}, note = {arXiv:2401.05641v1} }
PDF
Compartmentalization effectively prevents initial corruption from turning into a successful attack. This paper presents O2C, a pioneering system designed to enforce OS kernel compartmentalization on the fly. It not only provides immediate remediation for sudden threats but also maintains consistent system availability through the enforcement process. O2C is empowered by the newest advancements of the eBPF ecosystem, which allow eBPF programs that perform enforcement actions to be instrumented into the kernel at runtime. O2C takes the lead in embedding a machine learning model into eBPF programs, addressing unique challenges in on-the-fly compartmentalization. Our comprehensive evaluation shows that O2C effectively confines damage within the compartment. Further, we validate that a decision tree is optimally suited for O2C owing to its advantages in processing tabular data, its explainable nature, and its compliance with the eBPF ecosystem. Last but not least, O2C is lightweight, showing negligible overhead and excellent scalability system-wide.
Characterizing Physical Memory Fragmentation
Mark Mansi, Michael M. Swift
Jan 09 2024 cs.OS cs.PF arXiv:2401.03523v1

@misc{2401.03523, author = {Mark Mansi and Michael M.~Swift}, title = {{C}haracterizing {P}hysical {M}emory {F}ragmentation}, year = {2024}, eprint = {2401.03523}, note = {arXiv:2401.03523v1} }
PDF
External fragmentation of physical memory occurs when adjacent differently sized regions of allocated physical memory are freed at different times, causing free memory to be physically discontiguous. It can significantly degrade system performance and efficiency, for example by reducing the ability to use huge pages, a critical optimization on modern large-memory systems. For decades, system developers have sought to avoid and mitigate fragmentation, but few prior studies quantify and characterize it in production settings. Moreover, prior work often artificially fragments physical memory to create more realistic performance evaluations, but the fragmentation methodologies used are ad hoc and unvalidated. Out of 13 papers, we found 11 different methodologies, some of which were subsequently found inadequate. The importance of addressing fragmentation necessitates a validated and principled methodology. Our work fills these gaps in knowledge and methodology. We conduct a study of memory fragmentation in production by observing 248 machines in the Computer Sciences Department at University of Wisconsin - Madison for a week. We identify six key memory usage patterns, and find that Linux's file cache and page reclamation systems are major contributors to fragmentation because they often obliviously break up contiguous memory. Finally, we create andúril, a tool to artificially fragment memory during experimental research evaluations. While andúril ultimately fails as a scientific tool, we discuss its design ideas, merits, and failings in hope that they may inspire future research.
RAID Organizations for Improved Reliability and Performance: A Not Entirely Unbiased Tutorial (1st revision)
Alexander Thomasian
Jan 09 2024 cs.DC cs.OS cs.PF arXiv:2401.03235v1

@misc{2401.03235, author = {Alexander Thomasian}, title = {{RAID} {O}rganizations for {I}mproved {R}eliability and {P}erformance: {A} {N}ot {E}ntirely {U}nbiased {T}utorial (1st revision)}, year = {2024}, eprint = {2401.03235}, note = {arXiv:2401.03235v1} }
PDF
The RAID proposal advocated replacing large disks with arrays of PC disks, but as the capacity of small disks increased 100-fold in the 1990s, the production of large disks was discontinued. Storage dependability is increased via replication or erasure coding. Cloud storage providers store multiple copies of data, obviating the need for further redundancy. Variations of RAID based on local recovery codes and partial MDS reduce recovery cost. NAND flash Solid State Disks (SSDs) have low latency and high bandwidth, are more reliable, consume less power, and have a lower TCO than Hard Disk Drives, which are more viable for hyperscalers.
Data-Driven Power Modeling and Monitoring via Hardware Performance Counters Tracking
Sergio Mazzola, Gabriele Ara, Thomas Benz, Björn Forsberg, Tommaso Cucinotta, Luca Benini
Jan 04 2024 cs.PF cs.OS arXiv:2401.01826v1

@misc{2401.01826, author = {Sergio Mazzola and Gabriele Ara and Thomas Benz and Björn Forsberg and Tommaso Cucinotta and Luca Benini}, title = {{D}ata-{D}riven {P}ower {M}odeling and {M}onitoring via {H}ardware {P}erformance {C}ounters {T}racking}, year = {2024}, eprint = {2401.01826}, note = {arXiv:2401.01826v1} }
PDF
In the current high-performance and embedded computing era, full-stack energy-centric design is paramount. Use cases require increasingly high performance at an affordable power budget, often under real-time constraints. Extreme heterogeneity and parallelism address these issues but greatly complicate online power consumption assessment, which is essential for dynamic hardware and software stack adaptations. We introduce a novel architecture-agnostic power modeling methodology with state-of-the-art accuracy, low overhead, and high responsiveness. Our methodology identifies the best Performance Monitoring Counters (PMCs) to model the power consumption of each hardware sub-system at each Dynamic Voltage and Frequency Scaling (DVFS) state. The individual linear models are combined into a complete model that effectively describes the power consumption of the whole system, achieving high accuracy and low overhead. Our evaluation reports an average estimation error of 7.5 % for power consumption and 1.3 % for energy. Furthermore, we propose Runmeter, an open-source, PMC-based monitoring framework integrated into the Linux kernel. Runmeter manages PMC samples collection and manipulation, efficiently evaluating our power models at runtime. With a time overhead of only 0.7 % in the worst case, Runmeter provides responsive and accurate power measurements directly in the kernel, which can be employed for actuation policies such as Dynamic Power Management (DPM) and power-aware task scheduling.
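The combination of per-sub-system, per-DVFS-state linear models can be sketched as follows. This is a simplified illustration, not Runmeter's code: it assumes a single counter per sub-system, ordinary least squares for the fit, and made-up sample values; the function names and the `(subsystem, freq_mhz)` keying are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_models(samples):
    """samples: {(subsystem, freq_mhz): [(counter_rate, watts), ...]}."""
    return {
        key: fit_line([rate for rate, _ in pts], [watts for _, watts in pts])
        for key, pts in samples.items()
    }


def system_power(models, state):
    """state: {subsystem: (freq_mhz, counter_rate)}; sums the linear models."""
    total = 0.0
    for subsystem, (freq, rate) in state.items():
        intercept, slope = models[(subsystem, freq)]
        total += intercept + slope * rate
    return total


# Made-up training samples, one model per (sub-system, DVFS state).
samples = {
    ("cpu", 1000): [(0, 1.0), (2, 2.0), (4, 3.0)],
    ("cpu", 2000): [(0, 2.0), (2, 4.0), (4, 6.0)],
    ("gpu", 500): [(0, 0.5), (1, 1.5), (2, 2.5)],
}
models = fit_models(samples)
print(system_power(models, {"cpu": (2000, 3), "gpu": (500, 1)}))
```

Keeping one small model per DVFS state means runtime evaluation is a lookup plus a multiply-add per sub-system, which is consistent with the low overhead the abstract reports, though the actual counter selection procedure is more involved.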
ALPC Is In Danger: ALPChecker Detects Spoofing and Blinding
Anastasiia Kropova, Igor Korkin
Jan 04 2024 cs.CR cs.OS arXiv:2401.01376v1

@misc{2401.01376, author = {Anastasiia Kropova and Igor Korkin}, title = {{ALPC} {I}s {I}n {D}anger: {ALPC}hecker {D}etects {S}poofing and {B}linding}, year = {2024}, eprint = {2401.01376}, note = {arXiv:2401.01376v1} }
PDF
The purpose of this study is to evaluate the possibility of attacking an ALPC connection in the Windows operating system through the kernel, without closing the connection and covertly from programs and the operating system, and to propose a method of protection against this type of attack. Asynchronous Local Procedure Call (ALPC) technology is used in various Windows information protection systems, including antivirus (AV) systems and Endpoint Detection and Response (EDR) systems. To conceal malicious software, attackers need to disrupt the operation of AV and EDR tools, which can be achieved by a destructive impact on the components of the ALPC technology. Examples of such attacks already exist and are covered in this paper. To counteract such new threats, information security systems must be improved, and so this research into ALPC security was conducted. The most difficult case, a Windows kernel driver attack, was considered. Three attacks on the ALPC connection were carried out, based on changing the ALPC structures in kernel memory, which led to the creation of illegitimate connections in the system and the disruption of correct connections. The ALPChecker protection tool has been developed. The tool was successfully tested on the three demonstrated attacks.
Security, extensibility, and redundancy in the Metabolic Operating System
Samuel T. King
Jan 04 2024 cs.CR cs.OS arXiv:2401.01357v1

@misc{2401.01357, author = {Samuel T.~King}, title = {{S}ecurity, extensibility, and redundancy in the {M}etabolic {O}perating {S}ystem}, year = {2024}, eprint = {2401.01357}, note = {arXiv:2401.01357v1} }
PDF
People living with Type 1 Diabetes (T1D) lose the ability to produce insulin naturally. To compensate, they inject synthetic insulin. One common way to inject insulin is through automated insulin delivery systems, which use sensors to monitor their metabolic state and an insulin pump device to adjust insulin to adapt. In this paper, we present the Metabolic Operating System, a new automated insulin delivery system that we designed from the ground up using security first principles. From an architecture perspective, we apply separation principles to simplify the core system and isolate non-critical functionality from the core closed-loop algorithm. From an algorithmic perspective, we evaluate trends in insulin technology and formulate a simple, but effective, algorithm given the state-of-the-art. From a safety perspective, we build in multiple layers of redundancy to ensure that the person using our system remains safe. Fundamentally, this paper is a paper on real-world experiences building and running an automated insulin delivery system. We report on the design iterations we make based on experiences working with one individual using our system. Our evaluation shows that an automated insulin delivery system built from the ground up using security first principles can still help manage T1D effectively. Our source code is open source and available on GitHub (link omitted).
RASP for LSASS: Preventing Mimikatz-Related Attacks
Anna Revazova, Igor Korkin
Jan 02 2024 cs.CR cs.OS arXiv:2401.00316v1

@misc{2401.00316, author = {Anna Revazova and Igor Korkin}, title = {{RASP} for {LSASS}: {P}reventing {M}imikatz-{R}elated {A}ttacks}, year = {2024}, eprint = {2401.00316}, note = {arXiv:2401.00316v1} }
PDF
The Windows authentication infrastructure relies on the Local Security Authority (LSA) system, with its integral component being lsass.exe. Regrettably, this framework is not impervious, presenting vulnerabilities that attract threat actors with malicious intent. By exploiting documented vulnerabilities sourced from the CVE database or leveraging sophisticated tools such as mimikatz, adversaries can successfully compromise user password-address information. In this comprehensive analysis, we delve into proactive measures aimed at fortifying the local authentication subsystem against potential threats. Moreover, we present empirical evidence derived from practical assessments of various defensive methodologies, including those articulated previously. This examination not only underscores the importance of proactive security measures but also assesses the practical efficacy of these strategies in real-world contexts.
Design and Implementation Considerations for a Virtual File System Using an Inode Data Structure
Qin Sun, Grace McKenzie, Guanqun Song, Ting Zhu
Dec 27 2023 cs.OS cs.CR cs.SY eess.SY arXiv:2312.15153v1

@misc{2312.15153, author = {Qin Sun and Grace McKenzie and Guanqun Song and Ting Zhu}, title = {{D}esign and {I}mplementation {C}onsiderations for a {V}irtual {F}ile {S}ystem {U}sing an {I}node {D}ata {S}tructure}, year = {2023}, eprint = {2312.15153}, note = {arXiv:2312.15153v1} }
Virtual file systems are a tool to centralize and mobilize a file system that could otherwise be complex and consist of multiple hierarchies, hard disks, and more. In this paper, we discuss the design of Unix-based file systems and how this type of file system layout using inode data structures and a disk emulator can be implemented as a single-file virtual file system in Linux. We explore the ways that virtual file systems are vulnerable to security attacks and introduce straightforward solutions that can be implemented to help prevent or mitigate the consequences of such attacks.
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen
Dec 21 2023 cs.LG cs.OS arXiv:2312.12456v1

@misc{2312.12456, author = {Yixin Song and Zeyu Mi and Haotong Xie and Haibo Chen}, title = {{P}ower{I}nfer: {F}ast {L}arge {L}anguage {M}odel {S}erving with a {C}onsumer-grade {GPU}}, year = {2023}, eprint = {2312.12456}, note = {arXiv:2312.12456v1} }
This paper introduces PowerInfer, a high-speed Large Language Model (LLM) inference engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key underlying the design of PowerInfer is exploiting the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activation. This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers. PowerInfer further integrates adaptive predictors and neuron-aware sparse operators, optimizing the efficiency of neuron activation and computational sparsity. Evaluation shows that PowerInfer attains an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy.
On a Foundation Model for Operating Systems
Divyanshu Saxena, Nihal Sharma, Donghyun Kim, Rohit Dwivedula, Jiayi Chen, Chenxi Yang, Sriram Ravula, Zichao Hu, Aditya Akella, Sebastian Angel, Joydeep Biswas, Swarat Chaudhuri, Isil Dillig, Alex Dimakis, P. Brighten Godfrey, Daehyeok Kim, Chris Rossbach, Gang Wang
Dec 14 2023 cs.OS cs.LG arXiv:2312.07813v1

@misc{2312.07813, author = {Divyanshu Saxena and Nihal Sharma and Donghyun Kim and Rohit Dwivedula and Jiayi Chen and Chenxi Yang and Sriram Ravula and Zichao Hu and Aditya Akella and Sebastian Angel and Joydeep Biswas and Swarat Chaudhuri and Isil Dillig and Alex Dimakis and P.~Brighten Godfrey and Daehyeok Kim and Chris Rossbach and Gang Wang}, title = {{O}n a {F}oundation {M}odel for {O}perating {S}ystems}, year = {2023}, eprint = {2312.07813}, note = {arXiv:2312.07813v1} }
This paper lays down the research agenda for a domain-specific foundation model for operating systems (OSes). Our case for a foundation model revolves around the observations that several OS components such as CPU, memory, and network subsystems are interrelated and that OS traces offer the ideal dataset for a foundation model to grasp the intricacies of diverse OS components and their behavior in varying environments and workloads. We discuss a wide range of possibilities that then arise, from employing foundation models as policy agents to utilizing them as generators and predictors to assist traditional OS control algorithms. Our hope is that this paper spurs further research into OS foundation models and creating the next generation of operating systems for the evolving computing landscape.
BPF-oF: Storage Function Pushdown Over the Network
Ioannis Zarkadas, Tal Zussman, Jeremy Carin, Sheng Jiang, Yuhong Zhong, Jonas Pfefferle, Hubertus Franke, Junfeng Yang, Kostis Kaffes, Ryan Stutsman, Asaf Cidon
Dec 13 2023 cs.OS arXiv:2312.06808v1

@misc{2312.06808, author = {Ioannis Zarkadas and Tal Zussman and Jeremy Carin and Sheng Jiang and Yuhong Zhong and Jonas Pfefferle and Hubertus Franke and Junfeng Yang and Kostis Kaffes and Ryan Stutsman and Asaf Cidon}, title = {{BPF}-o{F}: {S}torage {F}unction {P}ushdown {O}ver the {N}etwork}, year = {2023}, eprint = {2312.06808}, note = {arXiv:2312.06808v1} }
Storage disaggregation, wherein storage is accessed over the network, is popular because it allows applications to independently scale storage capacity and bandwidth based on dynamic application demand. However, the added network processing introduced by disaggregation can consume significant CPU resources. In many storage systems, logical storage operations (e.g., lookups, aggregations) involve a series of simple but dependent I/O access patterns. Therefore, one way to reduce the network processing overhead is to execute dependent series of I/O accesses at the remote storage server, reducing the back-and-forth communication between the storage layer and the application. We refer to this approach as \emph{remote-storage pushdown}. We present BPF-oF, a new remote-storage pushdown protocol built on top of NVMe-oF, which enables applications to safely push custom eBPF storage functions to a remote storage server. The main challenge in integrating BPF-oF with storage systems is preserving the benefits of their client-based in-memory caches. We address this challenge by designing novel caching techniques for storage pushdown, including splitting queries into separate in-memory and remote-storage phases and periodically refreshing the client cache with sampled accesses from the remote storage device. We demonstrate the utility of BPF-oF by integrating it with three storage systems, including RocksDB, a popular persistent key-value store that has no existing storage pushdown capability. We show BPF-oF provides significant speedups in all three systems when accessed over the network, for example improving RocksDB's throughput by up to 2.8$\times$ and tail latency by up to 2.6$\times$.
KEN: Kernel Extensions using Natural Language
Yusheng Zheng, Yiwei Yang, Maolin Chen, Andrew Quinn
Dec 12 2023 cs.AI cs.OS arXiv:2312.05531v1

@misc{2312.05531, author = {Yusheng Zheng and Yiwei Yang and Maolin Chen and Andrew Quinn}, title = {{KEN}: {K}ernel {E}xtensions using {N}atural {L}anguage}, year = {2023}, eprint = {2312.05531}, note = {arXiv:2312.05531v1} }
The ability to modify and extend an operating system is an important feature for improving a system's security, reliability, and performance. The extended Berkeley Packet Filters (eBPF) ecosystem has emerged as the standard mechanism for extending the Linux kernel and has recently been ported to Windows. eBPF programs inject new logic into the kernel that the system will execute before or after existing logic. While the eBPF ecosystem provides a flexible mechanism for kernel extension, it is difficult for developers to write eBPF programs today. An eBPF developer must have deep knowledge of the internals of the operating system to determine where to place logic and cope with programming limitations on the control flow and data accesses of their eBPF program enforced by the eBPF verifier. This paper presents KEN, an alternative framework that alleviates the difficulty of writing an eBPF program by allowing Kernel Extensions to be written in Natural language. KEN uses recent advances in large language models (LLMs) to synthesize an eBPF program given a user's English language prompt. To ensure that the LLM's output is semantically equivalent to the user's prompt, KEN employs a combination of LLM-empowered program comprehension, symbolic execution, and a series of feedback loops. KEN's key novelty is the combination of these techniques. In particular, the system uses symbolic execution in a novel structure that allows it to combine the results of program synthesis and program comprehension and build on the recent success that LLMs have shown for each of these tasks individually. To evaluate KEN, we developed a new corpus of natural language prompts for eBPF programs. We show that KEN produces correct eBPF programs on 80% of the prompts, which is an improvement of a factor of 2.67 compared to an LLM-empowered program synthesis baseline.
SYSFLOW: Efficient Execution Platform for IoT Devices
Jun Lu, Zhenya Ma, Yinggang Gao, Ju Ren, Yaoxue Zhang
Dec 11 2023 cs.NI cs.OS arXiv:2312.04871v1

@misc{2312.04871, author = {Jun Lu and Zhenya Ma and Yinggang Gao and Ju Ren and Yaoxue Zhang}, title = {{SYSFLOW}: {E}fficient {E}xecution {P}latform for {I}o{T} {D}evices}, year = {2023}, eprint = {2312.04871}, note = {arXiv:2312.04871v1} }
Traditional executable delivery models pose challenges for IoT devices with limited storage, necessitating the download of complete executables and dependencies. Network solutions like NFS, designed for data files, encounter high IO overhead for irregular access patterns. This paper introduces SYSFLOW, a lightweight network-based executable delivery system for IoT. SYSFLOW delivers on-demand, redirecting local disk IO to the server through optimized network IO. To optimize cache hit rates, SYSFLOW employs server-side action-based prefetching, reducing latency by 45.1% to 75.8% compared to native Linux filesystems on SD cards. In wired environments, SYSFLOW's latency is up to 67.7% lower than NFS. In wireless scenarios, SYSFLOW performs 22.9% worse than Linux, comparable with Linux and outperforming NFS by up to 60.7%. While SYSFLOW's power consumption may be 6.7% higher than NFS, it offers energy savings due to lower processing time.
Lightweight Frequency-Based Tiering for CXL Memory Systems
Kevin Song, Jiacheng Yang, Sihang Liu, Gennady Pekhimenko
Dec 11 2023 cs.DC cs.OS arXiv:2312.04789v1

@misc{2312.04789, author = {Kevin Song and Jiacheng Yang and Sihang Liu and Gennady Pekhimenko}, title = {{L}ightweight {F}requency-{B}ased {T}iering for {CXL} {M}emory {S}ystems}, year = {2023}, eprint = {2312.04789}, note = {arXiv:2312.04789v1} }
Modern workloads are demanding increasingly larger memory capacity. Compute Express Link (CXL)-based memory tiering has emerged as a promising solution for addressing this trend by utilizing traditional DRAM alongside slow-tier CXL-memory devices in the same system. Unfortunately, most prior tiering systems are recency-based, which cannot accurately identify hot and cold pages, since a recently accessed page is not necessarily a hot page. On the other hand, more accurate frequency-based systems suffer from high memory and runtime overhead as a result of tracking large memories. In this paper, we propose FreqTier, a fast and accurate frequency-based tiering system for CXL memory. We observe that memory tiering systems can tolerate a small amount of tracking inaccuracy without compromising the overall application performance. Based on this observation, FreqTier probabilistically tracks the access frequency of each page, enabling accurate identification of hot and cold pages while maintaining minimal memory overhead. Finally, FreqTier intelligently adjusts the intensity of tiering operations based on the application's memory access behavior, thereby significantly reducing the amount of migration traffic and application interference. We evaluate FreqTier on two emulated CXL memory devices with different bandwidths. On the high bandwidth CXL device, FreqTier can outperform state-of-the-art tiering systems while using 4$\times$ less local DRAM memory for in-memory caching workloads. On GAP graph analytics and XGBoost workloads with 1:32 local DRAM to CXL-memory ratio, FreqTier outperforms prior works by 1.04$-$2.04$\times$ (1.39$\times$ on average). Even on the low bandwidth CXL device, FreqTier outperforms AutoNUMA by 1.14$\times$ on average.
Stop Hiding The Sharp Knives: The WebAssembly Linux Interface
Arjun Ramesh, Tianshu Huang, Ben L. Titzer, Anthony Rowe
Dec 08 2023 cs.OS cs.SE arXiv:2312.03858v1

@misc{2312.03858, author = {Arjun Ramesh and Tianshu Huang and Ben L.~Titzer and Anthony Rowe}, title = {{S}top {H}iding {T}he {S}harp {K}nives: {T}he {W}eb{A}ssembly {L}inux {I}nterface}, year = {2023}, eprint = {2312.03858}, note = {arXiv:2312.03858v1} }
WebAssembly is gaining popularity as a portable binary format targetable from many programming languages. With a well-specified low-level virtual instruction set, minimal memory footprint and many high-performance implementations, it has been successfully adopted for lightweight in-process memory sandboxing in many contexts. Despite these advantages, WebAssembly lacks many standard system interfaces, making it difficult to reuse existing applications. This paper proposes WALI: The WebAssembly Linux Interface, a thin layer over Linux's userspace system calls, creating a new class of virtualization where WebAssembly seamlessly interacts with native processes and the underlying operating system. By virtualizing the lowest level of userspace, WALI offers application portability with little effort and reuses existing compiler backends. With WebAssembly's control flow integrity guarantees, these modules gain an additional level of protection against remote code injection attacks. Furthermore, capability-based APIs can themselves be virtualized and implemented in terms of WALI, improving reuse and robustness through better layering. We present an implementation of WALI in a modern WebAssembly engine and evaluate its performance on a number of applications which we can now compile with mostly trivial effort.
LLM as OS, Agents as Apps: Envisioning AIOS, Agents and the AIOS-Agent Ecosystem
Yingqiang Ge, Yujie Ren, Wenyue Hua, Shuyuan Xu, Juntao Tan, Yongfeng Zhang
Dec 08 2023 cs.OS cs.AI cs.CL cs.LG arXiv:2312.03815v2

@misc{2312.03815, author = {Yingqiang Ge and Yujie Ren and Wenyue Hua and Shuyuan Xu and Juntao Tan and Yongfeng Zhang}, title = {{LLM} as {OS}, {A}gents as {A}pps: {E}nvisioning {AIOS}, {A}gents and the {AIOS}-{A}gent {E}cosystem}, year = {2023}, eprint = {2312.03815}, note = {arXiv:2312.03815v2} }
This paper envisions a revolutionary AIOS-Agent ecosystem, where a Large Language Model (LLM) serves as the (Artificial) Intelligent Operating System (IOS, or AIOS)--an operating system "with soul". Upon this foundation, a diverse range of LLM-based AI Agent Applications (Agents, or AAPs) are developed, enriching the AIOS-Agent ecosystem and signaling a paradigm shift from the traditional OS-APP ecosystem. We envision that the LLM's impact will not be limited to the AI application level; instead, it will in turn revolutionize the design and implementation of computer systems, architecture, software, and programming languages, featuring several main concepts: LLM as OS (system-level), Agents as Applications (application-level), Natural Language as Programming Interface (user-level), and Tools as Devices/Libraries (hardware/middleware-level). We begin by introducing the architecture of a traditional OS. Then we formalize a conceptual framework for AIOS through "LLM as OS (LLMOS)", drawing analogies between AIOS and a traditional OS: the LLM is likened to the OS kernel, the context window to memory, external storage to the file system, hardware tools to peripheral devices, software tools to programming libraries, and user prompts to user commands. Subsequently, we introduce the new AIOS-Agent Ecosystem, where users can easily program Agent Applications (AAPs) using natural language, democratizing the development of software, which is different from the traditional OS-APP ecosystem. Following this, we explore the diverse scope of Agent Applications. We delve into both single-agent and multi-agent systems, as well as human-agent interaction. Lastly, drawing on insights from the traditional OS-APP ecosystem, we propose a roadmap for the evolution of the AIOS-Agent ecosystem. This roadmap is designed to guide future research and development, suggesting systematic progress of AIOS and its Agent applications.
Robust Resource Partitioning Approach for ARINC 653 RTOS
Vitaly Cheptsov, Alexey Khoroshilov
Dec 05 2023 cs.OS arXiv:2312.01436v1

@misc{2312.01436, author = {Vitaly Cheptsov and Alexey Khoroshilov}, title = {{R}obust {R}esource {P}artitioning {A}pproach for {ARINC} 653 {RTOS}}, year = {2023}, eprint = {2312.01436}, note = {arXiv:2312.01436v1} }
Modern airborne operating systems implement the concept of robust time and resource partitioning imposed by the standards for aerospace and airborne-embedded software systems, such as ARINC 653. While these standards do provide a considerable number of design choices with regard to resource partitioning on the architectural and API levels, such as isolated memory spaces between the application partitions, predefined resource configuration, and unidirectional ports with limited queue and message sizes for inter-partition communication, they do not specify how an operating system should implement them in software. Furthermore, they often tend to set the minimal level of the required guarantees, for example, in terms of memory permissions, and disregard the hardware state of the art, which presently can provide considerably stronger guarantees at no extra cost. In this paper, we present an architecture of robust resource partitioning for ARINC 653 real-time operating systems based on completely static MMU configuration. The architecture was implemented on different types of airborne hardware, including platforms with TLB-based and page table-based MMU. Key benefits of the proposed approach include minimised run-time overhead and simpler verification of the memory subsystem.
MaxMem: Colocation and Performance for Big Data Applications on Tiered Main Memory Servers
Amanda Raybuck, Wei Zhang, Kayvan Mansoorshahi, Aditya K. Kamath, Mattan Erez, Simon Peter
Dec 04 2023 cs.OS arXiv:2312.00647v1

@misc{2312.00647, author = {Amanda Raybuck and Wei Zhang and Kayvan Mansoorshahi and Aditya K.~Kamath and Mattan Erez and Simon Peter}, title = {{M}ax{M}em: {C}olocation and {P}erformance for {B}ig {D}ata {A}pplications on {T}iered {M}ain {M}emory {S}ervers}, year = {2023}, eprint = {2312.00647}, note = {arXiv:2312.00647v1} }
We present MaxMem, a tiered main memory management system that aims to maximize Big Data application colocation and performance. MaxMem uses an application-agnostic and lightweight memory occupancy control mechanism based on fast memory miss ratios to provide application QoS under increasing colocation. By relying on memory access sampling and binning to quickly identify per-process memory heat gradients, MaxMem maximizes performance for many applications sharing tiered main memory simultaneously. MaxMem is designed as a user-space memory manager to be easily modifiable and extensible, without complex kernel code development. On a system with tiered main memory consisting of DRAM and Intel Optane persistent memory modules, our evaluation confirms that MaxMem provides 11% and 38% better throughput and up to 80% and an order of magnitude lower 99th percentile latency than HeMem and Linux AutoNUMA, respectively, with a Big Data key-value store in dynamic colocation scenarios.
Cascade: A Platform for Delay-Sensitive Edge Intelligence
Weijia Song, Thiago Garrett, Yuting Yang, Mingzhao Liu, Edward Tremel, Lorenzo Rosa, Andrea Merlina, Roman Vitenberg, Ken Birman
Nov 30 2023 cs.OS cs.AI arXiv:2311.17329v1

@misc{2311.17329, author = {Weijia Song and Thiago Garrett and Yuting Yang and Mingzhao Liu and Edward Tremel and Lorenzo Rosa and Andrea Merlina and Roman Vitenberg and Ken Birman}, title = {{C}ascade: {A} {P}latform for {D}elay-{S}ensitive {E}dge {I}ntelligence}, year = {2023}, eprint = {2311.17329}, note = {arXiv:2311.17329v1} }
Interactive intelligent computing applications are increasingly prevalent, creating a need for AI/ML platforms optimized to reduce per-event latency while maintaining high throughput and efficient resource management. Yet many intelligent applications run on AI/ML platforms that optimize for high throughput even at the cost of high tail-latency. Cascade is a new AI/ML hosting platform intended to untangle this puzzle. Innovations include a legacy-friendly storage layer that moves data with minimal copying and a "fast path" that collocates data and computation to maximize responsiveness. Our evaluation shows that Cascade reduces latency by orders of magnitude with no loss of throughput.
Trace-enabled Timing Model Synthesis for ROS2-based Autonomous Applications
Hazem Abaza, Debayan Roy, Shiqing Fan, Selma Saidi, Antonios Motakis
Nov 23 2023 cs.OS arXiv:2311.13333v2

@misc{2311.13333, author = {Hazem Abaza and Debayan Roy and Shiqing Fan and Selma Saidi and Antonios Motakis}, title = {{T}race-enabled {T}iming {M}odel {S}ynthesis for {ROS}2-based {A}utonomous {A}pplications}, year = {2023}, eprint = {2311.13333}, note = {arXiv:2311.13333v2} }
Autonomous applications are typically developed over Robot Operating System 2.0 (ROS2) even in time-critical systems like automotive. Recent years have seen increased interest in developing model-based timing analysis and schedule optimization approaches for ROS2-based applications. To complement these approaches, we propose a tracing and measurement framework to obtain timing models of ROS2-based applications. It offers a tracer based on extended Berkeley Packet Filter (eBPF) that probes different functions in ROS2 middleware and reads their arguments or return values to reason about the data flow in applications. It combines event traces from ROS2 and the operating system to generate a directed acyclic graph showing ROS2 callbacks, precedence relations between them, and their timing attributes. While being compatible with existing analyses, we also show how to model (i)~message synchronization, e.g., in sensor fusion, and (ii)~service requests from multiple clients, e.g., in motion planning. Considering that, in real-world scenarios, the application code might be confidential and formal models are unavailable, our framework still enables the application of existing analysis and optimization techniques.