Performance (cs.PF)

QOPS: A Compiler Framework for Quantum Circuit Simulation Acceleration with Profile Guided Optimizations
Yu-Tsung Wu, Po-Hsuan Huang, Kai-Chieh Chang, Chia-Heng Tu, Shih-Hao Hung
Oct 15 2024 quant-ph cs.PF cs.SE arXiv:2410.09326v2

@misc{2410.09326, author = {Yu-Tsung Wu and Po-Hsuan Huang and Kai-Chieh Chang and Chia-Heng Tu and Shih-Hao Hung}, title = {{QOPS}: {A} {C}ompiler {F}ramework for {Q}uantum {C}ircuit {S}imulation {A}cceleration with {P}rofile {G}uided {O}ptimizations}, year = {2024}, eprint = {2410.09326}, note = {arXiv:2410.09326v2} }
Quantum circuit simulation is important in the evolution of quantum software and hardware. Novel algorithms can be developed and evaluated by performing quantum circuit simulations on classical computers before physical quantum computers are available. Unfortunately, compared with a physical quantum computer, a prolonged simulation time hampers the rapid development of quantum algorithms. Inspired by the feedback-directed optimization scheme used by classical compilers to improve the generated code, this work proposes a quantum compiler framework, QOPS, to enable profile-guided optimization (PGO) for quantum circuit simulation acceleration. The QOPS compiler instruments a quantum simulator to collect performance data during the circuit simulation, and it then generates an optimized version of the quantum circuit based on the collected data. Experimental results show that PGO can effectively shorten the simulation time on our tested benchmark programs. In particular, the simulator-specific PGO (virtual swap) can be applied to the benchmarks to accelerate the simulation by a factor of 1.19. As for the hardware-independent PGO, the brute force mechanism (turning on all available compilation flags) achieves a 21% performance improvement over the non-optimized version, whereas the PGO achieves a 16% speedup while requiring 63 times less compilation time than the brute force approach.
Results of the Big ANN: NeurIPS'23 competition
Harsha Vardhan Simhadri, Martin Aumüller, Amir Ingber, Matthijs Douze, George Williams, Magdalen Dobson Manohar, Dmitry Baranchuk, Edo Liberty, Frank Liu, Ben Landrum, Mazin Karjikar, Laxman Dhulipala, Meng Chen, Yue Chen, Rui Ma, Kai Zhang, Yuzheng Cai, Jiayang Shi, Yizhuo Chen, Weiguo Zheng, Zihao Wan, Jie Yin, Ben Huang
Sep 27 2024 cs.IR cs.DS cs.LG cs.PF arXiv:2409.17424v1

@misc{2409.17424, author = {Harsha Vardhan Simhadri and Martin Aumüller and Amir Ingber and Matthijs Douze and George Williams and Magdalen Dobson Manohar and Dmitry Baranchuk and Edo Liberty and Frank Liu and Ben Landrum and Mazin Karjikar and Laxman Dhulipala and Meng Chen and Yue Chen and Rui Ma and Kai Zhang and Yuzheng Cai and Jiayang Shi and Yizhuo Chen and Weiguo Zheng and Zihao Wan and Jie Yin and Ben Huang}, title = {{R}esults of the {B}ig {ANN}: {N}eur{IPS}'23 competition}, year = {2024}, eprint = {2409.17424}, note = {arXiv:2409.17424v1} }
The 2023 Big ANN Challenge, held at NeurIPS 2023, focused on advancing the state of the art in indexing data structures and search algorithms for practical variants of Approximate Nearest Neighbor (ANN) search that reflect the growing complexity and diversity of workloads. Unlike prior challenges that emphasized scaling up classical ANN search \cite{DBLP:conf/nips/SimhadriWADBBCH21}, this competition addressed filtered search, out-of-distribution data, and sparse and streaming variants of ANNS. Participants developed and submitted innovative solutions that were evaluated on new standard datasets with constrained computational resources. The results showcased significant improvements in search accuracy and efficiency over industry-standard baselines, with notable contributions from both academic and industrial teams. This paper summarizes the competition tracks, datasets, evaluation metrics, and the innovative approaches of the top-performing submissions, providing insights into the current advancements and future directions in the field of approximate nearest neighbor search.
Final Report for CHESS: Cloud, High-Performance Computing, and Edge for Science and Security
Nathan Tallent, Jan Strube, Luanzheng Guo, Hyungro Lee, Jesun Firoz, Sayan Ghosh, Bo Fang, Oceane Bel, Steven Spurgeon, Sarah Akers, Christina Doty, Erol Cromwell
Oct 22 2024 cs.DC cs.CV cs.PF cs.SY eess.SY arXiv:2410.16093v1

@misc{2410.16093, author = {Nathan Tallent and Jan Strube and Luanzheng Guo and Hyungro Lee and Jesun Firoz and Sayan Ghosh and Bo Fang and Oceane Bel and Steven Spurgeon and Sarah Akers and Christina Doty and Erol Cromwell}, title = {{F}inal {R}eport for {CHESS}: {C}loud, {H}igh-{P}erformance {C}omputing, and {E}dge for {S}cience and {S}ecurity}, year = {2024}, eprint = {2410.16093}, note = {arXiv:2410.16093v1} }
Automating the theory-experiment cycle requires effective distributed workflows that utilize a computing continuum spanning lab instruments, edge sensors, computing resources at multiple facilities, data sets distributed across multiple information sources, and potentially the cloud. Unfortunately, the obvious methods for constructing continuum platforms, orchestrating workflow tasks, and curating datasets over time fail to meet scientific requirements for performance, energy, security, and reliability. Furthermore, achieving the best use of continuum resources depends upon the efficient composition and execution of workflow tasks, i.e., combinations of numerical solvers, data analytics, and machine learning. Pacific Northwest National Laboratory's LDRD "Cloud, High-Performance Computing (HPC), and Edge for Science and Security" (CHESS) has developed a set of interrelated capabilities for enabling distributed scientific workflows and curating datasets. This report describes the results and successes of CHESS from the perspective of open science.
ADS Performance Revisited
Alexander Weber, Holger Eichelberger, Jobst Hildebrand
Oct 22 2024 cs.PF arXiv:2410.15853v1

@misc{2410.15853, author = {Alexander Weber and Holger Eichelberger and Jobst Hildebrand}, title = {{ADS} {P}erformance {R}evisited}, year = {2024}, eprint = {2410.15853}, note = {arXiv:2410.15853v1} }
Real-time measurements are important for in-depth control of manufacturing processes, which, for modern AI methods, need integration with high-level languages. In our last SSP paper we investigated the performance of a Python and a Java-JNA based approach to integrate the Beckhoff ADS protocol for real-time edge communication into an Industry 4.0 platform. There, we showed that while Java outperforms Python, neither solution meets the desired goal of 1-20 kHz, depending on the task. However, we are still lacking an explanation for this result as well as an analysis of alternatives. For the first topic, we show in this paper that 1) exchanging Java-JNA with Java-JNI in this setting does not further improve the performance, 2) a C++ program realizing the same behavior in a more direct integration does not perform better, and 3) profiling shows that the majority of the execution time is spent in ADS. For the second topic, we show that alternative uses of the ADS library allow for better performance.
Industry 4.0 Connectors -- A Performance Experiment with Modbus/TCP
Christian Nikolajew, Holger Eichelberger
Oct 22 2024 cs.PF arXiv:2410.15813v1

@misc{2410.15813, author = {Christian Nikolajew and Holger Eichelberger}, title = {{I}ndustry 4.0 {C}onnectors -- {A} {P}erformance {E}xperiment with {M}odbus/{TCP}}, year = {2024}, eprint = {2410.15813}, note = {arXiv:2410.15813v1} }
For Industry 4.0 applications, communication protocols and data formats, even for legacy devices, are fundamental. In this paper, we focus on the Modbus/TCP protocol, which is used, e.g., in energy metering. Allowing Industry 4.0 applications to include data from such protocols without the need for programming would increase flexibility and, in turn, improve development efficiency. As one particular approach, we discuss the automated generation of Modbus/TCP connectors for our Open Source oktoflow platform and compare the performance of handcrafted as well as generated connectors in different settings, including industrial energy metering devices.
DRIM-ANN: An Approximate Nearest Neighbor Search Engine based on Commercial DRAM-PIMs
Mingkai Chen, Tianhua Han, Cheng Liu, Shengwen Liang, Kuai Yu, Lei Dai, Ziming Yuan, Ying Wang, Lei Zhang, Huawei Li, Xiaowei Li
Oct 22 2024 cs.PF arXiv:2410.15621v1

@misc{2410.15621, author = {Mingkai Chen and Tianhua Han and Cheng Liu and Shengwen Liang and Kuai Yu and Lei Dai and Ziming Yuan and Ying Wang and Lei Zhang and Huawei Li and Xiaowei Li}, title = {{DRIM}-{ANN}: {A}n {A}pproximate {N}earest {N}eighbor {S}earch {E}ngine based on {C}ommercial {DRAM}-{PIM}s}, year = {2024}, eprint = {2410.15621}, note = {arXiv:2410.15621v1} }
Approximate Nearest Neighbor Search (ANNS), which enables efficient semantic similarity search in large datasets, has become a fundamental component of critical applications such as information retrieval and retrieval-augmented generation (RAG). However, ANNS is a well-known I/O-intensive algorithm with a low compute-to-I/O ratio, often requiring massive storage due to the large volume of high-dimensional data. This leads to I/O bottlenecks on CPUs and memory limitations on GPUs. DRAM-based Processing-in-Memory (DRAM-PIM) architecture, which offers high bandwidth, large-capacity memory, and the ability to perform efficient computation in or near the data, presents a promising solution for ANNS. In this work, we investigate the use of commercial DRAM-PIM for ANNS for the first time and propose DRIM-ANN, an optimized ANNS engine based on DRAM-PIMs from UPMEM. Notably, given that the target DRAM-PIM exhibits an even lower compute-to-I/O ratio than basic ANNS, we leverage lookup tables (LUTs) to replace more multiplications with I/O operations. We then systematically tune ANNS to search optimized configurations with lower computational load, aligning the compute-to-I/O ratio of ANNS with that of DRAM-PIMs while maintaining accuracy constraints. Building on this tuned ANNS algorithm, we further explore implementation optimizations to fully utilize the two thousand parallel processing units with private local memory in DRAM-PIMs. To address the load imbalance caused by ANNS requests distributed across different clusters of large datasets, we propose a load-balancing strategy that combines static data layout optimization with dynamic runtime request scheduling. Experimental results on representative datasets show that DRIM-ANN achieves an average performance speedup of 2.92x compared to a 32-thread CPU counterpart.
Real-time Event Joining in Practice With Kafka and Flink
Srijan Saket, Vivek Chandela, Md. Danish Kalim
Oct 22 2024 cs.SE cs.DB cs.PF arXiv:2410.15533v1

@misc{2410.15533, author = {Srijan Saket and Vivek Chandela and Md.~Danish Kalim}, title = {{R}eal-time {E}vent {J}oining in {P}ractice {W}ith {K}afka and {F}link}, year = {2024}, eprint = {2410.15533}, note = {arXiv:2410.15533v1} }
Historically, machine learning training pipelines have predominantly relied on batch training models, retraining models every few hours. However, industrial practitioners have proved that real-time training can lead to a more adaptive and personalized user experience. The transition from batch to real-time is full of tradeoffs to get the benefits of accuracy and freshness while keeping the costs low and having a predictable, maintainable system. Our work characterizes migrating to a streaming pipeline for a machine learning model using Apache Kafka and Flink. We demonstrate how to transition from Google Pub/Sub to Kafka to handle incoming real-time events and leverage Flink for streaming joins using RocksDB and checkpointing. We also address challenges such as managing causal dependencies between events, balancing event time versus processing time, and ensuring exactly-once versus at-least-once delivery guarantees, among other issues. Furthermore, we showcase how we improved scalability by using topic partitioning in Kafka, reduced event throughput by 85% through the use of Avro schema and compression, decreased costs by 40%, and implemented a separate pipeline to ensure data correctness. Our findings provide valuable insights into the tradeoffs and complexities of real-time systems, enabling better-informed decisions tailored to specific requirements for building effective streaming systems that enhance user satisfaction.
EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models
Junhao Hu, Wenrui Huang, Haoyi Wang, Weidong Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie
Oct 22 2024 cs.LG cs.CL cs.DC cs.PF arXiv:2410.15332v1

@misc{2410.15332, author = {Junhao Hu and Wenrui Huang and Haoyi Wang and Weidong Wang and Tiancheng Hu and Qin Zhang and Hao Feng and Xusheng Chen and Yizhou Shan and Tao Xie}, title = {{EPIC}: {E}fficient {P}osition-{I}ndependent {C}ontext {C}aching for {S}erving {L}arge {L}anguage {M}odels}, year = {2024}, eprint = {2410.15332}, note = {arXiv:2410.15332v1} }
Large Language Models (LLMs) are critical for a wide range of applications, but serving them efficiently becomes increasingly challenging as inputs become more complex. Context caching improves serving performance by exploiting inter-request dependency and reusing key-value (KV) cache across requests, thus improving time-to-first-token (TTFT). However, existing prefix-based context caching requires exact token prefix matches, limiting cache reuse in few-shot learning, multi-document QA, or retrieval-augmented generation, where prefixes may vary. In this paper, we present EPIC, an LLM serving system that introduces position-independent context caching (PIC), enabling modular KV cache reuse regardless of token chunk position (or prefix). EPIC features two key designs: AttnLink, which leverages static attention sparsity to minimize recomputation for accuracy recovery, and KVSplit, a customizable chunking method that preserves semantic coherence. Our experiments demonstrate that EPIC delivers up to 8x improvements in TTFT and 7x throughput over existing systems, with negligible or no accuracy loss. By addressing the limitations of traditional caching approaches, EPIC enables more scalable and efficient LLM inference.
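To make the contrast between prefix-based and position-independent caching concrete, here is a minimal sketch that only counts how many tokens could be reused under each policy. It is not EPIC's implementation: the function names, the token lists, and the whole-chunk matching rule are assumptions made for illustration, and the sketch ignores the recomputation that AttnLink performs for accuracy recovery.

```python
def prefix_reusable_tokens(cached_tokens, request_tokens):
    """Tokens reusable under exact-prefix matching: length of the common prefix."""
    n = 0
    for a, b in zip(cached_tokens, request_tokens):
        if a != b:
            break
        n += 1
    return n


def chunk_reusable_tokens(cached_chunks, request_chunks):
    """Tokens reusable when a cached chunk may be reused at any position (hypothetical rule)."""
    cached = {tuple(chunk) for chunk in cached_chunks}
    return sum(len(chunk) for chunk in request_chunks if tuple(chunk) in cached)


# Two documents cached by an earlier request, then requested in a different order.
doc_a = ["a1", "a2", "a3"]
doc_b = ["b1", "b2"]
question = ["q"]

earlier = [doc_a, doc_b]
later = [doc_b, doc_a, question]


def flatten(chunks):
    """Concatenate chunks into one token sequence."""
    return [tok for chunk in chunks for tok in chunk]


prefix_hits = prefix_reusable_tokens(flatten(earlier), flatten(later))
chunk_hits = chunk_reusable_tokens(earlier, later)
```

In this toy case the reordered documents share no prefix with the earlier request, so prefix matching reuses nothing, while chunk-level matching can reuse both documents.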
The Sunk Carbon Fallacy: Rethinking Carbon Footprint Metrics for Effective Carbon-Aware Scheduling
Noman Bashir, Varun Gohil, Anagha Belavadi, Mohammad Shahrad, David Irwin, Elsa Olivetti, Christina Delimitrou
Oct 22 2024 cs.DC cs.CY cs.ET cs.PF arXiv:2410.15087v1

@misc{2410.15087, author = {Noman Bashir and Varun Gohil and Anagha Belavadi and Mohammad Shahrad and David Irwin and Elsa Olivetti and Christina Delimitrou}, title = {{T}he {S}unk {C}arbon {F}allacy: {R}ethinking {C}arbon {F}ootprint {M}etrics for {E}ffective {C}arbon-{A}ware {S}cheduling}, year = {2024}, eprint = {2410.15087}, doi = {10.1145/3698038.3698542}, note = {arXiv:2410.15087v1} }
The rapid increase in computing demand and its corresponding energy consumption have focused attention on computing's impact on the climate and sustainability. Prior work proposes metrics that quantify computing's carbon footprint across several lifecycle phases, including its supply chain, operation, and end-of-life. Industry uses these metrics to optimize the carbon footprint of manufacturing hardware and running computing applications. Unfortunately, prior work on optimizing datacenters' carbon footprint often succumbs to the sunk cost fallacy by considering embodied carbon emissions (a sunk cost) when making operational decisions (i.e., job scheduling and placement), which leads to operational decisions that do not always reduce the total carbon footprint. In this paper, we evaluate carbon-aware job scheduling and placement on a given set of servers for a number of carbon accounting metrics. Our analysis reveals that state-of-the-art carbon accounting metrics that include embodied carbon emissions when making operational decisions can actually increase the total carbon footprint of executing a set of jobs. We study the factors that affect the added carbon cost of such suboptimal decision-making. We then use a real-world case study from a datacenter to demonstrate how the sunk carbon fallacy manifests itself in practice. Finally, we discuss the implications of our findings in better guiding effective carbon-aware scheduling in on-premise and cloud datacenters.
Towards Safer Heuristics With XPlain
Pantea Karimi, Solal Pirelli, Siva Kesava Reddy Kakarla, Ryan Beckett, Santiago Segarra, Beibin Li, Pooria Namyar, Behnaz Arzani
Oct 22 2024 cs.AI cs.CL cs.DC cs.NI cs.PF arXiv:2410.15086v1

@misc{2410.15086, author = {Pantea Karimi and Solal Pirelli and Siva Kesava Reddy Kakarla and Ryan Beckett and Santiago Segarra and Beibin Li and Pooria Namyar and Behnaz Arzani}, title = {{T}owards {S}afer {H}euristics {W}ith {XP}lain}, year = {2024}, eprint = {2410.15086}, note = {arXiv:2410.15086v1} }
Many problems that cloud operators solve are computationally expensive, and operators often use heuristic algorithms (that are faster and scale better than optimal) to solve them more efficiently. Heuristic analyzers enable operators to find when and by how much their heuristics underperform. However, these tools do not provide enough detail for operators to mitigate the heuristic's impact in practice: they only discover a single input instance that causes the heuristic to underperform (and not the full set), and they do not explain why. We propose XPlain, a tool that extends these analyzers and helps operators understand when and why their heuristics underperform. We present promising initial results that show such an extension is viable.
Approximating Spanning Centrality with Random Bouquets
Gökhan Göktürk, Kamer Kaya
Oct 21 2024 cs.SI cs.DC cs.PF arXiv:2410.14056v1

@misc{2410.14056, author = {Gökhan Göktürk and Kamer Kaya}, title = {{A}pproximating {S}panning {C}entrality with {R}andom {B}ouquets}, year = {2024}, eprint = {2410.14056}, note = {arXiv:2410.14056v1} }
Spanning Centrality is a measure used in network analysis to determine the importance of an edge in a graph based on its contribution to the connectivity of the entire network. Specifically, it quantifies how critical an edge is in terms of the number of spanning trees that include that edge. The current state-of-the-art for All Edges Spanning Centrality (AESC), which computes the exact centrality values for all the edges, has a time complexity of $\mathcal{O}(mn^{3/2})$ for $n$ vertices and $m$ edges. This makes the computation infeasible even for moderately sized graphs. Instead, there exist approximation algorithms which process a large number of random walks to estimate edge centralities. However, even the approximation algorithms can be computationally overwhelming, especially if the approximation error bound is small. In this work, we propose a novel, hash-based sampling method and a vectorized algorithm which greatly improves the execution time by clustering random walks into Bouquets. On synthetic random walk benchmarks, Bouquets performs $7.8\times$ faster compared to naive, traditional random-walk generation. We also show that the proposed technique is scalable by employing it within a state-of-the-art AESC approximation algorithm, TGT+. The experiments show that using Bouquets yields more than $100\times$ speed-up via parallelization with 16 threads.
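As a hedged illustration of the definition only (not the Bouquets sampling method or TGT+), the following brute-force sketch enumerates the spanning trees of a tiny graph and reports, for each edge, the fraction of spanning trees that contain it. The function names are hypothetical, normalizing by the total number of spanning trees is an assumption about how the measure is scaled, and the enumeration is exponential, so it is suitable for toy inputs only.

```python
from fractions import Fraction
from itertools import combinations


def is_spanning_tree(n, edges):
    """Return True if the given n - 1 edges connect all n vertices without a cycle."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False  # this edge would close a cycle
        parent[ru] = rv
    return True  # n - 1 acyclic edges on n vertices form a spanning tree


def spanning_centrality(n, edges):
    """Map each edge to the fraction of spanning trees that include it."""
    trees = [t for t in combinations(edges, n - 1) if is_spanning_tree(n, t)]
    if not trees:
        raise ValueError("graph is not connected")
    return {e: Fraction(sum(e in t for t in trees), len(trees)) for e in edges}


# A triangle (0, 1, 2) with a pendant edge to vertex 3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
centrality = spanning_centrality(4, edges)
```

In this example the pendant edge `(2, 3)` appears in every spanning tree, while each triangle edge appears in two of the three.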
DiFuseR: A Distributed Sketch-based Influence Maximization Algorithm for GPUs
Gökhan Göktürk, Kamer Kaya
Oct 21 2024 cs.DC cs.PF cs.SI arXiv:2410.14047v1

@misc{2410.14047, author = {Gökhan Göktürk and Kamer Kaya}, title = {{D}i{F}use{R}: {A} {D}istributed {S}ketch-based {I}nfluence {M}aximization {A}lgorithm for {GPU}s}, year = {2024}, eprint = {2410.14047}, howpublished = {J Supercomput 81, 21 (2025).}, note = {arXiv:2410.14047v1} }
Influence Maximization (IM) aims to find a given number of "seed" vertices that can effectively maximize the expected spread under a given diffusion model. Due to the NP-Hardness of finding an optimal seed set, approximation algorithms are often used for IM. However, these algorithms require a large number of simulations to find good seed sets. In this work, we propose DiFuseR, a blazing-fast, high-quality IM algorithm that can run on multiple GPUs in a distributed setting. DiFuseR is designed to increase GPU utilization, reduce inter-node communication, and minimize overlapping data/computation among the nodes. Based on the experiments with various graphs, containing some of the largest networks available, and diffusion settings, the proposed approach is found to be 3.2x and 12x faster on average on a single GPU and 8 GPUs, respectively. It can achieve up to 8x and 233.7x speedup on the same hardware settings. Furthermore, thanks to its smart load-balancing mechanism, on 8 GPUs, it is on average 5.6x faster compared to its single-GPU performance.
A Zoned Storage Optimized Flash Cache on ZNS SSDs
Chongzhuo Yang, Chang Guo, Ming Zhao, Zhichao Cao
Oct 16 2024 cs.PF arXiv:2410.11260v1

@misc{2410.11260, author = {Chongzhuo Yang and Chang Guo and Ming Zhao and Zhichao Cao}, title = {{A} {Z}oned {S}torage {O}ptimized {F}lash {C}ache on {ZNS} {SSD}s}, year = {2024}, eprint = {2410.11260}, note = {arXiv:2410.11260v1} }
Zoned Namespace SSDs (ZNS) were recently introduced to mitigate the block interface penalties of flash-based SSDs. This is a good opportunity for flash caches to address cache throughput and write amplification (WA) issues by fully controlling data allocation and garbage collection via zone-based interfaces. However, several critical challenges need to be addressed, including zone-interface compatibility, data management for large zone sizes, and a better tradeoff between throughput, cache hit ratio, and WA. In this paper, we present Z-CacheLib, a zoned storage optimized flash cache on ZNS SSDs. In Z-CacheLib, we propose: 1) a new zStorage Engine for ZNS SSDs with low mapping and operational overhead, and 2) a novel zCache Engine with cross-layer optimizations to resolve the throughput regression and WA issues of garbage collection, which consists of delayed data eviction with virtual over-provisioning (vOP), a top-down eviction policy (zLRU) optimized from LRU, and a bottom-up drop mechanism (zDrop) for low WA. Our evaluation shows that Z-CacheLib can achieve up to 2X throughput, a 5% hit ratio improvement, and almost no WA compared to CacheLib with compatible regular SSDs, demonstrating the benefits of using ZNS SSDs for caching. Moreover, Z-CacheLib can achieve up to 6X throughput and a 92% WA reduction compared with an F2FS-based scheme.
On Efficient Topology Management in Service-Oriented 6G Networks: An Edge Video Distribution Case Study
Zied Ennaceur, Mounir Bensalem, Admela Jukan, Claus Keuker, Huanzhuo Wu, Rastin Pries
Oct 15 2024 cs.NI cs.PF arXiv:2410.10338v1

@misc{2410.10338, author = {Zied Ennaceur and Mounir Bensalem and Admela Jukan and Claus Keuker and Huanzhuo Wu and Rastin Pries}, title = {{O}n {E}fficient {T}opology {M}anagement in {S}ervice-{O}riented 6{G} {N}etworks: {A}n {E}dge {V}ideo {D}istribution {C}ase {S}tudy}, year = {2024}, eprint = {2410.10338}, note = {arXiv:2410.10338v1} }
Efficient topology management in future 6G networks is one of the fundamental challenges for dynamic network creation based on location services, whereby each autonomous network entity, i.e., a sub-network, can be created for a specific application scenario. In this paper, we study the performance of a novel topology change management system in a sample 6G network that is dynamically organized into autonomous sub-networks. We propose and analyze an algorithm for intelligent prediction of topology changes and provide a comparative analysis with a topology-monitoring-based approach. To this end, we present an industrially relevant case study on edge video distribution, as it is envisioned to be implemented in line with the 3GPP and ETSI MEC (Multi-access Edge Computing) standards. For change prediction, we implement and analyze a novel topology change prediction algorithm, which can automatically optimize, train, and finally select the best of the different machine learning models available, based on the specific scenario under study. For the link change scenario, the results show that three selected ML models exhibit high accuracy in detecting changes in link delay and bandwidth using measured throughput and RTT. ANN demonstrates the best performance in identifying cases with no changes, slightly outperforming random forest and XGBoost. For the user mobility scenario, XGBoost is more efficient in learning patterns for topology change prediction while delivering much faster results compared to the more computationally demanding deep learning models, such as LSTM and CNN. In terms of cost efficiency, our ML-based approach represents a significantly more cost-effective alternative to traditional monitoring approaches.
SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs
Mohammad Mozaffari, Maryam Mehri Dehnavi
Oct 15 2024 cs.LG cs.AI cs.PF arXiv:2410.09615v1

@misc{2410.09615, author = {Mohammad Mozaffari and Maryam Mehri Dehnavi}, title = {{SL}i{M}: {O}ne-shot {Q}uantized {S}parse {P}lus {L}ow-rank {A}pproximation of {LLM}s}, year = {2024}, eprint = {2410.09615}, note = {arXiv:2410.09615v1} }
Large Language Models (LLMs) have revolutionized natural language understanding and generation tasks but suffer from high memory consumption and slow inference times due to their large parameter sizes. Traditional model compression techniques, such as quantization and pruning, mitigate these issues but often require retraining to maintain accuracy, which is computationally expensive. This paper introduces SLiM, a novel approach for compressing LLMs using a one-shot Quantized Sparse Plus Low-rank Approximation. SLiM eliminates the need for costly retraining by combining a symmetric quantization method (SLiM-Quant) with a saliency-based low-rank approximation. Our method reduces quantization error while leveraging sparse representations compatible with accelerated hardware architectures. Additionally, we propose a parameter-efficient fine-tuning recipe that significantly reduces overhead compared to conventional quantization-aware training. SLiM achieves up to a 5.4% improvement in model accuracy for sparsity patterns like 2:4, and the fine-tuning step further enhances accuracy by up to 5.8%, demonstrating state-of-the-art performance. This work provides a pathway for efficiently deploying large models in memory-constrained environments without compromising accuracy.
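For readers unfamiliar with the 2:4 sparsity pattern mentioned above, the sketch below shows what the pattern means: in every group of four consecutive weights, at most two are non-zero. The selection rule used here (keep the two largest magnitudes, ties broken by lower index) is a plain magnitude-pruning assumption for illustration and is not SLiM's saliency-based method; the function name is hypothetical.

```python
def prune_2_4(weights):
    """Zero out two of every four consecutive weights, keeping the two largest magnitudes."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest magnitudes; ties go to the lower index.
        keep = sorted(range(4), key=lambda i: (-abs(group[i]), i))[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.1, -0.9, 0.5, 0.2, 1.0, 1.0, 1.0, 1.0]
sparse_row = prune_2_4(row)
```

Every group of four in `sparse_row` has exactly two zeros, which is the structure that hardware with 2:4 sparse support expects.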
Testing the Unknown: A Framework for OpenMP Testing via Random Program Generation
Ignacio Laguna, Patrick Chapman, Konstantinos Parasyris, Giorgis Georgakoudis, Cindy Rubio-González
Oct 15 2024 cs.SE cs.PF cs.PL arXiv:2410.09191v1

@misc{2410.09191, author = {Ignacio Laguna and Patrick Chapman and Konstantinos Parasyris and Giorgis Georgakoudis and Cindy Rubio-González}, title = {{T}esting the {U}nknown: {A} {F}ramework for {O}pen{MP} {T}esting via {R}andom {P}rogram {G}eneration}, year = {2024}, eprint = {2410.09191}, note = {arXiv:2410.09191v1} }
We present a randomized differential testing approach to test OpenMP implementations. In contrast to previous work that manually creates dozens of verification and validation tests, our approach is able to randomly generate thousands of tests, exposing OpenMP implementations to a wide range of program behaviors. We represent the space of possible random OpenMP tests using a grammar and implement our method as an extension of the Varity program generator. By generating 1,800 OpenMP tests, we find various performance anomalies and correctness issues when we apply it to three OpenMP implementations: GCC, Clang, and Intel. We also present several case studies that analyze the anomalies and give more details about the classes of tests that our approach creates.
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Andrey Anurin, Jonathan Ng, Kibo Schaffer, Ziyue Wang, Jason Schreiber, Esben Kran
Oct 15 2024 cs.CR cs.AI cs.LG cs.PF arXiv:2410.09114v1

@misc{2410.09114, author = {Andrey Anurin and Jonathan Ng and Kibo Schaffer and Ziyue Wang and Jason Schreiber and Esben Kran}, title = {{C}atastrophic {C}yber {C}apabilities {B}enchmark (3{CB}): {R}obustly {E}valuating {LLM} {A}gent {C}yber {O}ffense {C}apabilities}, year = {2024}, eprint = {2410.09114}, note = {arXiv:2410.09114v1} }
LLM agents have the potential to revolutionize defensive cyber operations, but their offensive capabilities are not yet fully understood. To prepare for emerging threats, model developers and governments are evaluating the cyber capabilities of foundation models. However, these assessments often lack transparency and a comprehensive focus on offensive capabilities. In response, we introduce the Catastrophic Cyber Capabilities Benchmark (3CB), a novel framework designed to rigorously assess the real-world offensive capabilities of LLM agents. Our evaluation of modern LLMs on 3CB reveals that frontier models, such as GPT-4o and Claude 3.5 Sonnet, can perform offensive tasks such as reconnaissance and exploitation across domains ranging from binary analysis to web technologies. Conversely, smaller open-source models exhibit limited offensive capabilities. Our software solution and the corresponding benchmark provide a critical tool to reduce the gap between rapidly improving capabilities and the robustness of cyber offense evaluations, aiding in the safer deployment and regulation of these powerful technologies.
Unlocking FedNL: Self-Contained Compute-Optimized Implementation
Konstantin Burlachenko, Peter Richtárik
Oct 14 2024 cs.LG cs.AI cs.MS cs.PF math.OC arXiv:2410.08760v1

@misc{2410.08760, author = {Konstantin Burlachenko and Peter Richtárik}, title = {{U}nlocking {F}ed{NL}: {S}elf-{C}ontained {C}ompute-{O}ptimized {I}mplementation}, year = {2024}, eprint = {2410.08760}, note = {arXiv:2410.08760v1} }
Federated Learning (FL) is an emerging paradigm that enables intelligent agents to collaboratively train Machine Learning (ML) models in a distributed manner, eliminating the need for sharing their local data. The recent work (arXiv:2106.02969) introduces a family of Federated Newton Learn (FedNL) algorithms, marking a significant step towards applying second-order methods to FL and large-scale optimization. However, the reference FedNL prototype exhibits three serious practical drawbacks: (i) it requires 4.8 hours to launch a single experiment on a server-grade workstation; (ii) the prototype only simulates a multi-node setting; (iii) integrating the prototype into resource-constrained applications is challenging. To bridge the gap between theory and practice, we present a self-contained implementation of FedNL, FedNL-LS, and FedNL-PP for single-node and multi-node settings. Our work resolves the aforementioned issues and reduces the wall clock time by a factor of 1000. With this, FedNL outperforms alternatives for training logistic regression in a single-node setting (CVXPY, arXiv:1603.00943) and in a multi-node setting (Apache Spark, arXiv:1505.06807; Ray/Scikit-Learn, arXiv:1712.05889). Finally, we propose two practically oriented compressors for FedNL, adaptive TopLEK and cache-aware RandSeqK, which fulfill the theory of FedNL.
AsyncFS: Metadata Updates Made Asynchronous for Distributed Filesystems with In-Network Coordination
Jingwei Xu, Mingkai Dong, Qiulin Tian, Ziyi Tian, Tong Xin, Haibo Chen
Oct 14 2024 cs.DC cs.OS cs.PF arXiv:2410.08618v1

@misc{2410.08618, author = {Jingwei Xu and Mingkai Dong and Qiulin Tian and Ziyi Tian and Tong Xin and Haibo Chen}, title = {{A}sync{FS}: {M}etadata {U}pdates {M}ade {A}synchronous for {D}istributed {F}ilesystems with {I}n-{N}etwork {C}oordination}, year = {2024}, eprint = {2410.08618}, note = {arXiv:2410.08618v1} }
Distributed filesystems typically employ synchronous metadata updates, facing inherent challenges for access efficiency, load balancing, and directory contention, especially under dynamic and skewed workloads. This paper argues that synchronous updates are overly conservative for distributed filesystems. We propose AsyncFS with asynchronous metadata updates, allowing operations to return early and defer directory updates until the respective read, which enables latency hiding and conflict resolution. The key challenge is efficiently maintaining the synchronous semantics of metadata updates. To address this, AsyncFS is co-designed with a programmable switch, leveraging the constrained on-switch resources to holistically track directory states in the network at negligible cost. This allows AsyncFS to aggregate delayed updates in a timely manner and apply them efficiently, using batching and consolidation, before directory reads. Evaluation shows that AsyncFS achieves up to 13.34$\times$ and 3.85$\times$ higher throughput, and 61.6% and 57.3% lower latency than two state-of-the-art distributed filesystems, InfiniFS and CFS-KV, respectively, on skewed workloads. For real-world workloads, AsyncFS improves end-to-end throughput by 21.1$\times$, 1.1$\times$ and 30.1% over Ceph, IndexFS and CFS-KV, respectively.
Neural Architecture Search of Hybrid Models for NPU-CIM Heterogeneous AR/VR Devices
Yiwei Zhao, Ziyun Li, Win-San Khwa, Xiaoyu Sun, Sai Qian Zhang, Syed Shakib Sarwar, Kleber Hugo Stangherlin, Yi-Lun Lu, Jorge Tomas Gomez, Jae-Sun Seo, Phillip B. Gibbons, Barbara De Salvo, Chiao Liu
Oct 14 2024 cs.CV cs.AR cs.LG cs.PF arXiv:2410.08326v1

@misc{2410.08326, author = {Yiwei Zhao and Ziyun Li and Win-San Khwa and Xiaoyu Sun and Sai Qian Zhang and Syed Shakib Sarwar and Kleber Hugo Stangherlin and Yi-Lun Lu and Jorge Tomas Gomez and Jae-Sun Seo and Phillip B.~Gibbons and Barbara De Salvo and Chiao Liu}, title = {{N}eural {A}rchitecture {S}earch of {H}ybrid {M}odels for {NPU}-{CIM} {H}eterogeneous {AR}/{VR} {D}evices}, year = {2024}, eprint = {2410.08326}, note = {arXiv:2410.08326v1} }
Low-Latency and Low-Power Edge AI is essential for Virtual Reality and Augmented Reality applications. Recent advances show that hybrid models, combining convolution layers (CNN) and transformers (ViT), often achieve superior accuracy/performance tradeoff on various computer vision and machine learning (ML) tasks. However, hybrid ML models can pose system challenges for latency and energy-efficiency due to their diverse nature in dataflow and memory access patterns. In this work, we leverage the architecture heterogeneity from Neural Processing Units (NPU) and Compute-In-Memory (CIM) and perform diverse execution schemas to efficiently execute these hybrid models. We also introduce H4H-NAS, a Neural Architecture Search framework to design efficient hybrid CNN/ViT models for heterogeneous edge systems with both NPU and CIM. Our H4H-NAS approach is powered by a performance estimator built with NPU performance results measured on real silicon, and CIM performance based on industry IPs. H4H-NAS searches hybrid CNN/ViT models with fine granularity and achieves significant (up to 1.34%) top-1 accuracy improvement on ImageNet dataset. Moreover, results from our Algo/HW co-design reveal up to 56.08% overall latency and 41.72% energy improvements by introducing such heterogeneous computing over baseline solutions. The framework guides the design of hybrid network architectures and system architectures of NPU+CIM heterogeneous systems.
Plug-and-Play Performance Estimation for LLM Services without Relying on Labeled Data
Can Wang, Dianbo Sui, Hongliang Sun, Hao Ding, Bolin Zhang, Zhiying Tu
Oct 11 2024 cs.PF cs.LG arXiv:2410.07737v1

@misc{2410.07737, author = {Can Wang and Dianbo Sui and Hongliang Sun and Hao Ding and Bolin Zhang and Zhiying Tu}, title = {{P}lug-and-{P}lay {P}erformance {E}stimation for {LLM} {S}ervices without {R}elying on {L}abeled {D}ata}, year = {2024}, eprint = {2410.07737}, note = {arXiv:2410.07737v1} }
Large Language Model (LLM) services exhibit impressive capability on unlearned tasks, leveraging only a few examples through in-context learning (ICL). However, the success of ICL varies depending on the task and context, leading to heterogeneous service quality. Directly estimating the performance of LLM services at each invocation can be laborious, especially when it requires abundant labeled data or internal information within the LLM. This paper introduces a novel method to estimate the performance of LLM services across different tasks and contexts, which can be "plug-and-play", utilizing only a few unlabeled samples as ICL does. Our findings suggest that the negative log-likelihood and perplexity derived from LLM service invocation can function as effective and significant features. Based on these features, we utilize four distinct meta-models to estimate the performance of LLM services. Our proposed method is compared against unlabeled estimation baselines across multiple LLM services and tasks. It is also experimentally applied to two scenarios, demonstrating its effectiveness in the selection and further optimization of LLM services.
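As a rough illustration of the two features named in the abstract above, the sketch below computes the mean negative log-likelihood and the perplexity from per-token log-probabilities. The function name and the input format (a list of natural-log token probabilities, as many LLM services can return) are assumptions made for illustration; the paper's own feature extraction may differ.

```python
import math

def nll_and_perplexity(token_logprobs):
    """Return (mean NLL, perplexity) for natural-log token probabilities.

    Hypothetical helper: the name and the input format are illustrative only.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Mean negative log-likelihood per token.
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    # Perplexity is the exponential of the mean NLL.
    return mean_nll, math.exp(mean_nll)
```

A response whose tokens each had probability 0.5 would yield a mean NLL of ln 2 and a perplexity of 2.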
An Analysis of XML Compression Efficiency
Christopher James Augeri, Barry E. Mullins, Leemon C. Baird III, Dursun A. Bulutoglu, Rusty O. Baldwin
Oct 11 2024 cs.DB cs.IT cs.PF math.IT arXiv:2410.07603v1

@misc{2410.07603, author = {Christopher James Augeri and Barry E.~Mullins and Leemon C.~Baird III and Dursun A.~Bulutoglu and Rusty O.~Baldwin}, title = {{A}n {A}nalysis of {XML} {C}ompression {E}fficiency}, year = {2024}, eprint = {2410.07603}, howpublished = {Proceedings of the 2007 workshop on Experimental Computer Science (ExpCS) at ACM FCRC 2007}, doi = {10.1145/1281700.1281707}, note = {arXiv:2410.07603v1} }
XML simplifies data exchange among heterogeneous computers, but it is notoriously verbose and has spawned the development of many XML-specific compressors and binary formats. We present an XML test corpus and a combined efficiency metric integrating compression ratio and execution speed. We use this corpus and linear regression to assess 14 general-purpose and XML-specific compressors relative to the proposed metric. We also identify key factors when selecting a compressor. Our results show XMill or WBXML may be useful in some instances, but a general-purpose compressor is often the best choice.
Serverless Cold Starts and Where to Find Them
Artjom Joosen, Ahmed Hassan, Martin Asenov, Rajkarn Singh, Luke Darlow, Jianfeng Wang, Qiwen Deng, Adam Barker
Oct 10 2024 cs.DC cs.OS cs.PF arXiv:2410.06145v1

@misc{2410.06145, author = {Artjom Joosen and Ahmed Hassan and Martin Asenov and Rajkarn Singh and Luke Darlow and Jianfeng Wang and Qiwen Deng and Adam Barker}, title = {{S}erverless {C}old {S}tarts and {W}here to {F}ind {T}hem}, year = {2024}, eprint = {2410.06145}, note = {arXiv:2410.06145v1} }
This paper releases and analyzes a month-long trace of 85 billion user requests and 11.9 million cold starts from Huawei's serverless cloud platform. Our analysis spans workloads from five data centers. We focus on cold starts and provide a comprehensive examination of the underlying factors influencing the number and duration of cold starts. These factors include trigger types, request synchronicity, runtime languages, and function resource allocations. We investigate components of cold starts, including pod allocation time, code and dependency deployment time, and scheduling delays, and examine their relationships with runtime languages, trigger types, and resource allocation. We introduce the pod utility ratio to measure a pod's useful lifetime relative to its cold start time, giving a more complete picture of cold starts, and see that some pods with long cold start times have longer useful lifetimes. Our findings reveal the complexity and multifaceted origins of the number, duration, and characteristics of cold starts, driven by differences in trigger types, runtime languages, and function resource allocations. For example, cold starts in Region 1 take up to 7 seconds, dominated by dependency deployment time and scheduling. In Region 2, cold starts take up to 3 seconds and are dominated by pod allocation time. Based on this, we identify opportunities to reduce the number and duration of cold starts using strategies for multi-region scheduling. Finally, we suggest directions for future research to address these challenges and enhance the performance of serverless cloud platforms. Our datasets and code are available at https://github.com/sir-lab/data-release
It's Not Easy Being Green: On the Energy Efficiency of Programming Languages
Nicolas van Kempen, Hyuk-Je Kwon, Dung Tuan Nguyen, Emery D. Berger
Oct 10 2024 cs.PL cs.PF arXiv:2410.05460v1

@misc{2410.05460, author = {Nicolas van Kempen and Hyuk-Je Kwon and Dung Tuan Nguyen and Emery D.~Berger}, title = {{I}t's {N}ot {E}asy {B}eing {G}reen: {O}n the {E}nergy {E}fficiency of {P}rogramming {L}anguages}, year = {2024}, eprint = {2410.05460}, note = {arXiv:2410.05460v1} }
Does the choice of programming language affect energy consumption? Previous highly visible studies have established associations between certain programming languages and energy consumption. A causal misinterpretation of this work has led academics and industry leaders to use or support certain languages based on their claimed impact on energy consumption. This paper tackles this causal question directly. It first corrects and improves the measurement methodology used by prior work. It then develops a detailed causal model capturing the complex relationship between programming language choice and energy consumption. This model identifies and incorporates several critical but previously overlooked factors that affect energy usage. These factors, such as distinguishing programming languages from their implementations, the impact of the application implementations themselves, the number of active cores, and memory activity, can significantly skew energy consumption measurements if not accounted for. We show -- via empirical experiments, improved methodology, and careful examination of anomalies -- that when these factors are controlled for, notable discrepancies in prior work vanish. Our analysis suggests that the choice of programming language implementation has no significant impact on energy consumption beyond execution time.
AraSync: Precision Time Synchronization in Rural Wireless Living Lab
Md Nadim, Taimoor Ul Islam, Salil Reddy, Tianyi Zhang, Zhibo Meng, Reshal Afzal, Sarath Babu, Arsalan Ahmad, Daji Qiao, Anish Arora, Hongwei Zhang
Oct 07 2024 cs.NI cs.PF arXiv:2410.03583v1

@misc{2410.03583, author = {Md Nadim and Taimoor Ul Islam and Salil Reddy and Tianyi Zhang and Zhibo Meng and Reshal Afzal and Sarath Babu and Arsalan Ahmad and Daji Qiao and Anish Arora and Hongwei Zhang}, title = {{A}ra{S}ync: {P}recision {T}ime {S}ynchronization in {R}ural {W}ireless {L}iving {L}ab}, year = {2024}, eprint = {2410.03583}, doi = {10.1145/3636534.3697318}, note = {arXiv:2410.03583v1} }
Time synchronization is a critical component in network operation and management, and it is also required by Ultra-Reliable, Low-Latency Communications (URLLC) in next-generation wireless systems such as those of 5G, 6G, and Open RAN. In this context, we design and implement AraSync as an end-to-end time synchronization system in the ARA wireless living lab to enable advanced wireless experiments and applications involving stringent time constraints. We make use of Precision Time Protocol (PTP) at different levels to achieve synchronization accuracy in the order of nanoseconds. Along with fiber networks, AraSync enables time synchronization across the AraHaul wireless x-haul network consisting of long-range, high-capacity mmWave and microwave links. In this paper, we present the detailed design and implementation of AraSync, including its hardware and software components and the PTP network topology. Further, we experimentally characterize the performance of AraSync from spatial and temporal dimensions. Our measurement and analysis of the clock offset and mean path delay show the impact of the wireless channel and weather conditions on the PTP synchronization accuracy.
EXAQ: Exponent Aware Quantization For LLMs Acceleration
Moran Shkolnik, Maxim Fishman, Brian Chmiel, Hilla Ben-Yaacov, Ron Banner, Kfir Yehuda Levy
Oct 07 2024 cs.LG cs.AI cs.PF arXiv:2410.03185v1

@misc{2410.03185, author = {Moran Shkolnik and Maxim Fishman and Brian Chmiel and Hilla Ben-Yaacov and Ron Banner and Kfir Yehuda Levy}, title = {{EXAQ}: {E}xponent {A}ware {Q}uantization {F}or {LLM}s {A}cceleration}, year = {2024}, eprint = {2410.03185}, note = {arXiv:2410.03185v1} }
Quantization has established itself as the primary approach for decreasing the computational and storage expenses associated with Large Language Model (LLM) inference. The majority of current research emphasizes quantizing weights and activations to enable low-bit general-matrix-multiply (GEMM) operations, with the remaining non-linear operations executed at higher precision. In our study, we discovered that following the application of these techniques, the primary bottleneck in LLM inference lies in the softmax layer. The softmax operation comprises three phases: exponent calculation, accumulation, and normalization. Our work focuses on optimizing the first two phases. We propose an analytical approach to determine the optimal clipping value for the input to the softmax function, enabling sub-4-bit quantization for LLM inference. This method accelerates the calculations of both $e^x$ and $\sum(e^x)$ with minimal to no accuracy degradation. For example, in LLaMA1-30B, we achieve baseline performance with 2-bit quantization on the well-known "Physical Interaction: Question Answering" (PIQA) dataset evaluation. This ultra-low bit quantization allows, for the first time, an acceleration of approximately 4x in the accumulation phase. The combination of accelerating both $e^x$ and $\sum(e^x)$ results in a 36.9% acceleration in the softmax operation.
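The three softmax phases described above (exponent calculation, accumulation, normalization) can be sketched with a clipped, low-bit input as follows. This is a minimal sketch under stated assumptions, not the EXAQ algorithm: the clip value of -8.0 is an arbitrary placeholder (the paper derives an optimal value analytically), uniform quantization is assumed, and the function name is hypothetical.

```python
import numpy as np

def clipped_quantized_softmax(x, clip=-8.0, bits=2):
    """Softmax over the last axis with a clipped, uniformly quantized input.

    Assumptions (not from the paper): `clip` is a hand-picked negative bound
    and the quantizer is uniform with 2**bits levels on [clip, 0].
    """
    x = np.asarray(x, dtype=np.float64)
    # Subtract the row maximum so every input is <= 0, then clip from below.
    shifted = x - x.max(axis=-1, keepdims=True)
    clipped = np.clip(shifted, clip, 0.0)

    # Uniform quantization: integer codes in [-levels, 0].
    levels = 2**bits - 1
    step = -clip / levels
    codes = np.round(clipped / step).astype(int)

    # Phase 1 (exponent): only 2**bits distinct values exist, so use a table.
    lut = np.exp(step * np.arange(-levels, 1))
    e = lut[codes + levels]

    # Phase 2 (accumulation): sum of the exponentials.
    total = e.sum(axis=-1, keepdims=True)

    # Phase 3 (normalization).
    return e / total
```

Because the quantized input takes only 2**bits values, the exponent step reduces to a table lookup; this is the kind of saving that low-bit softmax inputs make possible, though the paper's actual kernels are not reproduced here.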
Tuning Fast Memory Size based on Modeling of Page Migration for Tiered Memory
Shangye Chen, Jin Huang, Shuangyan Yang, Jie Liu, Huaicheng Li, Dimitrios Nikolopoulos, Junhee Ryu, Jinho Baek, Kwangsik Shin, Dong Li
Oct 02 2024 cs.PF arXiv:2410.00328v1

@misc{2410.00328, author = {Shangye Chen and Jin Huang and Shuangyan Yang and Jie Liu and Huaicheng Li and Dimitrios Nikolopoulos and Junhee Ryu and Jinho Baek and Kwangsik Shin and Dong Li}, title = {{T}uning {F}ast {M}emory {S}ize based on {M}odeling of {P}age {M}igration for {T}iered {M}emory}, year = {2024}, eprint = {2410.00328}, note = {arXiv:2410.00328v1} }
Tiered memory, built upon a combination of fast memory and slow memory, provides a cost-effective solution to meet ever-increasing requirements from emerging applications for large memory capacity. Reducing the size of fast memory is valuable to improve memory utilization in production and reduce production costs because fast memory tends to be expensive. However, deciding the fast memory size is challenging because there is a complex interplay between application characterization and the overhead of page migration used to mitigate the impact of limited fast memory capacity. In this paper, we introduce a system, Tuna, to decide fast memory size based on modeling of page migration. Tuna uses micro-benchmarking to model the impact of page migration on application performance using three metrics. Tuna decides the fast memory size based on offline modeling results and limited information on workload telemetry. Evaluating with common big-memory applications and using 5% as the performance loss target, we show that Tuna in combination with a page management system (TPP) saves fast memory by 8.5% on average (up to 16%). This is in contrast to the 5% saving in fast memory reported by Microsoft Pond for the same workloads (BFS and SSSP) and the same performance loss target.
Streaming Data in HPC Workflows Using ADIOS
Greg Eisenhauer, Norbert Podhorszki, Ana Gainaru, Scott Klasky, Philip E. Davis, Manish Parashar, Matthew Wolf, Eric Suchtya, Erick Fredj, Vicente Bolea, Franz Pöschel, Klaus Steiniger, Michael Bussmann, Richard Pausch, Sunita Chandrasekaran
Oct 02 2024 cs.PF arXiv:2410.00178v1

@misc{2410.00178, author = {Greg Eisenhauer and Norbert Podhorszki and Ana Gainaru and Scott Klasky and Philip E.~Davis and Manish Parashar and Matthew Wolf and Eric Suchtya and Erick Fredj and Vicente Bolea and Franz Pöschel and Klaus Steiniger and Michael Bussmann and Richard Pausch and Sunita Chandrasekaran}, title = {{S}treaming {D}ata in {HPC} {W}orkflows {U}sing {ADIOS}}, year = {2024}, eprint = {2410.00178}, note = {arXiv:2410.00178v1} }
The "IO Wall" problem, in which the gap between computation rate and data access rate grows continuously, poses significant problems to scientific workflows which have traditionally relied upon using the filesystem for intermediate storage between workflow stages. One way to avoid this problem in scientific workflows is to stream data directly from producers to consumers and avoid storage entirely. However, the manner in which this is accomplished is key to both performance and usability. This paper presents the Sustainable Staging Transport (SST), an approach which allows direct streaming between traditional file writers and readers with few application changes. SST is an ADIOS "engine", accessible via standard ADIOS APIs, and because ADIOS allows engines to be chosen at run-time, many existing file-oriented ADIOS workflows can utilize SST for direct application-to-application communication without any source code changes. This paper describes the design of SST and presents performance results from various applications that use SST: for feeding model training with simulation data at substantially higher bandwidth than the theoretical limits of Frontier's file system, for strong coupling of separately developed applications for multiphysics multiscale simulation, or for in situ analysis and visualization of data to complete all data processing shortly after the simulation finishes.
How do Practitioners Perceive Energy Consumption on Stack Overflow?
Bihui Jin, Heng Li, Ying Zou
Oct 01 2024 cs.SE cs.PF arXiv:2409.19222v1

@misc{2409.19222, author = {Bihui Jin and Heng Li and Ying Zou}, title = {{H}ow do {P}ractitioners {P}erceive {E}nergy {C}onsumption on {S}tack {O}verflow?}, year = {2024}, eprint = {2409.19222}, note = {arXiv:2409.19222v1} }
Energy consumption of software applications has emerged as a critical issue for practitioners to contemplate in their daily development processes. Previous studies have performed user surveys with a limited number of practitioners to comprehend practitioners' viewpoints on energy consumption. In this paper, we complement prior studies by conducting an empirical analysis of a meticulously curated dataset comprising 985 Stack Overflow (SO) questions concerning energy consumption. These questions reflect real-world energy-related predicaments faced by practitioners in their daily development activities. To understand practitioners' perception of energy consumption, we investigate the intentions behind these questions, their semantic topics, as well as the tag categories associated with these questions. Our empirical study results reveal that (i) the intentions that drive the questioners to initiate posts and ask questions are primarily associated with understanding a concept or how to use an API; (ii) the most prevalent topic related to energy consumption concerns computing resources; (iii) monitoring energy usage poses a challenging issue, and it takes the longest response time to receive a community response to the questions; and (iv) practitioners are apprehensive about energy consumption from different levels, i.e., hardware, operating systems, and programming languages, during the development of the applications. Our work furnishes insights into the issues related to energy consumption faced by practitioners. Our observations raise awareness among practitioners about the impact of energy consumption on developing software systems from different perspectives, such as coding efficiency and energy monitoring, and shed light on future research opportunities to assist practitioners in developing energy-efficient software systems.
ZERNIPAX: A Fast and Accurate Zernike Polynomial Calculator in Python
Yigit Gunsur Elmacioglu, Rory Conlin, Daniel W. Dudt, Dario Panici, Egemen Kolemen
Oct 01 2024 cs.PF arXiv:2409.19156v1

@misc{2409.19156, author = {Yigit Gunsur Elmacioglu and Rory Conlin and Daniel W.~Dudt and Dario Panici and Egemen Kolemen}, title = {{ZERNIPAX}: {A} {F}ast and {A}ccurate {Z}ernike {P}olynomial {C}alculator in {P}ython}, year = {2024}, eprint = {2409.19156}, note = {arXiv:2409.19156v1} }
Zernike Polynomials serve as an orthogonal basis on the unit disc, and have been proven to be effective in optics simulations, astrophysics, and more recently in plasma simulations. Unlike Bessel functions, they maintain finite values at the disc center, ensuring inherent analyticity along the axis. We developed ZERNIPAX, an open-source Python package capable of utilizing CPU/GPUs, leveraging Google's JAX package and available on https://github.com/PlasmaControl/FastZernike.git as well as PyPI. Our implementation of the recursion relation between Jacobi polynomials significantly improves computation time compared to alternative methods by use of parallel computing while still preserving accuracy for mode numbers n>100.
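For readers unfamiliar with the functions involved, the sketch below evaluates the radial part of a Zernike polynomial with the textbook explicit sum. It is not ZERNIPAX's method: the abstract states that the package uses a recursion relation between Jacobi polynomials, and this direct factorial sum is shown only as a reference definition (it loses accuracy at high mode numbers because of cancellation). The function name is hypothetical.

```python
from math import factorial

def zernike_radial(n, m, r):
    """Radial Zernike polynomial R_n^m(r) via the explicit finite sum.

    Reference definition only; numerically unreliable for large n.
    """
    m = abs(m)
    # R_n^m is identically zero unless 0 <= m <= n and n - m is even.
    if n < 0 or m > n or (n - m) % 2:
        return 0.0
    total = 0.0
    for k in range((n - m) // 2 + 1):
        # The coefficient is an exact integer (a multinomial coefficient).
        coeff = factorial(n - k) // (
            factorial(k)
            * factorial((n + m) // 2 - k)
            * factorial((n - m) // 2 - k)
        )
        total += (-1) ** k * coeff * r ** (n - 2 * k)
    return total
```

For example, `zernike_radial(2, 0, r)` reproduces 2r^2 - 1, and every valid mode satisfies R_n^m(1) = 1.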
Cluster-BPI: Efficient Fine-Grain Blind Power Identification for Defending against Hardware Thermal Trojans in Multicore SoCs
Mohamed R. Elshamy, Mehdi Elahi, Ahmad Patooghy, Abdel-Hameed A. Badawy
Sep 30 2024 cs.CR cs.PF eess.SP arXiv:2409.18921v1

@misc{2409.18921, author = {Mohamed R.~Elshamy and Mehdi Elahi and Ahmad Patooghy and Abdel-Hameed A.~Badawy}, title = {{C}luster-{BPI}: {E}fficient {F}ine-{G}rain {B}lind {P}ower {I}dentification for {D}efending against {H}ardware {T}hermal {T}rojans in {M}ulticore {S}o{C}s}, year = {2024}, eprint = {2409.18921}, note = {arXiv:2409.18921v1} }
Modern multicore System-on-Chips (SoCs) feature hardware monitoring mechanisms that measure total power consumption. However, these aggregate measurements are often insufficient for fine-grained thermal and power management. This paper presents an enhanced Clustering Blind Power Identification (ICBPI) approach, designed to improve the sensitivity and robustness of the traditional Blind Power Identification (BPI) method. BPI estimates the power consumption of individual cores and models the thermal behavior of an SoC using only thermal sensor data and total power measurements. The proposed ICBPI approach refines BPI's initialization process, particularly improving the non-negative matrix factorization (NNMF) step, which is critical to the accuracy of BPI. ICBPI introduces density-based spatial clustering of applications with noise (DBSCAN) to better align temperature and power consumption data, thereby providing more accurate power consumption estimates. We validate the ICBPI method through two key tasks. The first task evaluates power estimation accuracy across four different multicore architectures, including a heterogeneous processor. Results show that ICBPI significantly enhances accuracy, reducing error rates by 77.56% compared to the original BPI and by 68.44% compared to the state-of-the-art BPISS method. The second task focuses on improving the detection and localization of malicious thermal sensor attacks in heterogeneous processors. The results demonstrate that ICBPI enhances the security and robustness of multicore SoCs against such attacks.
Toward Greener Matrix Operations by Lossless Compressed Formats
Francesco Tosoni, Philip Bille, Valerio Brunacci, Alessio De Angelis, Paolo Ferragina, Giovanni Manzini
Sep 30 2024 cs.DS cs.PF arXiv:2409.18620v1

@misc{2409.18620, author = {Francesco Tosoni and Philip Bille and Valerio Brunacci and Alessio De Angelis and Paolo Ferragina and Giovanni Manzini}, title = {{T}oward {G}reener {M}atrix {O}perations by {L}ossless {C}ompressed {F}ormats}, year = {2024}, eprint = {2409.18620}, note = {arXiv:2409.18620v1} }
Sparse matrix-vector multiplication (SpMV) is a fundamental operation in machine learning, scientific computing, and graph algorithms. In this paper, we investigate the space, time, and energy efficiency of SpMV using various compressed formats for large sparse matrices, focusing specifically on Boolean matrices and real-valued vectors. Through extensive analysis and experiments conducted on server and edge devices, we found that different matrix compression formats offer distinct trade-offs among space usage, execution time, and energy consumption. Notably, by employing the appropriate compressed format, we can reduce energy consumption by an order of magnitude on both server and single-board computers. Furthermore, our experiments indicate that while data parallelism can enhance execution speed and energy efficiency, achieving simultaneous time and energy efficiency presents partially distinct challenges. Specifically, we show that for certain compression schemes, the optimal degree of parallelism for time does not align with that for energy, thereby challenging prevailing assumptions about a straightforward linear correlation between execution time and energy consumption. Our results have significant implications for software engineers in all domains where SpMV operations are prevalent. They also suggest that similar studies exploring the trade-offs between time, space, and energy for other compressed data structures can substantially contribute to designing more energy-efficient software components.
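To make the setting concrete (a Boolean matrix multiplied by a real-valued vector), here is a minimal sketch using a values-free CSR layout, in which each row stores only the column indices of its nonzeros. Plain CSR is used purely as an illustration; it is not claimed to be one of the compressed formats evaluated in the paper, and both function names are hypothetical.

```python
def bool_csr_from_dense(rows):
    """Build (indptr, indices) for a Boolean matrix given as nested lists."""
    indptr, indices = [0], []
    for row in rows:
        # Record the column index of every nonzero entry in this row.
        indices.extend(j for j, v in enumerate(row) if v)
        indptr.append(len(indices))
    return indptr, indices

def bool_csr_spmv(indptr, indices, x):
    """Compute y = A @ x where A is Boolean and stored without a values array."""
    y = []
    for i in range(len(indptr) - 1):
        # Every stored entry equals 1, so each row is a sum of selected x values.
        y.append(sum(x[j] for j in indices[indptr[i]:indptr[i + 1]]))
    return y
```

Since every nonzero is 1, no multiplications are needed and no values array is stored, which is one reason Boolean matrices lend themselves to compact representations.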
Balanced Splitting: A Framework for Achieving Zero-wait in the Multiserver-job Model
Jonatha Anselmi, Josu Doncel
Sep 30 2024 cs.PF arXiv:2409.18557v1

@misc{2409.18557, author = {Jonatha Anselmi and Josu Doncel}, title = {{B}alanced {S}plitting: {A} {F}ramework for {A}chieving {Z}ero-wait in the {M}ultiserver-job {M}odel}, year = {2024}, eprint = {2409.18557}, note = {arXiv:2409.18557v1} }
We present a new framework for designing nonpreemptive and job-size oblivious scheduling policies in the multiserver-job queueing model. The main requirement is to identify a static and balanced sub-partition of the server set and ensure that the servers in each set of that sub-partition can only handle jobs of a given class and in a first-come first-served order. A job class is determined by the number of servers to which it has exclusive access during its entire execution and the probability distribution of its service time. This approach aims to reduce delays by preventing small jobs from being blocked by larger ones that arrived first, and it is particularly beneficial when the job size variability is small within classes and large across classes. In this setting, we propose a new scheduling policy, Balanced-Splitting. We provide a sufficient condition for the stability of Balanced-Splitting and show that the resulting queueing probability, i.e., the probability that an arriving job needs to wait for processing upon arrival, vanishes in both the subcritical (the load is kept fixed to a constant less than one) and critical (the load approaches one from below) many-server limiting regimes. Crucial to our analysis is a connection with the M/GI/s/s queue and Erlang's loss formula, which allows our analysis to rely on fundamental results from queueing theory. Numerical simulations show that the proposed policy performs better than several preemptive/nonpreemptive size-aware/oblivious policies in various practical scenarios. This is also confirmed by simulations running on real traces from High Performance Computing (HPC) workloads. The delays induced by Balanced-Splitting are also competitive with those induced by state-of-the-art policies such as First-Fit-SRPT and ServerFilling-SRPT, though our approach has the advantage of requiring neither preemption nor knowledge of job sizes.
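Erlang's loss formula, mentioned above, gives the blocking probability of an M/GI/s/s queue and depends on the service-time distribution only through its mean. A minimal sketch of the standard numerically stable recursion follows; the function name is hypothetical, and how the paper applies the formula to Balanced-Splitting is not reproduced here.

```python
def erlang_b(servers, offered_load):
    """Blocking probability of an M/GI/s/s queue (Erlang B).

    `offered_load` is the arrival rate times the mean service time.
    Uses the recursion B(0) = 1, B(k) = a*B(k-1) / (k + a*B(k-1)).
    """
    if servers < 0 or offered_load < 0:
        raise ValueError("servers and offered_load must be non-negative")
    b = 1.0
    for k in range(1, servers + 1):
        b = offered_load * b / (k + offered_load * b)
    return b
```

The recursion avoids the large factorials and powers that appear in the closed form, so it remains usable for large server counts.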
VectorSearch: Enhancing Document Retrieval with Semantic Embeddings and Optimized Search
Solmaz Seyed Monir, Irene Lau, Shubing Yang, Dongfang Zhao
Sep 27 2024 cs.IR cs.AI cs.DB cs.LG cs.PF arXiv:2409.17383v1

@misc{2409.17383, author = {Solmaz Seyed Monir and Irene Lau and Shubing Yang and Dongfang Zhao}, title = {{V}ector{S}earch: {E}nhancing {D}ocument {R}etrieval with {S}emantic {E}mbeddings and {O}ptimized {S}earch}, year = {2024}, eprint = {2409.17383}, note = {arXiv:2409.17383v1} }
Traditional retrieval methods have been essential for assessing document similarity but struggle with capturing semantic nuances. Despite advancements in latent semantic analysis (LSA) and deep learning, achieving comprehensive semantic understanding and accurate retrieval remains challenging due to high dimensionality and semantic gaps. The above challenges call for new techniques to effectively reduce the dimensions and close the semantic gaps. To this end, we propose VectorSearch, which leverages advanced algorithms, embeddings, and indexing techniques for refined retrieval. By utilizing innovative multi-vector search operations and encoding searches with advanced language models, our approach significantly improves retrieval accuracy. Experiments on real-world datasets show that VectorSearch outperforms baseline metrics, demonstrating its efficacy for large-scale retrieval tasks.
EfiMon: A Process Analyser for Granular Power Consumption Prediction
Luis G. León-Vega, Niccolò Tosato, Stefano Cozzini
Sep 27 2024 cs.DC cs.PF arXiv:2409.17368v1

@misc{2409.17368, author = {Luis G.~León-Vega and Niccolò Tosato and Stefano Cozzini}, title = {{E}fi{M}on: {A} {P}rocess {A}nalyser for {G}ranular {P}ower {C}onsumption {P}rediction}, year = {2024}, eprint = {2409.17368}, note = {arXiv:2409.17368v1} }
High-performance computing (HPC) and supercomputing are critical in Artificial Intelligence (AI) research, development, and deployment. The extensive use of supercomputers for training complex AI models, which can take from days to months, raises significant concerns about energy consumption and carbon emissions. Traditional methods for estimating the energy consumption of HPC workloads rely on metering reports from computing nodes' power supply units, assuming exclusive use of the entire node. This assumption is increasingly untenable with the advent of next-generation supercomputers that share resources to accelerate workloads, as seen in initiatives like Acceleration as a Service (XaaS) and cloud computing. This paper introduces EfiMon, an agnostic and non-invasive tool designed to extract detailed information about process execution, including instructions executed within specific time windows and CPU and RAM usage. Additionally, it captures comprehensive system metrics, such as power consumption reported by CPU sockets and PSUs. This data enables the development of prediction models to estimate the energy consumption of individual processes without requiring isolation. Using a regression-based mathematical model, our tool is able to estimate the power consumption of single processes in isolated and shared resource environments. In shared scenarios, the model demonstrates robust performance, deviating by a maximum of 2.2% on Intel-based machines and 4.4% on AMD systems compared to non-shared cases. This accuracy showcases EfiMon's potential for enhancing energy accounting in supercomputing, contributing to more efficient and energy-aware optimisation strategies in HPC.
Performance and scaling of the LFRic weather and climate model on different generations of HPE Cray EX supercomputers
J. Mark Bull, Andrew Coughtrie, Deva Deeptimahanti, Mark Hedley, Caoimhín Laoide-Kemp, Christopher Maynard, Harry Shepherd, Sebastiaan van de Bund, Michèle Weiland, Benjamin Went
Sep 25 2024 cs.DC cs.PF arXiv:2409.15859v1

@misc{2409.15859, author = {J.~Mark Bull and Andrew Coughtrie and Deva Deeptimahanti and Mark Hedley and Caoimhín Laoide-Kemp and Christopher Maynard and Harry Shepherd and Sebastiaan van de Bund and Michèle Weiland and Benjamin Went}, title = {{P}erformance and scaling of the {LFR}ic weather and climate model on different generations of {HPE} {C}ray {EX} supercomputers}, year = {2024}, eprint = {2409.15859}, note = {arXiv:2409.15859v1} }
PDF
This study presents scaling results and a performance analysis across different supercomputers and compilers for the Met Office weather and climate model, LFRic. The model is shown to scale to large numbers of nodes, meeting its design criterion of exploiting parallelism to achieve good scaling. The model is written in a Domain-Specific Language embedded in modern Fortran, and uses a Domain-Specific Compiler, PSyclone, to generate the parallel code. The performance analysis shows the effect of the choice of algorithm, such as redundant computation, and of scaling with OpenMP threads. The analysis can be used to motivate a discussion of future work to improve the OpenMP performance of other parts of the code. Finally, an analysis of the performance tuning of the I/O server, XIOS, is presented.
FRSZ2 for In-Register Block Compression Inside GMRES on GPUs
Thomas Grützmacher, Robert Underwood, Sheng Di, Franck Cappello, Hartwig Anzt
Sep 25 2024 cs.PF cs.DS arXiv:2409.15468v1

@misc{2409.15468, author = {Thomas Grützmacher and Robert Underwood and Sheng Di and Franck Cappello and Hartwig Anzt}, title = {{FRSZ}2 for {I}n-{R}egister {B}lock {C}ompression {I}nside {GMRES} on {GPU}s}, year = {2024}, eprint = {2409.15468}, note = {arXiv:2409.15468v1} }
PDF
The performance of the GMRES iterative solver on GPUs is limited by the GPU main memory bandwidth. Compressed Basis GMRES outperforms GMRES by storing the Krylov basis in low precision, thereby reducing memory access. An open question is whether compression techniques that are more sophisticated than casting to low precision can enable large runtime savings while preserving the accuracy of the final results. This paper presents the lightweight in-register compressor FRSZ2, which can decompress at the bandwidth speed of a modern NVIDIA H100 GPU. In an experimental evaluation, we demonstrate that using FRSZ2 instead of low precision for compression of the Krylov basis can bring larger runtime benefits without impacting final accuracy.
Deploying Open-Source Large Language Models: A performance Analysis
Yannis Bendi-Ouis, Dan Dutarte, Xavier Hinaut
Sep 24 2024 cs.PF cs.AI cs.LG arXiv:2409.14887v2

@misc{2409.14887, author = {Yannis Bendi-Ouis and Dan Dutarte and Xavier Hinaut}, title = {{D}eploying {O}pen-{S}ource {L}arge {L}anguage {M}odels: {A} performance {A}nalysis}, year = {2024}, eprint = {2409.14887}, note = {arXiv:2409.14887v2} }
PDF
Since the release of ChatGPT in November 2022, large language models (LLMs) have seen considerable success, including in the open-source community, with many open-weight models available. However, the requirements to deploy such a service are often unknown and difficult to evaluate in advance. To facilitate this process, we conducted numerous tests at the Centre Inria de l'Université de Bordeaux. In this article, we propose a comparison of the performance of several models of different sizes (mainly Mistral and LLaMa) depending on the available GPUs, using vLLM, a Python library designed to optimize the inference of these models. Our results provide valuable information for private and public groups wishing to deploy LLMs, allowing them to evaluate the performance of different models based on their available hardware. This study thus contributes to facilitating the adoption and use of these large language models in various application domains.
Solving Combinatorial Optimization Problems on a Photonic Quantum Computer
Mateusz Slysz, Krzysztof Kurowski, Grzegorz Waligóra
Sep 24 2024 quant-ph cs.PF arXiv:2409.13781v1

@misc{2409.13781, author = {Mateusz Slysz and Krzysztof Kurowski and Grzegorz Waligóra}, title = {{S}olving {C}ombinatorial {O}ptimization {P}roblems on a {P}hotonic {Q}uantum {C}omputer}, year = {2024}, eprint = {2409.13781}, note = {arXiv:2409.13781v1} }
PDF
Combinatorial optimization problems pose significant computational challenges across various fields, from logistics to cryptography. Traditional computational methods often struggle with their exponential complexity, motivating exploration into alternative paradigms such as quantum computing. In this paper, we investigate the application of photonic quantum computing to solve combinatorial optimization problems. Leveraging the principles of quantum mechanics, we demonstrate how photonic quantum computers can efficiently explore solution spaces and identify optimal solutions for a range of combinatorial problems. We provide an overview of quantum algorithms tailored for combinatorial optimization for different quantum architectures (boson sampling, quantum annealing and gate-based quantum computing). Additionally, we discuss the advantages and challenges of implementing those algorithms on photonic quantum hardware. Through experiments run on an 8-qumode photonic quantum device, as well as numerical simulations, we evaluate the performance of photonic quantum computers in solving representative combinatorial optimization problems, such as the Max-Cut problem and the Job Shop Scheduling Problem.
RAVE: RISC-V Analyzer of Vector Executions, a QEMU tracing plugin
Pablo Vizcaino, Filippo Mantovani, Jesus Labarta, Roger Ferrer
Sep 23 2024 cs.PF arXiv:2409.13639v1

@misc{2409.13639, author = {Pablo Vizcaino and Filippo Mantovani and Jesus Labarta and Roger Ferrer}, title = {{RAVE}: {RISC}-{V} {A}nalyzer of {V}ector {E}xecutions, a {QEMU} tracing plugin}, year = {2024}, eprint = {2409.13639}, note = {arXiv:2409.13639v1} }
PDF
Simulators are crucial during the development of a chip, like the RISC-V accelerator designed in the European Processor Initiative project. In this paper, we showcase the limitations of the current simulation solutions in the project and propose using QEMU with RAVE, a plugin we implement and describe in this document. This methodology can rapidly simulate and analyze applications running on the v1.0 and v0.7.1 RISC-V V-extension. Our plugin reports the vector and scalar instructions alongside useful information such as the vector length being used, the single-element width, and the register usage, among other vectorization metrics. We provide an API that the simulated application can use to control the RAVE plugin, as well as the capability to generate vectorization traces that can be analyzed using Paraver. Finally, we demonstrate the efficiency of our solution across different evaluated machines and against other simulation methods used in the European Processor Accelerator (EPAC) project.
Stabl: Blockchain Fault Tolerance
Vincent Gramoli, Rachid Guerraoui, Andrei Lebedev, Gauthier Voron
Sep 23 2024 cs.DC cs.PF arXiv:2409.13142v1

@misc{2409.13142, author = {Vincent Gramoli and Rachid Guerraoui and Andrei Lebedev and Gauthier Voron}, title = {{S}tabl: {B}lockchain {F}ault {T}olerance}, year = {2024}, eprint = {2409.13142}, note = {arXiv:2409.13142v1} }
PDF
Blockchains promise to make online services more fault tolerant due to their inherently distributed nature. Their ability to execute arbitrary programs in different geo-distributed regions and on diverse operating systems makes them an alternative of choice to our dependence on unique software whose recent failure affected 8.5 million machines. As of today, however, it remains unclear whether blockchains can truly tolerate failures. In this paper, we assess the fault tolerance of blockchains. To this end, we inject failures in controlled deployments of five modern blockchain systems, namely Algorand, Aptos, Avalanche, Redbelly and Solana. We introduce a novel sensitivity metric, interesting in its own right, as the difference between the integrals of two cumulative distribution functions, one obtained in a baseline environment and one obtained in an adversarial environment. Our results indicate that (i) all blockchains except Redbelly are highly impacted by the failure of a small part of their network, (ii) Avalanche and Redbelly benefit from the redundant information needed for Byzantine fault tolerance while others are hampered by it, and more dramatically (iii) Avalanche and Solana cannot recover from localised transient failures.
Optimization of a Radiofrequency Ablation FEM Application Using Parallel Sparse Solvers
Marcelo Cogo Miletto, Claudio Schepke, Lucas Mello Schnorr
Sep 23 2024 cs.DC cs.PF arXiv:2409.13036v1

@misc{2409.13036, author = {Marcelo Cogo Miletto and Claudio Schepke and Lucas Mello Schnorr}, title = {{O}ptimization of a {R}adiofrequency {A}blation {FEM} {A}pplication {U}sing {P}arallel {S}parse {S}olvers}, year = {2024}, eprint = {2409.13036}, howpublished = {The 2020 International Conference on High Performance Computing and Simulation (2020)}, note = {arXiv:2409.13036v1} }
PDF
Finite element method applications are a common approach to simulate a handful of phenomena but can take a lot of computing power, causing long waiting times to produce precise results. The radiofrequency ablation finite element method is an application to simulate the medical procedure of radiofrequency ablation, a minimally invasive liver cancer treatment. The application runs sequentially and can take up to 20 hours of execution to generate 15 minutes of simulation results. Most of this time arises from the need to solve a sparse system of linear equations. In this work, we accelerate this application by using three sparse solver packages (MAGMA, cuSOLVER, and QRMumps), including direct and iterative methods over different multicore and GPU architectures. We conducted a numerical result analysis to assess the solution quality provided by the distinct solvers and their configurations, proposing the use of the peak signal-to-noise ratio metric. We were able to reduce the application execution time by up to 40 times compared to the original sequential version while keeping a similar numerical quality for the results.
Performance and Power: Systematic Evaluation of AI Workloads on Accelerators with CARAML
Chelsea Maria John, Stepan Nassyr, Carolin Penke, Andreas Herten
Sep 23 2024 cs.AR cs.AI cs.DC cs.LG cs.PF arXiv:2409.12994v1

@misc{2409.12994, author = {Chelsea Maria John and Stepan Nassyr and Carolin Penke and Andreas Herten}, title = {{P}erformance and {P}ower: {S}ystematic {E}valuation of {AI} {W}orkloads on {A}ccelerators with {CARAML}}, year = {2024}, eprint = {2409.12994}, note = {arXiv:2409.12994v1} }
PDF
The rapid advancement of machine learning (ML) technologies has driven the development of specialized hardware accelerators designed to facilitate more efficient model training. This paper introduces the CARAML benchmark suite, which is employed to assess performance and energy consumption during the training of transformer-based large language models and computer vision models on a range of hardware accelerators, including systems from NVIDIA, AMD, and Graphcore. CARAML provides a compact, automated, extensible, and reproducible framework for assessing the performance and energy of ML workloads across various novel hardware architectures. The design and implementation of CARAML, along with a custom power measurement tool called jpwr, are discussed in detail.
