subscribe to arXiv mailings

On a Foundation Model for Operating Systems

Authors: Divyanshu Saxena, Nihal Sharma, Donghyun Kim, Rohit Dwivedula, Jiayi Chen, Chenxi Yang, Sriram Ravula, Zichao Hu, Aditya Akella, Sebastian Angel, Joydeep Biswas, Swarat Chaudhuri, Isil Dillig, Alex Dimakis, P. Brighten Godfrey, Daehyeok Kim, Chris Rossbach, Gang Wang

Abstract: This paper lays down the research agenda for a domain-specific foundation model for operating systems (OSes). Our case for a foundation model revolves around the observations that several OS components such as CPU, memory, and network subsystems are interrelated and that OS traces offer the ideal dataset for a foundation model to grasp the intricacies of diverse OS components and their behavior in… ▽ More This paper lays down the research agenda for a domain-specific foundation model for operating systems (OSes). Our case for a foundation model revolves around the observations that several OS components such as CPU, memory, and network subsystems are interrelated and that OS traces offer the ideal dataset for a foundation model to grasp the intricacies of diverse OS components and their behavior in varying environments and workloads. We discuss a wide range of possibilities that then arise, from employing foundation models as policy agents to utilizing them as generators and predictors to assist traditional OS control algorithms. Our hope is that this paper spurs further research into OS foundation models and creating the next generation of operating systems for the evolving computing landscape. △ Less

Submitted 12 December, 2023; originally announced December 2023.

Comments: Machine Learning for Systems Workshop at 37th NeurIPS Conference, 2023, New Orleans, LA, USA

arXiv:2311.02800 [pdf, other]

Kivi: Verification for Cluster Management

Authors: Bingzhe Liu, Gangmuk Lim, Ryan Beckett, P. Brighten Godfrey

Abstract: Modern cloud infrastructure is powered by cluster management systems such as Kubernetes and Docker Swarm. While these systems seek to minimize users' operational burden, the complex, dynamic, and non-deterministic nature of these systems makes them hard to reason about, potentially leading to failures ranging from performance degradation to outages. We present Kivi, the first system for verifying… ▽ More Modern cloud infrastructure is powered by cluster management systems such as Kubernetes and Docker Swarm. While these systems seek to minimize users' operational burden, the complex, dynamic, and non-deterministic nature of these systems makes them hard to reason about, potentially leading to failures ranging from performance degradation to outages. We present Kivi, the first system for verifying controllers and their configurations in cluster management systems. Kivi focuses on the popular system Kubernetes, and models its controllers and events into processes whereby their interleavings are exhaustively checked via model checking. Central to handling autoscaling and large-scale deployments is our design that seeks to find violations in a smaller and reduced topology. We also develop several model optimizations in Kivi to scale to large clusters. We show that Kivi is effective and accurate in finding issues in realistic and complex scenarios and showcase two new issues in Kubernetes controller source code. △ Less

Submitted 5 November, 2023; originally announced November 2023.

Comments: 18 pages

arXiv:2305.03348 [pdf, other]

Flock: Accurate network fault localization at scale

Authors: Vipul Harsh, Tong Meng, Kapil Agrawal, P. Brighten Godfrey

Abstract: Inferring the root cause of failures among thousands of components in a data center network is challenging, especially for "gray" failures that are not reported directly by switches. Faults can be localized through end-to-end measurements, but past localization schemes are either too slow for large-scale networks or sacrifice accuracy. We describe Flock, a network fault localization algorithm and… ▽ More Inferring the root cause of failures among thousands of components in a data center network is challenging, especially for "gray" failures that are not reported directly by switches. Faults can be localized through end-to-end measurements, but past localization schemes are either too slow for large-scale networks or sacrifice accuracy. We describe Flock, a network fault localization algorithm and system that achieves both high accuracy and speed at datacenter scale. Flock uses a probabilistic graphical model (PGM) to achieve high accuracy, coupled with new techniques to dramatically accelerate inference in discrete-valued Bayesian PGMs. Large-scale simulations and experiments in a hardware testbed show Flock speeds up inference by >10000x compared to past PGM methods, and improves accuracy over the best previous datacenter fault localization approaches, reducing inference error by 1.19-11x on the same input telemetry, and by 1.2-55x after incorporating passive telemetry. We also prove Flock's inference is optimal in restricted settings △ Less

Submitted 5 May, 2023; originally announced May 2023.

Comments: To appear in ACM PACMNET, Vol 1, June 2023

arXiv:1911.02128 [pdf, other]

Plankton: Scalable network configuration verification through model checking

Authors: Santhosh Prabhu, Kuan-Yen Chou, Ali Kheradmand, P. Brighten Godfrey, Matthew Caesar

Abstract: Network configuration verification enables operators to ensure that the network will behave as intended, prior to deployment of their configurations. Although techniques ranging from graph algorithms to SMT solvers have been proposed, scalable configuration verification with sufficient protocol support continues to be a challenge. In this paper, we show that by combining equivalence partitioning w… ▽ More Network configuration verification enables operators to ensure that the network will behave as intended, prior to deployment of their configurations. Although techniques ranging from graph algorithms to SMT solvers have been proposed, scalable configuration verification with sufficient protocol support continues to be a challenge. In this paper, we show that by combining equivalence partitioning with explicit-state model checking, network configuration verification can be scaled significantly better than the state of the art, while still supporting a rich set of protocol features. We propose Plankton, which uses symbolic partitioning to manage large header spaces and efficient model checking to exhaustively explore protocol behavior. Thanks to a highly effective suite of optimizations including state hashing, partial order reduction, and policy-based pruning, Plankton successfully verifies policies in industrial-scale networks quickly and compactly, at times reaching a 10000$\times$ speedup compared to the state of the art. △ Less

Submitted 5 November, 2019; originally announced November 2019.

Comments: NSDI'20

arXiv:1811.10737 [pdf, other]

Dissecting Latency in the Internet's Fiber Infrastructure

Authors: Ilker Nadi Bozkurt, Waqar Aqeel, Debopam Bhattacherjee, Balakrishnan Chandrasekaran, Philip Brighten Godfrey, Gregory Laughlin, Bruce M. Maggs, Ankit Singla

Abstract: The recent publication of the `InterTubes' map of long-haul fiber-optic cables in the contiguous United States invites an exciting question: how much faster would the Internet be if routes were chosen to minimize latency? Previous measurement campaigns suggest the following rule of thumb for estimating Internet latency: multiply line-of-sight distance by 2.1, then divide by the speed of light in f… ▽ More The recent publication of the `InterTubes' map of long-haul fiber-optic cables in the contiguous United States invites an exciting question: how much faster would the Internet be if routes were chosen to minimize latency? Previous measurement campaigns suggest the following rule of thumb for estimating Internet latency: multiply line-of-sight distance by 2.1, then divide by the speed of light in fiber. But a simple computation of shortest-path lengths through the conduits in the InterTubes map suggests that the conversion factor for all pairs of the 120 largest population centers in the U.S.\ could be reduced from 2.1 to 1.3, in the median, even using less than half of the links. To determine whether an overlay network could be used to provide shortest paths, and how well it would perform, we used the diverse server deployment of a CDN to measure latency across individual conduits. We were surprised to find, however, that latencies are sometimes much higher than would be predicted by conduit length alone. To understand why, we report findings from our analysis of network latency data from the backbones of two Tier-1 ISPs, two scientific and research networks, and the recently built fiber backbone of a CDN. △ Less

Submitted 26 November, 2018; originally announced November 2018.

arXiv:1811.00212 [pdf, other]

Expander Datacenters: From Theory to Practice

Authors: Vipul Harsh, Sangeetha Abdu Jyothi, Inderdeep Singh, P. Brighten Godfrey

Abstract: Recent work has shown that expander-based data center topologies are robust and can yield superior performance over Clos topologies. However, to achieve these benefits, previous proposals use routing and transport schemes that impede quick industry adoption. In this paper, we examine if expanders can be effective for the technology and environments practical in today's data centers, including the… ▽ More Recent work has shown that expander-based data center topologies are robust and can yield superior performance over Clos topologies. However, to achieve these benefits, previous proposals use routing and transport schemes that impede quick industry adoption. In this paper, we examine if expanders can be effective for the technology and environments practical in today's data centers, including the use of traditional protocols, at both small and large scale while complying with common practices such as over-subscription. We study bandwidth, latency and burst tolerance of topologies, highlighting pitfalls of previous topology comparisons. We consider several other metrics of interest: packet loss during failures, queue occupancy and topology degradation. Our experiments show that expanders can realize 3x more throughput than an equivalent fat tree, and 1.5x more throughput than an equivalent leaf-spine topology, for a wide range of scenarios, with only traditional protocols. We observe that expanders achieve lower flow completion times, are more resilient to bursty load conditions like incast and outcast and degrade more gracefully with increasing load. Our results are based on extensive simulations and experiments on a hardware testbed with realistic topologies and real traffic patterns. △ Less

Submitted 31 October, 2018; originally announced November 2018.

Comments: 15 pages, 17 figures

arXiv:1810.03259 [pdf, other]

Internet Congestion Control via Deep Reinforcement Learning

Authors: Nathan Jay, Noga H. Rotman, P. Brighten Godfrey, Michael Schapira, Aviv Tamar

Abstract: We present and investigate a novel and timely application domain for deep reinforcement learning (RL): Internet congestion control. Congestion control is the core networking task of modulating traffic sources' data-transmission rates to efficiently utilize network capacity, and is the subject of extensive attention in light of the advent of Internet services such as live video, virtual reality, In… ▽ More We present and investigate a novel and timely application domain for deep reinforcement learning (RL): Internet congestion control. Congestion control is the core networking task of modulating traffic sources' data-transmission rates to efficiently utilize network capacity, and is the subject of extensive attention in light of the advent of Internet services such as live video, virtual reality, Internet-of-Things, and more. We show that casting congestion control as RL enables training deep network policies that capture intricate patterns in data traffic and network conditions, and leverage this to outperform the state-of-the-art. We also highlight significant challenges facing real-world adoption of RL-based congestion control, including fairness, safety, and generalization, which are not trivial to address within conventional RL formalism. To facilitate further research and reproducibility of our results, we present a test suite for RL-guided congestion control based on the OpenAI Gym interface. △ Less

Submitted 21 May, 2019; v1 submitted 7 October, 2018; originally announced October 2018.

Comments: 10 pages, accepted to ICML 2019

arXiv:1809.10897 [pdf, other]

cISP: A Speed-of-Light Internet Service Provider

Authors: Debopam Bhattacherjee, Sangeetha Abdu Jyothi, Ilker Nadi Bozkurt, Muhammad Tirmazi, Waqar Aqeel, Anthony Aguirre, Balakrishnan Chandrasekaran, P. Brighten Godfrey, Gregory P. Laughlin, Bruce M. Maggs, Ankit Singla

Abstract: Low latency is a requirement for a variety of interactive network applications. The Internet, however, is not optimized for latency. We thus explore the design of cost-effective wide-area networks that move data over paths very close to great-circle paths, at speeds very close to the speed of light in vacuum. Our cISP design augments the Internet's fiber with free-space wireless connectivity. cISP… ▽ More Low latency is a requirement for a variety of interactive network applications. The Internet, however, is not optimized for latency. We thus explore the design of cost-effective wide-area networks that move data over paths very close to great-circle paths, at speeds very close to the speed of light in vacuum. Our cISP design augments the Internet's fiber with free-space wireless connectivity. cISP addresses the fundamental challenge of simultaneously providing low latency and scalable bandwidth, while accounting for numerous practical factors ranging from transmission tower availability to packet queuing. We show that instantiations of cISP across the contiguous United States and Europe would achieve mean latencies within 5% of that achievable using great-circle paths at the speed of light, over medium and long distances. Further, we estimate that the economic value from such networks would substantially exceed their expense. △ Less

Submitted 10 October, 2018; v1 submitted 28 September, 2018; originally announced September 2018.

arXiv:1505.03449 [pdf, other]

Towards a Speed of Light Internet

Authors: Ankit Singla, Balakrishnan Chandrasekaran, P. Brighten Godfrey, Bruce Maggs

Abstract: In principle, a network can transfer data at nearly the speed of light. Today's Internet, however, is much slower: our measurements show that latencies are typically more than one, and often more than two orders of magnitude larger than the lower bound implied by the speed of light. Closing this gap would not only add value to today's Internet applications, but might also open the door to exciting… ▽ More In principle, a network can transfer data at nearly the speed of light. Today's Internet, however, is much slower: our measurements show that latencies are typically more than one, and often more than two orders of magnitude larger than the lower bound implied by the speed of light. Closing this gap would not only add value to today's Internet applications, but might also open the door to exciting new applications. Thus, we propose a grand challenge for the networking research community: building a speed-of-light Internet. Towards addressing this goal, we begin by investigating the causes of latency inflation in the Internet across the network stack. Our analysis reveals that while protocol overheads, which have dominated the community's attention, are indeed important, infrastructural inefficiencies are a significant and under-explored problem. Thus, we propose a radical, yet surprisingly low-cost approach to mitigating latency inflation at the lowest layers and building a nearly speed-of-light Internet infrastructure. △ Less

Submitted 13 May, 2015; originally announced May 2015.

arXiv:1402.2531 [pdf, other]

Measuring and Understanding Throughput of Network Topologies

Authors: Sangeetha Abdu Jyothi, Ankit Singla, P. Brighten Godfrey, Alexandra Kolla

Abstract: High throughput is of particular interest in data center and HPC networks. Although myriad network topologies have been proposed, a broad head-to-head comparison across topologies and across traffic patterns is absent, and the right way to compare worst-case throughput performance is a subtle problem. In this paper, we develop a framework to benchmark the throughput of network topologies, using… ▽ More High throughput is of particular interest in data center and HPC networks. Although myriad network topologies have been proposed, a broad head-to-head comparison across topologies and across traffic patterns is absent, and the right way to compare worst-case throughput performance is a subtle problem. In this paper, we develop a framework to benchmark the throughput of network topologies, using a two-pronged approach. First, we study performance on a variety of synthetic and experimentally-measured traffic matrices (TMs). Second, we show how to measure worst-case throughput by generating a near-worst-case TM for any given topology. We apply the framework to study the performance of these TMs in a wide range of network topologies, revealing insights into the performance of topologies with scaling, robustness of performance across TMs, and the effect of scattered workload placement. Our evaluation code is freely available. △ Less

Submitted 14 November, 2016; v1 submitted 11 February, 2014; originally announced February 2014.

arXiv:1309.7066 [pdf, other]

High Throughput Data Center Topology Design

Authors: Ankit Singla, P. Brighten Godfrey, Alexandra Kolla

Abstract: With high throughput networks acquiring a crucial role in supporting data-intensive applications, a variety of data center network topologies have been proposed to achieve high capacity at low cost. While this literature explores a large number of design points, even in the limited case of a network of identical switches, no proposal has been able to claim any notion of optimality. The case of het… ▽ More With high throughput networks acquiring a crucial role in supporting data-intensive applications, a variety of data center network topologies have been proposed to achieve high capacity at low cost. While this literature explores a large number of design points, even in the limited case of a network of identical switches, no proposal has been able to claim any notion of optimality. The case of heterogeneous networks, incorporating multiple line-speeds and port-counts as data centers grow over time, introduces even greater complexity. In this paper, we present the first non-trivial upper-bound on network throughput under uniform traffic patterns for any topology with identical switches. We then show that random graphs achieve throughput surprisingly close to this bound, within a few percent at the scale of a few thousand servers. Apart from demonstrating that homogeneous topology design may be reaching its limits, this result also motivates our use of random graphs as building blocks to explore the design of heterogeneous networks. Given a heterogeneous pool of network switches, through experiments and analysis, we explore how the distribution of servers across switches and the interconnection of switches affect network throughput. We apply these insights to a real-world heterogeneous data center topology, VL2, demonstrating as much as 43% higher throughput with the same equipment. △ Less

Submitted 12 February, 2014; v1 submitted 26 September, 2013; originally announced September 2013.

Comments: 15 pages

arXiv:1309.0874 [pdf]

Shortest Paths in Microseconds

Authors: Rachit Agarwal, Matthew Caesar, P. Brighten Godfrey, Ben Y. Zhao

Abstract: Computing shortest paths is a fundamental primitive for several social network applications including socially-sensitive ranking, location-aware search, social auctions and social network privacy. Since these applications compute paths in response to a user query, the goal is to minimize latency while maintaining feasible memory requirements. We present ASAP, a system that achieves this goal by ex… ▽ More Computing shortest paths is a fundamental primitive for several social network applications including socially-sensitive ranking, location-aware search, social auctions and social network privacy. Since these applications compute paths in response to a user query, the goal is to minimize latency while maintaining feasible memory requirements. We present ASAP, a system that achieves this goal by exploiting the structure of social networks. ASAP preprocesses a given network to compute and store a partial shortest path tree (PSPT) for each node. The PSPTs have the property that for any two nodes, each edge along the shortest path is with high probability contained in the PSPT of at least one of the nodes. We show that the structure of social networks enable the PSPT of each node to be an extremely small fraction of the entire network; hence, PSPTs can be stored efficiently and each shortest path can be computed extremely quickly. For a real network with 5 million nodes and 69 million edges, ASAP computes a shortest path for most node pairs in less than 49 microseconds per pair. ASAP, unlike any previous technique, also computes hundreds of paths (along with corresponding distances) between any node pair in less than 100 microseconds. Finally, ASAP admits efficient implementation on distributed programming frameworks like MapReduce. △ Less

Submitted 3 September, 2013; originally announced September 2013.

Comments: Extended version of WOSN'12 paper: new techniques (reduced memory, faster computations), distributed (MapReduce) algorithm, multiple paths between a source-destination pair

arXiv:1306.3707 [pdf, other]

Low latency via redundancy

Authors: Ashish Vulimiri, P. Brighten Godfrey, Radhika Mittal, Justine Sherry, Sylvia Ratnasamy, Scott Shenker

Abstract: Low latency is critical for interactive networked applications. But while we know how to scale systems to increase capacity, reducing latency --- especially the tail of the latency distribution --- can be much more difficult. In this paper, we argue that the use of redundancy is an effective way to convert extra capacity into reduced latency. By initiating redundant operations across diverse resou… ▽ More Low latency is critical for interactive networked applications. But while we know how to scale systems to increase capacity, reducing latency --- especially the tail of the latency distribution --- can be much more difficult. In this paper, we argue that the use of redundancy is an effective way to convert extra capacity into reduced latency. By initiating redundant operations across diverse resources and using the first result which completes, redundancy improves a system's latency even under exceptional conditions. We study the tradeoff with added system utilization, characterizing the situations in which replicating all tasks reduces mean latency. We then demonstrate empirically that replicating all operations can result in significant mean and tail latency reduction in real-world systems including DNS queries, database servers, and packet forwarding within networks. △ Less

Submitted 16 June, 2013; originally announced June 2013.

arXiv:1306.3534 [pdf, other]

A cost-benefit analysis of low latency via added utilization

Authors: Ashish Vulimiri, P. Brighten Godfrey, Sri Varsha Gorge, Zitian Liu, Scott Shenker

Abstract: Several recently proposed techniques achieve latency reduction by trading it off for some amount of additional bandwidth usage. But how would one quantify whether the tradeoff is actually beneficial in a given system? We develop an economic cost vs. benefit analysis for answering this question. We use the analysis to derive a benchmark for wide-area client-server applications, and demonstrate how… ▽ More Several recently proposed techniques achieve latency reduction by trading it off for some amount of additional bandwidth usage. But how would one quantify whether the tradeoff is actually beneficial in a given system? We develop an economic cost vs. benefit analysis for answering this question. We use the analysis to derive a benchmark for wide-area client-server applications, and demonstrate how it can be applied to reason about a particular latency saving technique --- redundant DNS requests. △ Less

Submitted 4 December, 2014; v1 submitted 14 June, 2013; originally announced June 2013.

arXiv:1302.6156 [pdf, other]

Scalable Routing on Flat Names

Authors: Ankit Singla, P. Brighten Godfrey, Kevin Fall, Gianluca Iannaccone, Sylvia Ratnasamy

Abstract: We introduce a protocol which routes on flat, location-independent identifiers with guaranteed scalability and low stretch. Our design builds on theoretical advances in the area of compact routing, and is the first to realize these guarantees in a dynamic distributed setting. We introduce a protocol which routes on flat, location-independent identifiers with guaranteed scalability and low stretch. Our design builds on theoretical advances in the area of compact routing, and is the first to realize these guarantees in a dynamic distributed setting. △ Less

Submitted 25 February, 2013; originally announced February 2013.

Comments: 13 pages

Journal ref: Extends our ACM CoNEXT 2010 paper with proofs for the theoretical results

arXiv:1206.2057 [pdf, other]

Finishing Flows Quickly with Preemptive Scheduling

Authors: Chi-Yao Hong, Matthew Caesar, P. Brighten Godfrey

Abstract: Today's data centers face extreme challenges in providing low latency. However, fair sharing, a principle commonly adopted in current congestion control protocols, is far from optimal for satisfying latency requirements. We propose Preemptive Distributed Quick (PDQ) flow scheduling, a protocol designed to complete flows quickly and meet flow deadlines. PDQ enables flow preemption to approximate… ▽ More Today's data centers face extreme challenges in providing low latency. However, fair sharing, a principle commonly adopted in current congestion control protocols, is far from optimal for satisfying latency requirements. We propose Preemptive Distributed Quick (PDQ) flow scheduling, a protocol designed to complete flows quickly and meet flow deadlines. PDQ enables flow preemption to approximate a range of scheduling disciplines. For example, PDQ can emulate a shortest job first algorithm to give priority to the short flows by pausing the contending flows. PDQ borrows ideas from centralized scheduling disciplines and implements them in a fully distributed manner, making it scalable to today's data centers. Further, we develop a multipath version of PDQ to exploit path diversity. Through extensive packet-level and flow-level simulation, we demonstrate that PDQ significantly outperforms TCP, RCP and D3 in data center environments. We further show that PDQ is stable, resilient to packet loss, and preserves nearly all its performance gains even given inaccurate flow information. △ Less

Submitted 12 June, 2012; v1 submitted 10 June, 2012; originally announced June 2012.

Comments: The conference version was published in SIGCOMM 2012

arXiv:1206.1134 [pdf]

Shortest Paths in Less Than a Millisecond

Authors: Rachit Agarwal, Matthew Caesar, P. Brighten Godfrey, Ben Y. Zhao

Abstract: We consider the problem of answering point-to-point shortest path queries on massive social networks. The goal is to answer queries within tens of milliseconds while minimizing the memory requirements. We present a technique that achieves this goal for an extremely large fraction of path queries by exploiting the structure of the social networks. Using evaluations on real-world datasets, we argu… ▽ More We consider the problem of answering point-to-point shortest path queries on massive social networks. The goal is to answer queries within tens of milliseconds while minimizing the memory requirements. We present a technique that achieves this goal for an extremely large fraction of path queries by exploiting the structure of the social networks. Using evaluations on real-world datasets, we argue that our technique offers a unique trade-off between latency, memory and accuracy. For instance, for the LiveJournal social network (roughly 5 million nodes and 69 million edges), our technique can answer 99.9% of the queries in less than a millisecond. In comparison to storing all pair shortest paths, our technique requires at least 550x less memory; the average query time is roughly 365 microseconds --- 430x faster than the state-of-the-art shortest path algorithm. Furthermore, the relative performance of our technique improves with the size (and density) of the network. For the Orkut social network (3 million nodes and 220 million edges), for instance, our technique is roughly 2588x faster than the state-of-the-art algorithm for computing shortest paths. △ Less

Submitted 6 June, 2012; originally announced June 2012.

Comments: 6 pages; to appear in SIGCOMM WOSN 2012

arXiv:1201.2703 [pdf]

Faster Approximate Distance Queries and Compact Routing in Sparse Graphs

Authors: Rachit Agarwal, P. Brighten Godfrey, Sariel Har-Peled

Abstract: A distance oracle is a compact representation of the shortest distance matrix of a graph. It can be queried to approximate shortest paths between any pair of vertices. Any distance oracle that returns paths of worst-case stretch (2k-1) must require space $Ω(n^{1 + 1/k})$ for graphs of n nodes. The hard cases that enforce this lower bound are, however, rather dense graphs with average degree Ω(n^{1… ▽ More A distance oracle is a compact representation of the shortest distance matrix of a graph. It can be queried to approximate shortest paths between any pair of vertices. Any distance oracle that returns paths of worst-case stretch (2k-1) must require space $Ω(n^{1 + 1/k})$ for graphs of n nodes. The hard cases that enforce this lower bound are, however, rather dense graphs with average degree Ω(n^{1/k}). We present distance oracles that, for sparse graphs, substantially break the lower bound barrier at the expense of higher query time. For any 1 \leq α\leq n, our distance oracles can return stretch 2 paths using O(m + n^2/α) space and stretch 3 paths using O(m + n^2/α^2) space, at the expense of O(αm/n) query time. By setting appropriate values of α, we get the first distance oracles that have size linear in the size of the graph, and return constant stretch paths in non-trivial query time. The query time can be further reduced to O(α), by using an additional O(m α) space for all our distance oracles, or at the cost of a small constant additive stretch. We use our stretch 2 distance oracle to present the first compact routing scheme with worst-case stretch 2. Any compact routing scheme with stretch less than 2 must require linear memory at some nodes even for sparse graphs; our scheme, hence, achieves the optimal stretch with non-trivial memory requirements. Moreover, supported by large-scale simulations on graphs including the AS-level Internet graph, we argue that our stretch-2 scheme would be simple and efficient to implement as a distributed compact routing protocol. △ Less

Submitted 12 January, 2012; originally announced January 2012.

Comments: 20 pages, an earlier version appeared in INFOCOM 2011, this version presents data structures with improved space/query-time trade-off

arXiv:1201.1661 [pdf, ps, other]

Slick Packets

Authors: Giang T. K. Nguyen, Rachit Agarwal, Junda Liu, Matthew Caesar, P. Brighten Godfrey, Scott Shenker

Abstract: Source-controlled routing has been proposed as a way to improve flexibility of future network architectures, as well as simplifying the data plane. However, if a packet specifies its path, this precludes fast local re-routing within the network. We propose SlickPackets, a novel solution that allows packets to slip around failures by specifying alternate paths in their headers, in the form of compa… ▽ More Source-controlled routing has been proposed as a way to improve flexibility of future network architectures, as well as simplifying the data plane. However, if a packet specifies its path, this precludes fast local re-routing within the network. We propose SlickPackets, a novel solution that allows packets to slip around failures by specifying alternate paths in their headers, in the form of compactly-encoded directed acyclic graphs. We show that this can be accomplished with reasonably small packet headers for real network topologies, and results in responsiveness to failures that is competitive with past approaches that require much more state within the network. Our approach thus enables fast failure response while preserving the benefits of source-controlled routing. △ Less

Submitted 8 January, 2012; originally announced January 2012.

Comments: This is the full version of a paper with the same title that appeared in ACM SIGMETRICS 2011, with the inclusion of the appendix. 16 pages

ACM Class: C.2.1; C.2.2; C.2.6

arXiv:1110.1687 [pdf, other]

Jellyfish: Networking Data Centers Randomly

Authors: Ankit Singla, Chi-Yao Hong, Lucian Popa, P. Brighten Godfrey

Abstract: Industry experience indicates that the ability to incrementally expand data centers is essential. However, existing high-bandwidth network designs have rigid structure that interferes with incremental expansion. We present Jellyfish, a high-capacity network interconnect, which, by adopting a random graph topology, yields itself naturally to incremental expansion. Somewhat surprisingly, Jellyfish i… ▽ More Industry experience indicates that the ability to incrementally expand data centers is essential. However, existing high-bandwidth network designs have rigid structure that interferes with incremental expansion. We present Jellyfish, a high-capacity network interconnect, which, by adopting a random graph topology, yields itself naturally to incremental expansion. Somewhat surprisingly, Jellyfish is more cost-efficient than a fat-tree: A Jellyfish interconnect built using the same equipment as a fat-tree, supports as many as 25% more servers at full capacity at the scale of a few thousand nodes, and this advantage improves with scale. Jellyfish also allows great flexibility in building networks with different degrees of oversubscription. However, Jellyfish's unstructured design brings new challenges in routing, physical layout, and wiring. We describe and evaluate approaches that resolve these challenges effectively, indicating that Jellyfish could be deployed in today's data centers. △ Less

Submitted 20 April, 2012; v1 submitted 7 October, 2011; originally announced October 2011.

Comments: 14 pages, 12 figures

arXiv:1108.0192 [pdf, other]

BGP Stability is Precarious

Authors: P. Brighten Godfrey

Abstract: We note a fact which is simple, but may be useful for the networking research community: essentially any change to BGP's decision process can cause divergence --- or convergence when BGP would otherwise diverge. We note a fact which is simple, but may be useful for the networking research community: essentially any change to BGP's decision process can cause divergence --- or convergence when BGP would otherwise diverge. △ Less

Submitted 31 July, 2011; originally announced August 2011.

arXiv:0803.0632 [pdf, ps, other]

Network Coding for Distributed Storage Systems

Authors: Alexandros G. Dimakis, P. Brighten Godfrey, Yunnan Wu, Martin J. Wainwright, Kannan Ramchandran

Abstract: Distributed storage systems provide reliable access to data through redundancy spread over individually unreliable nodes. Application scenarios include data centers, peer-to-peer storage systems, and storage in wireless networks. Storing data using an erasure code, in fragments spread across nodes, requires less redundancy than simple replication for the same level of reliability. However, since… ▽ More Distributed storage systems provide reliable access to data through redundancy spread over individually unreliable nodes. Application scenarios include data centers, peer-to-peer storage systems, and storage in wireless networks. Storing data using an erasure code, in fragments spread across nodes, requires less redundancy than simple replication for the same level of reliability. However, since fragments must be periodically replaced as nodes fail, a key question is how to generate encoded fragments in a distributed way while transferring as little data as possible across the network. For an erasure coded system, a common practice to repair from a node failure is for a new node to download subsets of data stored at a number of surviving nodes, reconstruct a lost coded block using the downloaded data, and store it at the new node. We show that this procedure is sub-optimal. We introduce the notion of regenerating codes, which allow a new node to download \emph{functions} of the stored data from the surviving nodes. We show that regenerating codes can significantly reduce the repair bandwidth. Further, we show that there is a fundamental tradeoff between storage and repair bandwidth which we theoretically characterize using flow arguments on an appropriately constructed graph. By invoking constructive results in network coding, we introduce regenerating codes that can achieve any point in this optimal tradeoff. △ Less

Submitted 5 March, 2008; originally announced March 2008.

Report number: EECS 14546

arXiv:cs/0702015 [pdf, ps, other]

Network Coding for Distributed Storage Systems

Authors: Alexandros G. Dimakis, P. Brighten Godfrey, Martin J. Wainwright, Kannan Ramchandran

Abstract: Peer-to-peer distributed storage systems provide reliable access to data through redundancy spread over nodes across the Internet. A key goal is to minimize the amount of bandwidth used to maintain that redundancy. Storing a file using an erasure code, in fragments spread across nodes, promises to require less redundancy and hence less maintenance bandwidth than simple replication to provide the… ▽ More Peer-to-peer distributed storage systems provide reliable access to data through redundancy spread over nodes across the Internet. A key goal is to minimize the amount of bandwidth used to maintain that redundancy. Storing a file using an erasure code, in fragments spread across nodes, promises to require less redundancy and hence less maintenance bandwidth than simple replication to provide the same level of reliability. However, since fragments must be periodically replaced as nodes fail, a key question is how to generate a new fragment in a distributed way while transferring as little data as possible across the network. In this paper, we introduce a general technique to analyze storage architectures that combine any form of coding and replication, as well as presenting two new schemes for maintaining redundancy using erasure codes. First, we show how to optimally generate MDS fragments directly from existing fragments in the system. Second, we introduce a new scheme called Regenerating Codes which use slightly larger fragments than MDS but have lower overall bandwidth use. We also show through simulation that in realistic environments, Regenerating Codes can reduce maintenance bandwidth use by 25 percent or more compared with the best previous design--a hybrid of replication and erasure codes--while simplifying system architecture. △ Less

Submitted 2 February, 2007; originally announced February 2007.

Comments: To appear in INFOCOM 2007

Showing 1–23 of 23 results for author: Godfrey, P B