-
Towards a Standardized Representation for Deep Learning Collective Algorithms
Authors:
Jinsun Yoo,
William Won,
Meghan Cowan,
Nan Jiang,
Benjamin Klenk,
Srinivas Sridharan,
Tushar Krishna
Abstract:
The explosion of machine learning model size has led to its execution on distributed clusters at a very large scale. Many works have tried to optimize the process of producing collective algorithms and running collective communications, which act as a bottleneck to distributed machine learning. However, different works use their own collective algorithm representation, pushing away from co-optimiz…
▽ More
The explosion of machine learning model size has led to its execution on distributed clusters at a very large scale. Many works have tried to optimize the process of producing collective algorithms and running collective communications, which act as a bottleneck to distributed machine learning. However, different works use their own collective algorithm representation, pushing away from co-optimizing collective communication and the rest of the workload. The lack of a standardized collective algorithm representation has also hindered interoperability between collective algorithm producers and consumers. Additionally, tool-specific conversions and modifications have to be made for each pair of tools producing and consuming collective algorithms which adds to engineering efforts.
In this position paper, we propose a standardized workflow leveraging a common collective algorithm representation. Upstream producers and downstream consumers converge to a common representation format based on Chakra Execution Trace, a commonly used graph based representation of distributed machine learning workloads. Such a common representation enables us to view collective communications at the same level as workload operations and decouple producer and consumer tools, enhance interoperability, and relieve the user from the burden of having to focus on downstream implementations. We provide a proof-of-concept of this standardized workflow by simulating collective algorithms generated by the MSCCLang domain-specific language through the ASTRA-sim distributed machine learning simulator using various network configurations.
△ Less
Submitted 20 August, 2024;
originally announced August 2024.
-
FRED: Flexible REduction-Distribution Interconnect and Communication Implementation for Wafer-Scale Distributed Training of DNN Models
Authors:
Saeed Rashidi,
William Won,
Sudarshan Srinivasan,
Puneet Gupta,
Tushar Krishna
Abstract:
Distributed Deep Neural Network (DNN) training is a technique to reduce the training overhead by distributing the training tasks into multiple accelerators, according to a parallelization strategy. However, high-performance compute and interconnects are needed for maximum speed-up and linear scaling of the system. Wafer-scale systems are a promising technology that allows for tightly integrating h…
▽ More
Distributed Deep Neural Network (DNN) training is a technique to reduce the training overhead by distributing the training tasks into multiple accelerators, according to a parallelization strategy. However, high-performance compute and interconnects are needed for maximum speed-up and linear scaling of the system. Wafer-scale systems are a promising technology that allows for tightly integrating high-end accelerators with high-speed wafer-scale interconnects, making it an attractive platform for distributed training. However, the wafer-scale interconnect should offer high performance and flexibility for various parallelization strategies to enable maximum optimizations for compute and memory usage. In this paper, we propose FRED, a wafer-scale interconnect that is tailored for the high-BW requirements of wafer-scale networks and can efficiently execute communication patterns of different parallelization strategies. Furthermore, FRED supports in-switch collective communication execution that reduces the network traffic by approximately 2X. Our results show that FRED can improve the average end-to-end training time of ResNet-152, Transformer-17B, GPT-3, and Transformer-1T by 1.76X, 1.87X, 1.34X, and 1.4X, respectively when compared to a baseline waferscale 2D-Mesh fabric.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
Adversarial Sleeping Bandit Problems with Multiple Plays: Algorithm and Ranking Application
Authors:
Jianjun Yuan,
Wei Lee Woon,
Ludovik Coba
Abstract:
This paper presents an efficient algorithm to solve the sleeping bandit with multiple plays problem in the context of an online recommendation system. The problem involves bounded, adversarial loss and unknown i.i.d. distributions for arm availability. The proposed algorithm extends the sleeping bandit algorithm for single arm selection and is guaranteed to achieve theoretical performance with reg…
▽ More
This paper presents an efficient algorithm to solve the sleeping bandit with multiple plays problem in the context of an online recommendation system. The problem involves bounded, adversarial loss and unknown i.i.d. distributions for arm availability. The proposed algorithm extends the sleeping bandit algorithm for single arm selection and is guaranteed to achieve theoretical performance with regret upper bounded by $\bigO(kN^2\sqrt{T\log T})$, where $k$ is the number of arms selected per time step, $N$ is the total number of arms, and $T$ is the time horizon.
△ Less
Submitted 26 July, 2023;
originally announced July 2023.
-
TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning
Authors:
William Won,
Midhilesh Elavazhagan,
Sudarshan Srinivasan,
Swati Gupta,
Tushar Krishna
Abstract:
The surge of artificial intelligence, particularly large language models, has driven the rapid development of large-scale machine learning clusters. Executing distributed models on these clusters is often constrained by communication overhead, making efficient utilization of available network resources crucial. As a result, the routing algorithm employed for collective communications (i.e., collec…
▽ More
The surge of artificial intelligence, particularly large language models, has driven the rapid development of large-scale machine learning clusters. Executing distributed models on these clusters is often constrained by communication overhead, making efficient utilization of available network resources crucial. As a result, the routing algorithm employed for collective communications (i.e., collective algorithms) plays a pivotal role in determining overall performance. Unfortunately, existing collective communication libraries for distributed machine learning are limited by a fixed set of basic collective algorithms. This limitation hinders communication optimization, especially in modern clusters with heterogeneous and asymmetric topologies. Furthermore, manually designing collective algorithms for all possible combinations of network topologies and collective patterns requires heavy engineering and validation efforts. To address these challenges, this paper presents TACOS, an autonomous synthesizer capable of automatically generating topology-aware collective algorithms tailored to specific collective patterns and network topologies. TACOS is highly flexible, synthesizing an All-Reduce algorithm for a heterogeneous 128-NPU system in just 1.08 seconds, while achieving up to a 4.27x performance improvement over state-of-the-art synthesizers. Additionally, TACOS demonstrates better scalability with polynomial synthesis times, in contrast to NP-hard approaches which only scale to systems with tens of NPUs. TACOS can synthesize for 40K NPUs in just 2.52 hours.
△ Less
Submitted 2 October, 2024; v1 submitted 11 April, 2023;
originally announced April 2023.
-
ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale
Authors:
William Won,
Taekyung Heo,
Saeed Rashidi,
Srinivas Sridharan,
Sudarshan Srinivasan,
Tushar Krishna
Abstract:
As deep learning models and input data are scaling at an unprecedented rate, it is inevitable to move towards distributed training platforms to fit the model and increase training throughput. State-of-the-art approaches and techniques, such as wafer-scale nodes, multi-dimensional network topologies, disaggregated memory systems, and parallelization strategies, have been actively adopted by emergin…
▽ More
As deep learning models and input data are scaling at an unprecedented rate, it is inevitable to move towards distributed training platforms to fit the model and increase training throughput. State-of-the-art approaches and techniques, such as wafer-scale nodes, multi-dimensional network topologies, disaggregated memory systems, and parallelization strategies, have been actively adopted by emerging distributed training systems. This results in a complex SW/HW co-design stack of distributed training, necessitating a modeling/simulation infrastructure for design-space exploration. In this paper, we extend the open-source ASTRA-sim infrastructure and endow it with the capabilities to model state-of-the-art and emerging distributed training models and platforms. More specifically, (i) we enable ASTRA-sim to support arbitrary model parallelization strategies via a graph-based training-loop implementation, (ii) we implement a parameterizable multi-dimensional heterogeneous topology generation infrastructure with analytical performance estimates enabling simulating target systems at scale, and (iii) we enhance the memory system modeling to support accurate modeling of in-network collective communication and disaggregated memory systems. With such capabilities, we run comprehensive case studies targeting emerging distributed models and platforms. This infrastructure lets system designers swiftly traverse the complex co-design stack and give meaningful insights when designing and deploying distributed training platforms at scale.
△ Less
Submitted 24 March, 2023;
originally announced March 2023.
-
Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models
Authors:
Saeed Rashidi,
William Won,
Sudarshan Srinivasan,
Srinivas Sridharan,
Tushar Krishna
Abstract:
Distributed training is a solution to reduce DNN training time by splitting the task across multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead between the NPUs in order to synchronize the gradients and/or activation, depending on the parallelization strategy. In next-generation platforms for training at scale, NPUs will be connected through multi-dimensional n…
▽ More
Distributed training is a solution to reduce DNN training time by splitting the task across multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead between the NPUs in order to synchronize the gradients and/or activation, depending on the parallelization strategy. In next-generation platforms for training at scale, NPUs will be connected through multi-dimensional networks with diverse, heterogeneous bandwidths. This work identifies a looming challenge of keeping all network dimensions busy and maximizing the network BW within the hybrid environment if we leverage scheduling techniques for collective communication on systems today. We propose Themis, a novel collective scheduling scheme that dynamically schedules collectives (divided into chunks) to balance the communication loads across all dimensions, further improving the network BW utilization. Our results show that on average, Themis can improve the network BW utilization of the single All-Reduce by 1.72X (2.70X max), and improve the end-to-end training iteration performance of real workloads such as ResNet-152, GNMT, DLRM, and Transformer-1T by 1.49X (2.25X max), 1.30X (1.78X max), 1.30X (1.77X max), and 1.25X (1.53X max), respectively.
△ Less
Submitted 7 July, 2022; v1 submitted 9 October, 2021;
originally announced October 2021.
-
LIBRA: Enabling Workload-aware Multi-dimensional Network Topology Optimization for Distributed Training of Large AI Models
Authors:
William Won,
Saeed Rashidi,
Sudarshan Srinivasan,
Tushar Krishna
Abstract:
As model sizes in machine learning continue to scale, distributed training is necessary to accommodate model weights within each device and to reduce training time. However, this comes with the expense of increased communication overhead due to the exchange of gradients and activations, which become the critical bottleneck of the end-to-end training process. In this work, we motivate the design of…
▽ More
As model sizes in machine learning continue to scale, distributed training is necessary to accommodate model weights within each device and to reduce training time. However, this comes with the expense of increased communication overhead due to the exchange of gradients and activations, which become the critical bottleneck of the end-to-end training process. In this work, we motivate the design of multi-dimensional networks within machine learning systems as a cost-efficient mechanism to enhance overall network bandwidth. We also identify that optimal bandwidth allocation is pivotal for multi-dimensional networks to ensure efficient resource utilization. We introduce LIBRA, a framework specifically focused on optimizing multi-dimensional fabric architectures. Through case studies, we demonstrate the value of LIBRA, both in architecting optimized fabrics under diverse constraints and in enabling co-optimization opportunities.
△ Less
Submitted 5 May, 2024; v1 submitted 24 September, 2021;
originally announced September 2021.
-
Extending Sparse Tensor Accelerators to Support Multiple Compression Formats
Authors:
Eric Qin,
Geonhwa Jeong,
William Won,
Sheng-Chun Kao,
Hyoukjun Kwon,
Sudarshan Srinivasan,
Dipankar Das,
Gordon E. Moon,
Sivasankaran Rajamanickam,
Tushar Krishna
Abstract:
Sparsity, which occurs in both scientific applications and Deep Learning (DL) models, has been a key target of optimization within recent ASIC accelerators due to the potential memory and compute savings. These applications use data stored in a variety of compression formats. We demonstrate that both the compactness of different compression formats and the compute efficiency of the algorithms enab…
▽ More
Sparsity, which occurs in both scientific applications and Deep Learning (DL) models, has been a key target of optimization within recent ASIC accelerators due to the potential memory and compute savings. These applications use data stored in a variety of compression formats. We demonstrate that both the compactness of different compression formats and the compute efficiency of the algorithms enabled by them vary across tensor dimensions and amount of sparsity. Since DL and scientific workloads span across all sparsity regions, there can be numerous format combinations for optimizing memory and compute efficiency. Unfortunately, many proposed accelerators operate on one or two fixed format combinations. This work proposes hardware extensions to accelerators for supporting numerous format combinations seamlessly and demonstrates ~4X speedup over performing format conversions in software.
△ Less
Submitted 18 March, 2021;
originally announced March 2021.
-
Explainable AI as a Social Microscope: A Case Study on Academic Performance
Authors:
Anahit Sargsyan,
Areg Karapetyan,
Wei Lee Woon,
Aamena Alshamsi
Abstract:
Academic performance is perceived as a product of complex interactions between students' overall experience, personal characteristics and upbringing. Data science techniques, most commonly involving regression analysis and related approaches, serve as a viable means to explore this interplay. However, these tend to extract factors with wide-ranging impact, while overlooking variations specific to…
▽ More
Academic performance is perceived as a product of complex interactions between students' overall experience, personal characteristics and upbringing. Data science techniques, most commonly involving regression analysis and related approaches, serve as a viable means to explore this interplay. However, these tend to extract factors with wide-ranging impact, while overlooking variations specific to individual students. Focusing on each student's peculiarities is generally impossible with thousands or even hundreds of subjects, yet data mining methods might prove effective in devising more targeted approaches. For instance, subjects with shared characteristics can be assigned to clusters, which can then be examined separately with machine learning algorithms, thereby providing a more nuanced view of the factors affecting individuals in a particular group. In this context, we introduce a data science workflow allowing for fine-grained analysis of academic performance correlates that captures the subtle differences in students' sensitivities to these factors. Leveraging the Local Interpretable Model-Agnostic Explanations (LIME) algorithm from the toolbox of Explainable Artificial Intelligence (XAI) techniques, the proposed pipeline yields groups of students having similar academic attainment indicators, rather than similar features (e.g. familial background) as typically practiced in prior studies. As a proof-of-concept case study, a rich longitudinal dataset is selected to evaluate the effectiveness of the proposed approach versus a standard regression model.
△ Less
Submitted 4 June, 2020; v1 submitted 7 June, 2018;
originally announced June 2018.
-
The Preeminence of Ethnic Diversity in Scientific Collaboration
Authors:
Bedoor K AlShebli,
Talal Rahwan,
Wei Lee Woon
Abstract:
Inspired by the social and economic benefits of diversity, we analyze over 9 million papers and 6 million scientists to study the relationship between research impact and five classes of diversity: ethnicity, discipline, gender, affiliation, and academic age. Using randomized baseline models, we establish the presence of homophily in ethnicity, gender and affiliation. We then study the effect of d…
▽ More
Inspired by the social and economic benefits of diversity, we analyze over 9 million papers and 6 million scientists to study the relationship between research impact and five classes of diversity: ethnicity, discipline, gender, affiliation, and academic age. Using randomized baseline models, we establish the presence of homophily in ethnicity, gender and affiliation. We then study the effect of diversity on scientific impact, as reflected in citations. Remarkably, of the classes considered, ethnic diversity had the strongest correlation with scientific impact. To further isolate the effects of ethnic diversity, we used randomized baseline models and again found a clear link between diversity and impact. To further support these findings, we use coarsened exact matching to compare the scientific impact of ethnically diverse papers and scientists with closely-matched control groups. Here, we find that ethnic diversity resulted in an impact gain of 10.63% for papers, and 47.67% for scientists.
△ Less
Submitted 20 November, 2020; v1 submitted 6 March, 2018;
originally announced March 2018.
-
Co-occurrence matrix analysis-based semi-supervised training for object detection
Authors:
Min-Kook Choi,
Jaehyeong Park,
Jihun Jung,
Heechul Jung,
Jin-Hee Lee,
Woong Jae Won,
Woo Young Jung,
Jincheol Kim,
Soon Kwon
Abstract:
One of the most important factors in training object recognition networks using convolutional neural networks (CNNs) is the provision of annotated data accompanying human judgment. Particularly, in object detection or semantic segmentation, the annotation process requires considerable human effort. In this paper, we propose a semi-supervised learning (SSL)-based training methodology for object det…
▽ More
One of the most important factors in training object recognition networks using convolutional neural networks (CNNs) is the provision of annotated data accompanying human judgment. Particularly, in object detection or semantic segmentation, the annotation process requires considerable human effort. In this paper, we propose a semi-supervised learning (SSL)-based training methodology for object detection, which makes use of automatic labeling of un-annotated data by applying a network previously trained from an annotated dataset. Because an inferred label by the trained network is dependent on the learned parameters, it is often meaningless for re-training the network. To transfer a valuable inferred label to the unlabeled data, we propose a re-alignment method based on co-occurrence matrix analysis that takes into account one-hot-vector encoding of the estimated label and the correlation between the objects in the image. We used an MS-COCO detection dataset to verify the performance of the proposed SSL method and deformable neural networks (D-ConvNets) as an object detector for basic training. The performance of the existing state-of-the-art detectors (DConvNets, YOLO v2, and single shot multi-box detector (SSD)) can be improved by the proposed SSL method without using the additional model parameter or modifying the network architecture.
△ Less
Submitted 19 February, 2018;
originally announced February 2018.
-
The Word Problem in the Baumslag group with a non-elementary Dehn function is polynomial time decidable
Authors:
Alexei Miasnikov,
Alexander Ushakov,
Dong Wook Won
Abstract:
We prove that the Word problem in the Baumslag group G(1,2) which has a non-elementary Dehn function is decidable in polynomial time.
We prove that the Word problem in the Baumslag group G(1,2) which has a non-elementary Dehn function is decidable in polynomial time.
△ Less
Submitted 12 February, 2011;
originally announced February 2011.
-
Power Circuits, Exponential Algebra, and Time Complexity
Authors:
Alexei G. Myasnikov,
Alexander Ushakov,
Dong Wook Won
Abstract:
Motivated by algorithmic problems from combinatorial group theory we study computational properties of integers equipped with binary operations +, -, z = x 2^y, z = x 2^{-y} (the former two are partial) and predicates < and =. Notice that in this case very large numbers, which are obtained as n towers of exponentiation in the base 2 can be realized as n applications of the operation x2^y, so worki…
▽ More
Motivated by algorithmic problems from combinatorial group theory we study computational properties of integers equipped with binary operations +, -, z = x 2^y, z = x 2^{-y} (the former two are partial) and predicates < and =. Notice that in this case very large numbers, which are obtained as n towers of exponentiation in the base 2 can be realized as n applications of the operation x2^y, so working with such numbers given in the usual binary expansions requires super exponential space. We define a new compressed representation for integers by power circuits (a particular type of straight-line programs) which is unique and easily computable, and show that the operations above can be performed in polynomial time if the numbers are presented by power circuits. We mention several applications of this technique to algorithmic problems, in particular, we prove that the quantifier-free theories of various exponential algebras are decidable in polynomial time, as well as the word problems in some "hard to crack" one-relator groups.
△ Less
Submitted 13 June, 2010;
originally announced June 2010.