-
SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
Authors:
Raghu Prabhakar,
Ram Sivaramakrishnan,
Darshan Gandhi,
Yun Du,
Mingran Wang,
Xiangyu Song,
Kejie Zhang,
Tianren Gao,
Angela Wang,
Karen Li,
Yongning Sheng,
Joshua Brot,
Denis Sokolov,
Apurv Vivek,
Calvin Leung,
Arjun Sabnis,
Jiayu Bai,
Tuowen Zhao,
Mark Gottscho,
David Jackson,
Mark Luttrell,
Manish K. Shah,
Edison Chen,
Kaizhao Liang,
Swayambhoo Jain
, et al. (5 additional authors not shown)
Abstract:
Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications. Training, serving, and maintaining monolithic LLMs at scale, however, remains prohibitively expensive and challenging. The disproportionate increase in the compute-to-memory ratio of modern AI accelerators has created a memory wall, necessitating new methods to deploy AI. Composition of Experts (CoE) is an alternative modular approach that lowers the cost and complexity of training and serving. However, this approach presents two key challenges when using conventional hardware: (1) without fused operations, smaller models have lower operational intensity, which makes high utilization more challenging to achieve; and (2) hosting a large number of models can be either prohibitively expensive or slow when dynamically switching between them.
In this paper, we describe how combining CoE, streaming dataflow, and a three-tier memory system scales the AI memory wall. We describe Samba-CoE, a CoE system with 150 experts and a trillion total parameters. We deploy Samba-CoE on the SambaNova SN40L Reconfigurable Dataflow Unit (RDU) - a commercial dataflow accelerator architecture that has been co-designed for enterprise inference and training applications. The chip introduces a new three-tier memory system with on-chip distributed SRAM, on-package HBM, and off-package DDR DRAM. A dedicated inter-RDU network enables scaling up and out over multiple sockets. We demonstrate speedups ranging from 2x to 13x on various benchmarks running on eight RDU sockets compared with an unfused baseline. We show that for CoE inference deployments, the 8-socket RDU Node reduces machine footprint by up to 19x, speeds up model switching time by 15x to 31x, and achieves an overall speedup of 3.7x over a DGX H100 and 6.6x over a DGX A100.
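The first challenge named in this abstract (low operational intensity limiting utilization) can be illustrated with a roofline-style estimate. The sketch below is an illustration only: the peak-throughput and bandwidth figures are hypothetical placeholders, not SN40L, H100, or A100 specifications, and `attainable_flops` is a name introduced here.

```python
def attainable_flops(peak_flops: float, mem_bandwidth: float, intensity: float) -> float:
    """Roofline bound: throughput is capped by compute or by memory traffic.

    intensity is the operational intensity in FLOPs per byte moved.
    """
    return min(peak_flops, mem_bandwidth * intensity)


# Hypothetical accelerator (assumed numbers): 100 TFLOP/s peak, 1 TB/s bandwidth.
PEAK = 100e12
BANDWIDTH = 1e12

for intensity in (1.0, 10.0, 100.0, 1000.0):
    achieved = attainable_flops(PEAK, BANDWIDTH, intensity)
    print(f"intensity={intensity:7.1f} FLOP/B -> utilization={achieved / PEAK:.0%}")
```

Under this simple model, a kernel below the ridge point (`PEAK / BANDWIDTH` FLOPs per byte) is memory-bound, so raising intensity, for example by fusing operators so that intermediates stay on chip, is what moves it toward full utilization.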
Submitted 13 May, 2024;
originally announced May 2024.
-
Sampling and Certifying Symmetric Functions
Authors:
Yuval Filmus,
Itai Leigh,
Artur Riazanov,
Dmitry Sokolov
Abstract:
A circuit $\mathcal{C}$ samples a distribution $\mathbf{X}$ with an error $ε$ if the statistical distance between the output of $\mathcal{C}$ on the uniform input and $\mathbf{X}$ is $ε$. We study the hardness of sampling a uniform distribution over the set of $n$-bit strings of Hamming weight $k$ denoted by $\mathbf{U}^n_k$ for _decision forests_, i.e. every output bit is computed as a decision tree of the inputs. For every $k$ there is an $O(\log n)$-depth decision forest sampling $\mathbf{U}^n_k$ with an inverse-polynomial error [Viola 2012, Czumaj 2015]. We show that for every $ε> 0$ there exists $τ$ such that for decision depth $τ\log (n/k) / \log \log (n/k)$, the error for sampling $\mathbf{U}_k^n$ is at least $1-ε$. Our result is based on the recent robust sunflower lemma [Alweiss, Lovett, Wu, Zhang 2021, Rao 2019].
Our second result is about matching a set of $n$-bit strings with the image of a $d$-_local_ circuit, i.e. such that each output bit depends on at most $d$ input bits. We study the set of all $n$-bit strings whose Hamming weight is at least $n/2$. We improve the previously known locality lower bound from $Ω(\log^* n)$ [Beyersdorff, Datta, Krebs, Mahajan, Scharfenberger-Fabian, Sreenivasaiah, Thomas and Vollmer, 2013] to $Ω(\sqrt{\log n})$, leaving only a quartic gap from the best upper bound of $O(\log^2 n)$.
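The sampling error defined in this abstract can be computed exactly for tiny instances. The following is a minimal sketch, not taken from the paper: `toy_sampler` is a hypothetical sampler in which each output bit reads a single input bit, and all function names are introduced here for illustration.

```python
from collections import Counter
from itertools import combinations, product


def uniform_weight_k(n: int, k: int) -> dict:
    """Exact uniform distribution U^n_k over n-bit strings of Hamming weight k."""
    support = []
    for ones in combinations(range(n), k):
        support.append(tuple(1 if i in ones else 0 for i in range(n)))
    return {s: 1.0 / len(support) for s in support}


def output_distribution(sampler, num_input_bits: int) -> dict:
    """Distribution of sampler(x) when x is uniform over {0,1}^num_input_bits."""
    counts = Counter(sampler(x) for x in product((0, 1), repeat=num_input_bits))
    total = 2 ** num_input_bits
    return {y: c / total for y, c in counts.items()}


def statistical_distance(p: dict, q: dict) -> float:
    """Total variation distance: half the L1 distance between p and q."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(y, 0.0) - q.get(y, 0.0)) for y in keys)


def toy_sampler(x):
    """Hypothetical sampler: output bits are x0, NOT x0, and x1."""
    return (x[0], 1 - x[0], x[1])


error = statistical_distance(output_distribution(toy_sampler, 2), uniform_weight_k(3, 1))
print(f"sampling error against U^3_1: {error:.3f}")
```

Enumerating all inputs is only feasible for very small $n$; the sketch is meant to make the definition concrete, not to reproduce the lower-bound argument.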
Submitted 7 May, 2023;
originally announced May 2023.
-
Top-Down Lower Bounds for Depth-Four Circuits
Authors:
Mika Göös,
Artur Riazanov,
Anastasia Sofronova,
Dmitry Sokolov
Abstract:
We present a top-down lower-bound method for depth-$4$ boolean circuits. In particular, we give a new proof of the well-known result that the parity function requires depth-$4$ circuits of size exponential in $n^{1/3}$. Our proof is an application of robust sunflowers and block unpredictability.
Submitted 2 May, 2024; v1 submitted 5 April, 2023;
originally announced April 2023.
-
Practical lowest distortion mapping
Authors:
Vladimir Garanzha,
Igor Kaporin,
Liudmila Kudryavtseva,
François Protais,
David Desobry,
Dmitry Sokolov
Abstract:
Construction of optimal deformations is one of the long-standing problems of computational mathematics. We consider the problem of computing quasi-isometric deformations with minimal possible quasi-isometry constant (global estimate for relative length change). We build our technique upon [Garanzha et al. 2021a], a recently proposed numerical optimization scheme that provably untangles 2D and 3D meshes with inverted elements by partially solving a finite number of minimization problems. In this paper we show the similarity between continuation problems for mesh untangling and for attaining a prescribed deformation quality threshold. Both problems can be solved by a finite number of partial solutions of optimization problems which are based on finite element approximations of parameter-dependent hyperelastic functionals. Our method is based on a polyconvex functional which admits a well-posed variational problem. To sum up, we reliably build 2D and 3D mesh deformations with smallest known distortion estimates (quasi-isometry constants) as well as stable quasi-conformal parameterizations for very stiff problems.
Submitted 28 January, 2022;
originally announced January 2022.
-
Foldover-free maps in 50 lines of code
Authors:
Vladimir Garanzha,
Igor Kaporin,
Liudmila Kudryavtseva,
François Protais,
Nicolas Ray,
Dmitry Sokolov
Abstract:
Mapping a triangulated surface to 2D space (or a tetrahedral mesh to 3D space) is the most fundamental problem in geometry processing. In computational physics, untangling plays an important role in mesh generation: it takes a mesh as an input, and moves the vertices to get rid of foldovers. In fact, mesh untangling can be considered as a special case of mapping where the geometry of the object is to be defined in the map space and the geometric domain is not explicit, supposing that each element is regular. In this paper, we propose a mapping method inspired by the untangling problem and compare its performance to the state of the art. The main advantage of our method is that the untangling aims at producing locally injective maps, which is the major challenge of mapping. In practice, our method produces locally injective maps in very difficult settings, and with less distortion than the previous work, both in 2D and 3D. We demonstrate it on a large reference database as well as on more difficult stress tests. For better reproducibility, we publish the code in Python for a basic evaluation, and in C++ for more advanced applications.
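For a 2D triangle mesh, a foldover shows up as a mapped triangle whose signed area is not positive. The sketch below only illustrates that detection test; it is not the authors' published code or their untangling method, and `signed_area` and `inverted_triangles` are names introduced here.

```python
def signed_area(a, b, c) -> float:
    """Signed area of the 2D triangle (a, b, c); positive when counter-clockwise."""
    return 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))


def inverted_triangles(vertices, triangles):
    """Indices of triangles whose image is flipped or degenerate (area <= 0)."""
    return [
        t for t, (i, j, k) in enumerate(triangles)
        if signed_area(vertices[i], vertices[j], vertices[k]) <= 0.0
    ]


# Two triangles sharing the edge (0, 2), both counter-clockwise in the rest shape.
triangles = [(0, 1, 2), (0, 2, 3)]
good_map = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
# Dragging vertex 3 across the shared edge folds the second triangle over.
folded_map = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (1.0, 0.5)]

print(inverted_triangles(good_map, triangles))
print(inverted_triangles(folded_map, triangles))
```

An untangling method would move the vertices until this list is empty; the check itself only reports which elements are currently inverted.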
Submitted 5 February, 2021;
originally announced February 2021.
-
Exponential Resolution Lower Bounds for Weak Pigeonhole Principle and Perfect Matching Formulas over Sparse Graphs
Authors:
Susanna F. de Rezende,
Jakob Nordström,
Kilian Risse,
Dmitry Sokolov
Abstract:
We show exponential lower bounds on resolution proof length for pigeonhole principle (PHP) formulas and perfect matching formulas over highly unbalanced, sparse expander graphs, thus answering the challenge to establish strong lower bounds in the regime between balanced constant-degree expanders as in [Ben-Sasson and Wigderson '01] and highly unbalanced, dense graphs as in [Raz '04] and [Razborov '03, '04]. We obtain our results by revisiting Razborov's pseudo-width method for PHP formulas over dense graphs and extending it to sparse graphs. This further demonstrates the power of the pseudo-width method, and we believe it could also be useful for attacking other longstanding open problems for resolution and other proof systems.
Submitted 1 December, 2019;
originally announced December 2019.
-
Anti-aliasing for fused filament deposition
Authors:
Hai-Chuan Song,
Nicolas Ray,
Dmitry Sokolov,
Sylvain Lefebvre
Abstract:
Layered manufacturing inherently suffers from staircase defects along surfaces that are gently sloped with respect to the build direction. Reducing the slice thickness improves the situation but never resolves it completely as flat layers remain a poor approximation of the true surface in these regions. In addition, reducing the slice thickness substantially increases the print time. In this work we focus on a simple yet effective technique to improve the print accuracy for layered manufacturing by filament deposition. Our method works with standard three-axis 3D filament printers (e.g. the typical, widely available 3D printers), using standard extrusion nozzles. It better reproduces the geometry of sloped surfaces without increasing the print time. Our key idea is to perform a local anti-aliasing, working at a sub-layer accuracy to produce slightly curved deposition paths and reduce approximation errors. This is inspired by Computer Graphics anti-aliasing techniques which consider sub-pixel precision to treat aliasing effects. We show that the necessary deviation in height compared to standard slicing is bounded by half the layer thickness. Therefore, the height changes remain small and plastic deposition remains reliable. We further split and order paths to minimize defects due to the extruder nozzle shape, avoiding any change to the existing hardware. We apply and analyze our approach on 3D printed examples, showing that our technique greatly improves surface accuracy and silhouette quality while keeping the print time nearly identical.
Submitted 10 April, 2017; v1 submitted 10 September, 2016;
originally announced September 2016.
-
Inappropriate use of L-BFGS, Illustrated on frame field design
Authors:
Nicolas Ray,
Dmitry Sokolov
Abstract:
L-BFGS is a hill climbing method that is guaranteed to converge only for convex problems. In computer graphics, it is often used as a black box solver for a more general class of nonlinear problems, including problems having many local minima. Some works obtain very nice results by solving such difficult problems with L-BFGS. Surprisingly, the method is able to escape local minima: our interpretation is that the approximation of the Hessian is smoother than the real Hessian, making it possible to evade the local minima. We analyse the behavior of L-BFGS on the design of 2D frame fields. It involves an energy function that is infinitely continuous, strongly nonlinear, and has many local minima. Moreover, the local minima have a clear visual interpretation: they correspond to different frame field topologies. We observe that the performance of L-BFGS is almost unpredictable: it is very competitive when the field is sampled on the primal graph, but really poor when it is sampled on the dual graph.
Submitted 12 August, 2015;
originally announced August 2015.
-
On Smooth 3D Frame Field Design
Authors:
Nicolas Ray,
Dmitry Sokolov
Abstract:
We analyze current methods that generate smooth frame fields both in 2D and in 3D. We formalize the 2D problem by representing frames as functions (as it was done in 3D), and show that the derived optimization problem is the one that previous work obtains via "representation vectors." We show (in 2D) why this nonlinear optimization problem is easier to solve than directly minimizing the rotation angle of the field, and observe that the 2D algorithm is able to find good fields.
Now, the 2D and the 3D optimization problems are derived from the same formulation (based on representing frames by functions). Their energies share some similarities from an optimization point of view (smoothness, local minima, bounds of partial derivatives, etc.), so we applied the 2D resolution mechanism to the 3D problem. Our evaluation of all existing 3D methods suggests initializing the field with this new algorithm, but possibly using another method for further smoothing.
Submitted 13 July, 2015;
originally announced July 2015.
-
Tree-like resolution complexity of two planar problems
Authors:
Dmitry Itsykson,
Anna Malova,
Vsevolod Oparin,
Dmitry Sokolov
Abstract:
We consider two CSP problems: the first CSP encodes 2D Sperner's lemma for the standard triangulation of the right triangle on $n^2$ small triangles; the second CSP encodes the fact that it is impossible to match cells of an $n \times n$ square to arrows (two horizontal, two vertical and four diagonal) such that arrows in two cells with a common edge differ by at most $45^\circ$, and all arrows on the boundary of the square do not point outside (this fact is a corollary of Brouwer's fixed point theorem). We prove that the tree-like resolution complexities of these CSPs are $2^{Θ(n)}$. For Sperner's lemma our result implies an $Ω(n)$ lower bound on the number of queries to vertex colors needed to find a trichromatic triangle; this lower bound was originally proved by Crescenzi and Silvestri.
The CSP based on Sperner's lemma is related to a $\rm PPAD$-complete problem. We show that the CSP corresponding to arrows is also related to a $\rm PPAD$-complete problem.
Submitted 2 December, 2014;
originally announced December 2014.
-
Tracing cross-free polylines oriented by a N-symmetry direction field on triangulated surfaces
Authors:
Nicolas Ray,
Dmitry Sokolov
Abstract:
We propose an algorithm for tracing polylines on a triangle mesh such that they are aligned with an N-symmetry direction field, and two such polylines cannot cross or merge. This property is fundamental for mesh segmentation and is very difficult to enforce with numerical integration of vector fields. We propose an alternative solution based on "stream-mesh", a new combinatorial data structure that defines, for each point of a triangle edge, where the corresponding polyline leaves the triangle. It makes it possible to trace polylines by iteratively crossing triangles. Vector field singularities and polyline/vertex crossing are characterized and consistently handled. The polylines inherit the cross-free property of the stream-mesh, except inside triangles where avoiding local overlaps would require higher order polycurves.
Submitted 4 June, 2013;
originally announced June 2013.
-
Visualizing 2D Flows with Animated Arrow Plots
Authors:
Bruno Jobard,
Nicolas Ray,
Dmitry Sokolov
Abstract:
Flow fields are often represented by a set of static arrows in popular science, documentary films, meteorology, etc. This simple schematic representation lets an observer intuitively interpret the main properties of a flow: its orientation and velocity magnitude. We propose to generate dynamic versions of such representations for 2D unsteady flow fields. Our algorithm smoothly animates arrows along the flow while controlling their density in the domain over time. Several strategies have been combined to reduce the unavoidable popping artifacts arising when arrows appear and disappear and to achieve visually pleasing animations. Disturbing arrow rotations in low velocity regions are also handled by continuously morphing arrow glyphs to semi-transparent discs. To substantiate our method, we provide results for synthetic and real velocity field datasets.
Submitted 23 May, 2012;
originally announced May 2012.