-
Interpolation with deep neural networks with non-polynomial activations: necessary and sufficient numbers of neurons
Authors:
Liam Madden
Abstract:
The minimal number of neurons required for a feedforward neural network to interpolate $n$ generic input-output pairs from $\mathbb{R}^d\times \mathbb{R}^{d'}$ is $Θ(\sqrt{nd'})$. While previous results have shown that $Θ(\sqrt{nd'})$ neurons are sufficient, they have been limited to sigmoid, Heaviside, and rectified linear unit (ReLU) as the activation function. Using a different approach, we pro…
▽ More
The minimal number of neurons required for a feedforward neural network to interpolate $n$ generic input-output pairs from $\mathbb{R}^d\times \mathbb{R}^{d'}$ is $Θ(\sqrt{nd'})$. While previous results have shown that $Θ(\sqrt{nd'})$ neurons are sufficient, they have been limited to sigmoid, Heaviside, and rectified linear unit (ReLU) as the activation function. Using a different approach, we prove that $Θ(\sqrt{nd'})$ neurons are sufficient as long as the activation function is real analytic at a point and not a polynomial there. Thus, the only practical activation functions that our result does not apply to are piecewise polynomials. Importantly, this means that activation functions can be freely chosen in a problem-dependent manner without loss of interpolation power.
△ Less
Submitted 16 September, 2024; v1 submitted 22 May, 2024;
originally announced May 2024.
-
Next-token prediction capacity: general upper bounds and a lower bound for transformers
Authors:
Liam Madden,
Curtis Fox,
Christos Thrampoulidis
Abstract:
Given a sequence of tokens, such as words, the task of next-token prediction is to predict the next-token conditional probability distribution. Decoder-only transformers have become effective models for this task, but their properties are still not fully understood. In particular, the largest number of distinct context sequences that a decoder-only transformer can interpolate next-token distributi…
▽ More
Given a sequence of tokens, such as words, the task of next-token prediction is to predict the next-token conditional probability distribution. Decoder-only transformers have become effective models for this task, but their properties are still not fully understood. In particular, the largest number of distinct context sequences that a decoder-only transformer can interpolate next-token distributions for has not been established. To fill this gap, we prove upper and lower bounds on this number, which are equal up to a multiplicative constant. We prove these bounds in the general setting where next-token distributions can be arbitrary as well as the empirical setting where they are calculated from a finite number of document sequences. Our lower bounds are for one-layer multi-head decoder-only transformers and our proofs highlight an important injectivity property satisfied by self-attention. Furthermore, we provide numerical evidence that the minimal number of parameters for memorization is sufficient for being able to train the model to the entropy lower bound.
△ Less
Submitted 16 September, 2024; v1 submitted 22 May, 2024;
originally announced May 2024.
-
Memory capacity of two layer neural networks with smooth activations
Authors:
Liam Madden,
Christos Thrampoulidis
Abstract:
Determining the memory capacity of two layer neural networks with $m$ hidden neurons and input dimension $d$ (i.e., $md+2m$ total trainable parameters), which refers to the largest size of general data the network can memorize, is a fundamental machine learning question. For activations that are real analytic at a point and, if restricting to a polynomial there, have sufficiently high degree, we e…
▽ More
Determining the memory capacity of two layer neural networks with $m$ hidden neurons and input dimension $d$ (i.e., $md+2m$ total trainable parameters), which refers to the largest size of general data the network can memorize, is a fundamental machine learning question. For activations that are real analytic at a point and, if restricting to a polynomial there, have sufficiently high degree, we establish a lower bound of $\lfloor md/2\rfloor$ and optimality up to a factor of approximately $2$. All practical activations, such as sigmoids, Heaviside, and the rectified linear unit (ReLU), are real analytic at a point. Furthermore, the degree condition is mild, requiring, for example, that $\binom{k+d-1}{d-1}\ge n$ if the activation is $x^k$. Analogous prior results were limited to Heaviside and ReLU activations -- our result covers almost everything else. In order to analyze general activations, we derive the precise generic rank of the network's Jacobian, which can be written in terms of Hadamard powers and the Khatri-Rao product. Our analysis extends classical linear algebraic facts about the rank of Hadamard powers. Overall, our approach differs from prior works on memory capacity and holds promise for extending to deeper models and other architectures.
△ Less
Submitted 1 May, 2024; v1 submitted 3 August, 2023;
originally announced August 2023.
-
Infrared Spectroscopic Survey of the Quiescent Medium of Nearby Clouds: II. Ice Formation and Grain Growth in Perseus and Serpens
Authors:
M. C. L. Madden,
A. C. A. Boogert,
J. E. Chiar,
C. Knez,
Y. J. Pendleton,
A. G. G. M. Tielens,
A. Yip
Abstract:
The properties of dust change during the transition from diffuse to dense clouds as a result of ice formation and dust coagulation, but much is still unclear about this transformation. We present 2-20 micron spectra of 49 field stars behind the Perseus and Serpens Molecular Clouds and establish relationships between the near-infrared continuum extinction (AK) and the depths of the 9.7 micron silic…
▽ More
The properties of dust change during the transition from diffuse to dense clouds as a result of ice formation and dust coagulation, but much is still unclear about this transformation. We present 2-20 micron spectra of 49 field stars behind the Perseus and Serpens Molecular Clouds and establish relationships between the near-infrared continuum extinction (AK) and the depths of the 9.7 micron silicate (tau97) and 3.0 micron H2O ice (tau30) absorption bands. The tau97/AK ratio varies from large, diffuse interstellar medium-like values (~0.55), to much lower ratios (~0.26). Above extinctions of AK~1.2 (AV~10; Perseus, Lupus, dense cores) and ~2.0 (AV~17; Serpens), the tau97/AK ratio is lowest. The tau97/AK reduction from diffuse to dense clouds is consistent with a moderate degree of grain growth (sizes up to ~0.5 micron), increasing the near-infrared color excess (and thus AK), but not affecting ice and silicate band profiles. This grain growth process seems to be related to the ice column densities and dense core formation thresholds, highlighting the importance of density. After correction for Serpens foreground extinction, the H2O ice formation threshold is in the range of AK=0.31-0.40 (AV=2.6-3.4) for all clouds, and thus grain growth takes place after the ices are formed. Finally, abundant CH3OH ice (~21% relative to H2O) is reported for 2MASSJ18285266+0028242 (Serpens), a factor of >4 larger than for the other targets.
△ Less
Submitted 23 October, 2022;
originally announced October 2022.
-
Sketching the Best Approximate Quantum Compiling Problem
Authors:
Liam Madden,
Albert Akhriev,
Andrea Simonetto
Abstract:
This paper considers the problem of quantum compilation from an optimization perspective by fixing a circuit structure of CNOTs and rotation gates then optimizing over the rotation angles. We solve the optimization problem classically and consider algorithmic tools to scale it to higher numbers of qubits. We investigate stochastic gradient descent and two sketch-and-solve algorithms. For all three…
▽ More
This paper considers the problem of quantum compilation from an optimization perspective by fixing a circuit structure of CNOTs and rotation gates then optimizing over the rotation angles. We solve the optimization problem classically and consider algorithmic tools to scale it to higher numbers of qubits. We investigate stochastic gradient descent and two sketch-and-solve algorithms. For all three algorithms, we compute the gradient efficiently using matrix-vector instead of matrix-matrix computations. Allowing for a runtime on the order of one hour, our implementation using either sketch-and-solve algorithm is able to compile 9 qubit, 27 CNOT circuits; 12 qubit, 24 CNOT circuits; and 15 qubit, 15 CNOT circuits. Without our algorithmic tools, standard optimization does not scale beyond 9 qubit, 9 CNOT circuits, and, beyond that, is theoretically dominated by barren plateaus.
△ Less
Submitted 9 May, 2022;
originally announced May 2022.
-
Online Stochastic Gradient Methods Under Sub-Weibull Noise and the Polyak-Łojasiewicz Condition
Authors:
Seunghyun Kim,
Liam Madden,
Emiliano Dall'Anese
Abstract:
This paper focuses on the online gradient and proximal-gradient methods with stochastic gradient errors. In particular, we examine the performance of the online gradient descent method when the cost satisfies the Polyak-Łojasiewicz (PL) inequality. We provide bounds in expectation and in high probability (that hold iteration-wise), with the latter derived by leveraging a sub-Weibull model for the…
▽ More
This paper focuses on the online gradient and proximal-gradient methods with stochastic gradient errors. In particular, we examine the performance of the online gradient descent method when the cost satisfies the Polyak-Łojasiewicz (PL) inequality. We provide bounds in expectation and in high probability (that hold iteration-wise), with the latter derived by leveraging a sub-Weibull model for the errors affecting the gradient. The convergence results show that the instantaneous regret converges linearly up to an error that depends on the variability of the problem and the statistics of the sub-Weibull gradient error. Similar convergence results are then provided for the online proximal-gradient method, under the assumption that the composite cost satisfies the proximal-PL condition. In the case of static costs, we provide new bounds for the regret incurred by these methods when the gradient errors are modeled as sub-Weibull random variables. Illustrative simulations are provided to corroborate the technical findings.
△ Less
Submitted 31 March, 2022; v1 submitted 6 August, 2021;
originally announced August 2021.
-
Best Approximate Quantum Compiling Problems
Authors:
Liam Madden,
Andrea Simonetto
Abstract:
We study the problem of finding the best approximate circuit that is the closest (in some pertinent metric) to a target circuit, and which satisfies a number of hardware constraints, like gate alphabet and connectivity. We look at the problem in the CNOT+rotation gate set from a mathematical programming standpoint, offering contributions both in terms of understanding the mathematics of the proble…
▽ More
We study the problem of finding the best approximate circuit that is the closest (in some pertinent metric) to a target circuit, and which satisfies a number of hardware constraints, like gate alphabet and connectivity. We look at the problem in the CNOT+rotation gate set from a mathematical programming standpoint, offering contributions both in terms of understanding the mathematics of the problem and its efficient solution. Among the results that we present, we are able to derive a 14-CNOT 4-qubit Toffoli decomposition from scratch, and show that the Quantum Shannon Decomposition can be compressed by a factor of two without practical loss of fidelity.
△ Less
Submitted 26 November, 2021; v1 submitted 10 June, 2021;
originally announced June 2021.
-
A Stochastic Operator Framework for Optimization and Learning with Sub-Weibull Errors
Authors:
Nicola Bastianello,
Liam Madden,
Ruggero Carli,
Emiliano Dall'Anese
Abstract:
This paper proposes a framework to study the convergence of stochastic optimization and learning algorithms. The framework is modeled over the different challenges that these algorithms pose, such as (i) the presence of random additive errors (e.g. due to stochastic gradients), and (ii) random coordinate updates (e.g. due to asynchrony in distributed set-ups). The paper covers both convex and stro…
▽ More
This paper proposes a framework to study the convergence of stochastic optimization and learning algorithms. The framework is modeled over the different challenges that these algorithms pose, such as (i) the presence of random additive errors (e.g. due to stochastic gradients), and (ii) random coordinate updates (e.g. due to asynchrony in distributed set-ups). The paper covers both convex and strongly convex problems, and it also analyzes online scenarios, involving changes in the data and costs. The paper relies on interpreting stochastic algorithms as the iterated application of stochastic operators, thus allowing us to use the powerful tools of operator theory. In particular, we consider operators characterized by additive errors with sub-Weibull distribution (which parameterize a broad class of errors by their tail probability), and random updates. In this framework we derive convergence results in mean and in high probability, by providing bounds to the distance of the current iteration from a solution of the optimization or learning problem. The contributions are discussed in light of federated learning applications.
△ Less
Submitted 24 June, 2024; v1 submitted 20 May, 2021;
originally announced May 2021.
-
High Probability Convergence Bounds for Non-convex Stochastic Gradient Descent with Sub-Weibull Noise
Authors:
Liam Madden,
Emiliano Dall'Anese,
Stephen Becker
Abstract:
Stochastic gradient descent is one of the most common iterative algorithms used in machine learning and its convergence analysis is a rich area of research. Understanding its convergence properties can help inform what modifications of it to use in different settings. However, most theoretical results either assume convexity or only provide convergence results in mean. This paper, on the other han…
▽ More
Stochastic gradient descent is one of the most common iterative algorithms used in machine learning and its convergence analysis is a rich area of research. Understanding its convergence properties can help inform what modifications of it to use in different settings. However, most theoretical results either assume convexity or only provide convergence results in mean. This paper, on the other hand, proves convergence bounds in high probability without assuming convexity. Assuming strong smoothness, we prove high probability convergence bounds in two settings: (1) assuming the Polyak-Łojasiewicz inequality and norm sub-Gaussian gradient noise and (2) assuming norm sub-Weibull gradient noise. In the second setting, as an intermediate step to proving convergence, we prove a sub-Weibull martingale difference sequence self-normalized concentration inequality of independent interest. It extends Freedman-type concentration beyond the sub-exponential threshold to heavier-tailed martingale difference sequences. We also provide a post-processing method that picks a single iterate with a provable convergence guarantee as opposed to the usual bound for the unknown best iterate. Our convergence result for sub-Weibull noise extends the regime where stochastic gradient descent has equal or better convergence guarantees than stochastic gradient descent with modifications such as clipping, momentum, and normalization.
△ Less
Submitted 14 July, 2024; v1 submitted 9 June, 2020;
originally announced June 2020.
-
Bounds for the tracking error of first-order online optimization methods
Authors:
Liam Madden,
Stephen Becker,
Emiliano Dall'Anese
Abstract:
This paper investigates online algorithms for smooth time-varying optimization problems, focusing first on methods with constant step-size, momentum, and extrapolation-length. Assuming strong convexity, precise results for the tracking iterate error (the limit supremum of the norm of the difference between the optimal solution and the iterates) for online gradient descent are derived. The paper th…
▽ More
This paper investigates online algorithms for smooth time-varying optimization problems, focusing first on methods with constant step-size, momentum, and extrapolation-length. Assuming strong convexity, precise results for the tracking iterate error (the limit supremum of the norm of the difference between the optimal solution and the iterates) for online gradient descent are derived. The paper then considers a general first-order framework, where a universal lower bound on the tracking iterate error is established. Furthermore, a method using "long-steps" is proposed and shown to achieve the lower bound up to a fixed constant. This method is then compared with online gradient descent for specific examples. Finally, the paper analyzes the effect of regularization when the cost is not strongly convex. With regularization, it is possible to achieve a non-regret bound. The paper ends by testing the accelerated and regularized methods on synthetic time-varying least-squares and logistic regression problems, respectively.
△ Less
Submitted 6 October, 2020; v1 submitted 4 March, 2020;
originally announced March 2020.
-
Optimization and Learning with Information Streams: Time-varying Algorithms and Applications
Authors:
Emiliano Dall'Anese,
Andrea Simonetto,
Stephen Becker,
Liam Madden
Abstract:
There is a growing cross-disciplinary effort in the broad domain of optimization and learning with streams of data, applied to settings where traditional batch optimization techniques cannot produce solutions at time scales that match the inter-arrival times of the data points due to computational and/or communication bottlenecks. Special types of online algorithms can handle this situation, and t…
▽ More
There is a growing cross-disciplinary effort in the broad domain of optimization and learning with streams of data, applied to settings where traditional batch optimization techniques cannot produce solutions at time scales that match the inter-arrival times of the data points due to computational and/or communication bottlenecks. Special types of online algorithms can handle this situation, and this article focuses on such time-varying optimization algorithms, with emphasis on Machine Leaning and Signal Processing, as well as data-driven Control. Approaches for the design of time-varying or online first-order optimization methods are discussed, with emphasis on algorithms that can handle errors in the gradient, as may arise when the gradient is estimated. Insights on performance metrics and accompanying claims are provided, along with evidence of cases where algorithms that are provably convergent in batch optimization may perform poorly in an online regime. The role of distributed computation is discussed. Illustrative numerical examples for a number of applications of broad interest are provided to convey key ideas.
△ Less
Submitted 16 January, 2020; v1 submitted 17 October, 2019;
originally announced October 2019.
-
Online Sparse Subspace Clustering
Authors:
Liam Madden,
Stephen Becker,
Emiliano Dall'Anese
Abstract:
This paper focuses on the sparse subspace clustering problem, and develops an online algorithmic solution to cluster data points on-the-fly, without revisiting the whole dataset. The strategy involves an online solution of a sparse representation (SR) problem to build a (sparse) dictionary of similarities where points in the same subspace are considered "similar," followed by a spectral clustering…
▽ More
This paper focuses on the sparse subspace clustering problem, and develops an online algorithmic solution to cluster data points on-the-fly, without revisiting the whole dataset. The strategy involves an online solution of a sparse representation (SR) problem to build a (sparse) dictionary of similarities where points in the same subspace are considered "similar," followed by a spectral clustering based on the obtained similarity matrix. When the SR cost is strongly convex, the online solution converges to within a neighborhood of the optimal time-varying batch solution. A dynamic regret analysis is performed when the SR cost is not strongly convex.
△ Less
Submitted 3 June, 2019; v1 submitted 27 February, 2019;
originally announced February 2019.