subscribe to arXiv mailings

Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

Authors: Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin

Abstract: Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on th… ▽ More Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer $\texttt{No}$ over $\texttt{Never}$ can sharply increase the probability of $\texttt{Yes}$. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable. △ Less

Submitted 13 October, 2024; v1 submitted 11 October, 2024; originally announced October 2024.

Comments: Code available at https://github.com/princeton-nlp/unintentional-unalignment

arXiv:2407.16831 [pdf, ps, other]

Networks of Networks: Complexity Class Principles Applied to Compound AI Systems Design

Authors: Jared Quincy Davis, Boris Hanin, Lingjiao Chen, Peter Bailis, Ion Stoica, Matei Zaharia

Abstract: As practitioners seek to surpass the current reliability and quality frontier of monolithic models, Compound AI Systems consisting of many language model inference calls are increasingly employed. In this work, we construct systems, which we call Networks of Networks (NoNs) organized around the distinction between generating a proposed answer and verifying its correctness, a fundamental concept in… ▽ More As practitioners seek to surpass the current reliability and quality frontier of monolithic models, Compound AI Systems consisting of many language model inference calls are increasingly employed. In this work, we construct systems, which we call Networks of Networks (NoNs) organized around the distinction between generating a proposed answer and verifying its correctness, a fundamental concept in complexity theory that we show empirically extends to Language Models (LMs). We introduce a verifier-based judge NoN with K generators, an instantiation of "best-of-K" or "judge-based" compound AI systems. Through experiments on synthetic tasks such as prime factorization, and core benchmarks such as the MMLU, we demonstrate notable performance gains. For instance, in factoring products of two 3-digit primes, a simple NoN improves accuracy from 3.7\% to 36.6\%. On MMLU, a verifier-based judge construction with only 3 generators boosts accuracy over individual GPT-4-Turbo calls by 2.8\%. Our analysis reveals that these gains are most pronounced in domains where verification is notably easier than generation--a characterization which we believe subsumes many reasoning and procedural knowledge tasks, but doesn't often hold for factual and declarative knowledge-based settings. For mathematical and formal logic reasoning-based subjects of MMLU, we observe a 5-8\% or higher gain, whilst no gain on others such as geography and religion. We provide key takeaways for ML practitioners, including the importance of considering verification complexity, the impact of witness format on verifiability, and a simple test to determine the potential benefit of this NoN approach for a given problem distribution. This work aims to inform future research and practice in the design of compound AI systems. △ Less

Submitted 23 July, 2024; originally announced July 2024.

arXiv:2405.16630 [pdf, other]

Bayesian Inference with Deep Weakly Nonlinear Networks

Authors: Boris Hanin, Alexander Zlokapa

Abstract: We show at a physics level of rigor that Bayesian inference with a fully connected neural network and a shaped nonlinearity of the form $φ(t) = t + ψt^3/L$ is (perturbatively) solvable in the regime where the number of training datapoints $P$ , the input dimension $N_0$, the network layer widths $N$, and the network depth $L$ are simultaneously large. Our results hold with weak assumptions on the… ▽ More We show at a physics level of rigor that Bayesian inference with a fully connected neural network and a shaped nonlinearity of the form $φ(t) = t + ψt^3/L$ is (perturbatively) solvable in the regime where the number of training datapoints $P$ , the input dimension $N_0$, the network layer widths $N$, and the network depth $L$ are simultaneously large. Our results hold with weak assumptions on the data; the main constraint is that $P < N_0$. We provide techniques to compute the model evidence and posterior to arbitrary order in $1/N$ and at arbitrary temperature. We report the following results from the first-order computation: 1. When the width $N$ is much larger than the depth $L$ and training set size $P$, neural network Bayesian inference coincides with Bayesian inference using a kernel. The value of $ψ$ determines the curvature of a sphere, hyperbola, or plane into which the training data is implicitly embedded under the feature map. 2. When $LP/N$ is a small constant, neural network Bayesian inference departs from the kernel regime. At zero temperature, neural network Bayesian inference is equivalent to Bayesian inference using a data-dependent kernel, and $LP/N$ serves as an effective depth that controls the extent of feature learning. 3. In the restricted case of deep linear networks ($ψ=0$) and noisy data, we show a simple data model for which evidence and generalization error are optimal at zero temperature. As $LP/N$ increases, both evidence and generalization further improve, demonstrating the benefit of depth in benign overfitting. △ Less

Submitted 26 May, 2024; originally announced May 2024.

arXiv:2403.02419 [pdf, other]

Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems

Authors: Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei Zaharia, James Zou

Abstract: Many recent state-of-the-art results in language tasks were achieved using compound systems that perform multiple Language Model (LM) calls and aggregate their responses. However, there is little understanding of how the number of LM calls - e.g., when asking the LM to answer each question multiple times and taking a majority vote - affects such a compound system's performance. In this paper, we i… ▽ More Many recent state-of-the-art results in language tasks were achieved using compound systems that perform multiple Language Model (LM) calls and aggregate their responses. However, there is little understanding of how the number of LM calls - e.g., when asking the LM to answer each question multiple times and taking a majority vote - affects such a compound system's performance. In this paper, we initiate the study of scaling properties of compound inference systems. We analyze, theoretically and empirically, how the number of LM calls affects the performance of Vote and Filter-Vote, two of the simplest compound system designs, which aggregate LM responses via majority voting, optionally applying LM filters. We find, surprisingly, that across multiple language tasks, the performance of both Vote and Filter-Vote can first increase but then decrease as a function of the number of LM calls. Our theoretical results suggest that this non-monotonicity is due to the diversity of query difficulties within a task: more LM calls lead to higher performance on "easy" queries, but lower performance on "hard" queries, and non-monotone behavior can emerge when a task contains both types of queries. This insight then allows us to compute, from a small number of samples, the number of LM calls that maximizes system performance, and define an analytical scaling model for both systems. Experiments show that our scaling model can accurately predict the performance of Vote and Filter-Vote systems and thus find the optimal number of LM calls to make. △ Less

Submitted 4 June, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

arXiv:2402.17440 [pdf, other]

Principled Architecture-aware Scaling of Hyperparameters

Authors: Wuyang Chen, Junru Wu, Zhangyang Wang, Boris Hanin

Abstract: Training a high-quality deep neural network requires choosing suitable hyperparameters, which is a non-trivial and expensive process. Current works try to automatically optimize or design principles of hyperparameters, such that they can generalize to diverse unseen scenarios. However, most designs or optimization methods are agnostic to the choice of network structures, and thus largely ignore th… ▽ More Training a high-quality deep neural network requires choosing suitable hyperparameters, which is a non-trivial and expensive process. Current works try to automatically optimize or design principles of hyperparameters, such that they can generalize to diverse unseen scenarios. However, most designs or optimization methods are agnostic to the choice of network structures, and thus largely ignore the impact of neural architectures on hyperparameters. In this work, we precisely characterize the dependence of initializations and maximal learning rates on the network architecture, which includes the network depth, width, convolutional kernel size, and connectivity patterns. By pursuing every parameter to be maximally updated with the same mean squared change in pre-activations, we can generalize our initialization and learning rates across MLPs (multi-layer perception) and CNNs (convolutional neural network) with sophisticated graph topologies. We verify our principles with comprehensive experiments. More importantly, our strategy further sheds light on advancing current benchmarks for architecture design. A fair comparison of AutoML algorithms requires accurate network rankings. However, we demonstrate that network rankings can be easily changed by better training networks in benchmarks with our architecture-aware learning rates and initialization. △ Less

Submitted 27 February, 2024; originally announced February 2024.

arXiv:2309.16620 [pdf, other]

Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit

Authors: Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, Cengiz Pehlevan

Abstract: The cost of hyperparameter tuning in deep learning has been rising with model sizes, prompting practitioners to find new tuning methods using a proxy of smaller networks. One such proposal uses $μ$P parameterized networks, where the optimal hyperparameters for small width networks transfer to networks with arbitrarily large width. However, in this scheme, hyperparameters do not transfer across dep… ▽ More The cost of hyperparameter tuning in deep learning has been rising with model sizes, prompting practitioners to find new tuning methods using a proxy of smaller networks. One such proposal uses $μ$P parameterized networks, where the optimal hyperparameters for small width networks transfer to networks with arbitrarily large width. However, in this scheme, hyperparameters do not transfer across depths. As a remedy, we study residual networks with a residual branch scale of $1/\sqrt{\text{depth}}$ in combination with the $μ$P parameterization. We provide experiments demonstrating that residual architectures including convolutional ResNets and Vision Transformers trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR-10 and ImageNet. Furthermore, our empirical findings are supported and motivated by theory. Using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a well-defined feature learning joint infinite-width and infinite-depth limit and show convergence of finite-size network dynamics towards this limit. △ Less

Submitted 8 December, 2023; v1 submitted 28 September, 2023; originally announced September 2023.

arXiv:2309.01592 [pdf, other]

Les Houches Lectures on Deep Learning at Large & Infinite Width

Authors: Yasaman Bahri, Boris Hanin, Antonin Brossollet, Vittorio Erba, Christian Keup, Rosalba Pacelli, James B. Simon

Abstract: These lectures, presented at the 2022 Les Houches Summer School on Statistical Physics and Machine Learning, focus on the infinite-width limit and large-width regime of deep neural networks. Topics covered include various statistical and dynamical properties of these networks. In particular, the lecturers discuss properties of random deep neural networks; connections between trained deep neural ne… ▽ More These lectures, presented at the 2022 Les Houches Summer School on Statistical Physics and Machine Learning, focus on the infinite-width limit and large-width regime of deep neural networks. Topics covered include various statistical and dynamical properties of these networks. In particular, the lecturers discuss properties of random deep neural networks; connections between trained deep neural networks, linear models, kernels, and Gaussian processes that arise in the infinite-width limit; and perturbative and non-perturbative treatments of large but finite-width networks, at initialization and after training. △ Less

Submitted 12 February, 2024; v1 submitted 4 September, 2023; originally announced September 2023.

Comments: These are notes from lectures delivered by Yasaman Bahri and Boris Hanin at the 2022 Les Houches Summer School on Statistics Physics and Machine Learning and a first version of them were transcribed by Antonin Brossollet, Vittorio Erba, Christian Keup, Rosalba Pacelli, James B. Simon

arXiv:2307.06092 [pdf, ps, other]

Quantitative CLTs in Deep Neural Networks

Authors: Stefano Favaro, Boris Hanin, Domenico Marinucci, Ivan Nourdin, Giovanni Peccati

Abstract: We study the distribution of a fully connected neural network with random Gaussian weights and biases in which the hidden layer widths are proportional to a large constant $n$. Under mild assumptions on the non-linearity, we obtain quantitative bounds on normal approximations valid at large but finite $n$ and any fixed network depth. Our theorems show both for the finite-dimensional distributions… ▽ More We study the distribution of a fully connected neural network with random Gaussian weights and biases in which the hidden layer widths are proportional to a large constant $n$. Under mild assumptions on the non-linearity, we obtain quantitative bounds on normal approximations valid at large but finite $n$ and any fixed network depth. Our theorems show both for the finite-dimensional distributions and the entire process, that the distance between a random fully connected network (and its derivatives) to the corresponding infinite width Gaussian process scales like $n^{-γ}$ for $γ>0$, with the exponent depending on the metric used to measure discrepancy. Our bounds are strictly stronger in terms of their dependence on network width than any previously available in the literature; in the one-dimensional case, we also prove that they are optimal, i.e., we establish matching lower bounds. △ Less

Submitted 17 June, 2024; v1 submitted 12 July, 2023; originally announced July 2023.

arXiv:2306.11668 [pdf, other]

Principles for Initialization and Architecture Selection in Graph Neural Networks with ReLU Activations

Authors: Gage DeZoort, Boris Hanin

Abstract: This article derives and validates three principles for initialization and architecture selection in finite width graph neural networks (GNNs) with ReLU activations. First, we theoretically derive what is essentially the unique generalization to ReLU GNNs of the well-known He-initialization. Our initialization scheme guarantees that the average scale of network outputs and gradients remains order… ▽ More This article derives and validates three principles for initialization and architecture selection in finite width graph neural networks (GNNs) with ReLU activations. First, we theoretically derive what is essentially the unique generalization to ReLU GNNs of the well-known He-initialization. Our initialization scheme guarantees that the average scale of network outputs and gradients remains order one at initialization. Second, we prove in finite width vanilla ReLU GNNs that oversmoothing is unavoidable at large depth when using fixed aggregation operator, regardless of initialization. We then prove that using residual aggregation operators, obtained by interpolating a fixed aggregation operator with the identity, provably alleviates oversmoothing at initialization. Finally, we show that the common practice of using residual connections with a fixup-type initialization provably avoids correlation collapse in final layer features at initialization. Through ablation studies we find that using the correct initialization, residual aggregation operators, and residual connections in the forward pass significantly and reliably speeds up early training dynamics in deep ReLU GNNs on a variety of tasks. △ Less

Submitted 20 June, 2023; originally announced June 2023.

Comments: Comments appreciated

arXiv:2305.07810 [pdf, ps, other]

Depth Dependence of $μ$P Learning Rates in ReLU MLPs

Authors: Samy Jelassi, Boris Hanin, Ziwei Ji, Sashank J. Reddi, Srinadh Bhojanapalli, Sanjiv Kumar

Abstract: In this short note we consider random fully connected ReLU networks of width $n$ and depth $L$ equipped with a mean-field weight initialization. Our purpose is to study the dependence on $n$ and $L$ of the maximal update ($μ$P) learning rate, the largest learning rate for which the mean squared change in pre-activations after one step of gradient descent remains uniformly bounded at large $n,L$. A… ▽ More In this short note we consider random fully connected ReLU networks of width $n$ and depth $L$ equipped with a mean-field weight initialization. Our purpose is to study the dependence on $n$ and $L$ of the maximal update ($μ$P) learning rate, the largest learning rate for which the mean squared change in pre-activations after one step of gradient descent remains uniformly bounded at large $n,L$. As in prior work on $μ$P of Yang et. al., we find that this maximal update learning rate is independent of $n$ for all but the first and last layer weights. However, we find that it has a non-trivial dependence of $L$, scaling like $L^{-3/2}.$ △ Less

Submitted 12 May, 2023; originally announced May 2023.

arXiv:2212.14457 [pdf, other]

Bayesian Interpolation with Deep Linear Networks

Authors: Boris Hanin, Alexander Zlokapa

Abstract: Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory. We give here a complete solution in the special case of linear networks with output dimension one trained using zero noise Bayesian inference with Gaussian weight priors and mean squared error as a negative log-likelihood. For any training dataset, network dep… ▽ More Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory. We give here a complete solution in the special case of linear networks with output dimension one trained using zero noise Bayesian inference with Gaussian weight priors and mean squared error as a negative log-likelihood. For any training dataset, network depth, and hidden layer widths, we find non-asymptotic expressions for the predictive posterior and Bayesian model evidence in terms of Meijer-G functions, a class of meromorphic special functions of a single complex variable. Through novel asymptotic expansions of these Meijer-G functions, a rich new picture of the joint role of depth, width, and dataset size emerges. We show that linear networks make provably optimal predictions at infinite depth: the posterior of infinitely deep linear networks with data-agnostic priors is the same as that of shallow networks with evidence-maximizing data-dependent priors. This yields a principled reason to prefer deeper networks when priors are forced to be data-agnostic. Moreover, we show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth, elucidating the salutary role of increased depth for model selection. Underpinning our results is a novel emergent notion of effective depth, given by the number of hidden layers times the number of data points divided by the network width; this determines the structure of the posterior in the large-data limit. △ Less

Submitted 14 May, 2023; v1 submitted 29 December, 2022; originally announced December 2022.

arXiv:2212.07295 [pdf, other]

Maximal Initial Learning Rates in Deep ReLU Networks

Authors: Gaurav Iyer, Boris Hanin, David Rolnick

Abstract: Training a neural network requires choosing a suitable learning rate, which involves a trade-off between speed and effectiveness of convergence. While there has been considerable theoretical and empirical analysis of how large the learning rate can be, most prior work focuses only on late-stage training. In this work, we introduce the maximal initial learning rate $η^{\ast}$ - the largest learning… ▽ More Training a neural network requires choosing a suitable learning rate, which involves a trade-off between speed and effectiveness of convergence. While there has been considerable theoretical and empirical analysis of how large the learning rate can be, most prior work focuses only on late-stage training. In this work, we introduce the maximal initial learning rate $η^{\ast}$ - the largest learning rate at which a randomly initialized neural network can successfully begin training and achieve (at least) a given threshold accuracy. Using a simple approach to estimate $η^{\ast}$, we observe that in constant-width fully-connected ReLU networks, $η^{\ast}$ behaves differently from the maximum learning rate later in training. Specifically, we find that $η^{\ast}$ is well predicted as a power of depth $\times$ width, provided that (i) the width of the network is sufficiently large compared to the depth, and (ii) the input layer is trained at a relatively small learning rate. We further analyze the relationship between $η^{\ast}$ and the sharpness $λ_{1}$ of the network at initialization, indicating they are closely though not inversely related. We formally prove bounds for $λ_{1}$ in terms of depth $\times$ width that align with our empirical results. △ Less

Submitted 25 May, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

Comments: International Conference on Machine Learning (ICML) 2023

arXiv:2207.13571 [pdf, ps, other]

doi 10.1088/1751-8121/ac91b4

Scaling asymptotics of spectral Wigner functions

Authors: Boris Hanin, Steve Zelditch

Abstract: We prove that smooth Wigner-Weyl spectral sums at an energy level $E$ exhibit Airy scaling asymptotics across the classical energy surface $Σ_E$. This was proved earlier by the authors for the isotropic harmonic oscillator and the proof is extended in this article to all quantum Hamiltonians $-\hbar^2 Δ+ V$ where $V$ is a confining potential with at most quadratic growth at infinity. The main tool… ▽ More We prove that smooth Wigner-Weyl spectral sums at an energy level $E$ exhibit Airy scaling asymptotics across the classical energy surface $Σ_E$. This was proved earlier by the authors for the isotropic harmonic oscillator and the proof is extended in this article to all quantum Hamiltonians $-\hbar^2 Δ+ V$ where $V$ is a confining potential with at most quadratic growth at infinity. The main tools are the Herman-Kluk initial value parametrix for the propagator and the Chester-Friedman-Ursell normal form for complex phases with a one-dimensional cubic degeneracy. This gives a rigorous account of Airy scaling asymptotics of spectral Wigner distributions of M.V. Berry, A. Ozorio de Almeida and other physicists. △ Less

Submitted 28 July, 2022; v1 submitted 27 July, 2022; originally announced July 2022.

arXiv:2205.05662 [pdf, other]

Deep Architecture Connectivity Matters for Its Convergence: A Fine-Grained Analysis

Authors: Wuyang Chen, Wei Huang, Xinyu Gong, Boris Hanin, Zhangyang Wang

Abstract: Advanced deep neural networks (DNNs), designed by either human or AutoML algorithms, are growing increasingly complex. Diverse operations are connected by complicated connectivity patterns, e.g., various types of skip connections. Those topological compositions are empirically effective and observed to smooth the loss landscape and facilitate the gradient flow in general. However, it remains elusi… ▽ More Advanced deep neural networks (DNNs), designed by either human or AutoML algorithms, are growing increasingly complex. Diverse operations are connected by complicated connectivity patterns, e.g., various types of skip connections. Those topological compositions are empirically effective and observed to smooth the loss landscape and facilitate the gradient flow in general. However, it remains elusive to derive any principled understanding of their effects on the DNN capacity or trainability, and to understand why or in which aspect one specific connectivity pattern is better than another. In this work, we theoretically characterize the impact of connectivity patterns on the convergence of DNNs under gradient descent training in fine granularity. By analyzing a wide network's Neural Network Gaussian Process (NNGP), we are able to depict how the spectrum of an NNGP kernel propagates through a particular connectivity pattern, and how that affects the bound of convergence rates. As one practical implication of our results, we show that by a simple filtration on "unpromising" connectivity patterns, we can trim down the number of models to evaluate, and significantly accelerate the large-scale neural architecture search without any overhead. Code is available at: https://github.com/VITA-Group/architecture_convergence. △ Less

Submitted 12 October, 2022; v1 submitted 11 May, 2022; originally announced May 2022.

Comments: Neurips 2022 accepted

arXiv:2204.01058 [pdf, ps, other]

Random Fully Connected Neural Networks as Perturbatively Solvable Hierarchies

Authors: Boris Hanin

Abstract: This article considers fully connected neural networks with Gaussian random weights and biases as well as $L$ hidden layers, each of width proportional to a large parameter $n$. For polynomially bounded non-linearities we give sharp estimates in powers of $1/n$ for the joint cumulants of the network output and its derivatives. Moreover, we show that network cumulants form a perturbatively solvable… ▽ More This article considers fully connected neural networks with Gaussian random weights and biases as well as $L$ hidden layers, each of width proportional to a large parameter $n$. For polynomially bounded non-linearities we give sharp estimates in powers of $1/n$ for the joint cumulants of the network output and its derivatives. Moreover, we show that network cumulants form a perturbatively solvable hierarchy in powers of $1/n$ in that $k$-th order cumulants in one layer have recursions that depend to leading order in $1/n$ only on $j$-th order cumulants at the previous layer with $j\leq k$. By solving a variety of such recursions, however, we find that the depth-to-width ratio $L/n$ plays the role of an effective network depth, controlling both the scale of fluctuations at individual neurons and the size of inter-neuron correlations. Thus, while the cumulant recursions we derive form a hierarchy in powers of $1/n$, contributions of order $1/n^k$ often grow like $L^k$ and are hence non-negligible at positive $L/n$. We use this to study a somewhat simplified version of the exploding and vanishing gradient problem, proving that this particular variant occurs if and only if $L/n$ is large. Several key ideas in this article were first developed at a physics level of rigor in a recent monograph of Daniel A. Roberts, Sho Yaida, and the author. This article not only makes these ideas mathematically precise but also significantly extends them, opening the way to obtaining corrections to all orders in $1/n$. △ Less

Submitted 15 January, 2023; v1 submitted 3 April, 2022; originally announced April 2022.

Comments: 86p

arXiv:2109.12960 [pdf, other]

Ridgeless Interpolation with Shallow ReLU Networks in $1D$ is Nearest Neighbor Curvature Extrapolation and Provably Generalizes on Lipschitz Functions

Authors: Boris Hanin

Abstract: We prove a precise geometric description of all one layer ReLU networks $z(x;θ)$ with a single linear unit and input/output dimensions equal to one that interpolate a given dataset $\mathcal D=\{(x_i,f(x_i))\}$ and, among all such interpolants, minimize the $\ell_2$-norm of the neuron weights. Such networks can intuitively be thought of as those that minimize the mean-squared error over… ▽ More We prove a precise geometric description of all one layer ReLU networks $z(x;θ)$ with a single linear unit and input/output dimensions equal to one that interpolate a given dataset $\mathcal D=\{(x_i,f(x_i))\}$ and, among all such interpolants, minimize the $\ell_2$-norm of the neuron weights. Such networks can intuitively be thought of as those that minimize the mean-squared error over $\mathcal D$ plus an infinitesimal weight decay penalty. We therefore refer to them as ridgeless ReLU interpolants. Our description proves that, to extrapolate values $z(x;θ)$ for inputs $x\in (x_i,x_{i+1})$ lying between two consecutive datapoints, a ridgeless ReLU interpolant simply compares the signs of the discrete estimates for the curvature of $f$ at $x_i$ and $x_{i+1}$ derived from the dataset $\mathcal D$. If the curvature estimates at $x_i$ and $x_{i+1}$ have different signs, then $z(x;θ)$ must be linear on $(x_i,x_{i+1})$. If in contrast the curvature estimates at $x_i$ and $x_{i+1}$ are both positive (resp. negative), then $z(x;θ)$ is convex (resp. concave) on $(x_i,x_{i+1})$. Our results show that ridgeless ReLU interpolants achieve the best possible generalization for learning $1d$ Lipschitz functions, up to universal constants. △ Less

Submitted 27 September, 2021; originally announced September 2021.

arXiv:2107.01562 [pdf, ps, other]

Random Neural Networks in the Infinite Width Limit as Gaussian Processes

Authors: Boris Hanin

Abstract: This article gives a new proof that fully connected neural networks with random weights and biases converge to Gaussian processes in the regime where the input dimension, output dimension, and depth are kept fixed, while the hidden layer widths tend to infinity. Unlike prior work, convergence is shown assuming only moment conditions for the distribution of weights and for quite general non-lineari… ▽ More This article gives a new proof that fully connected neural networks with random weights and biases converge to Gaussian processes in the regime where the input dimension, output dimension, and depth are kept fixed, while the hidden layer widths tend to infinity. Unlike prior work, convergence is shown assuming only moment conditions for the distribution of weights and for quite general non-linearities. △ Less

Submitted 4 July, 2021; originally announced July 2021.

Comments: 26p

arXiv:2106.10165 [pdf, other]

doi 10.1017/9781009023405

The Principles of Deep Learning Theory

Authors: Daniel A. Roberts, Sho Yaida, Boris Hanin

Abstract: This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are… ▽ More This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are described by nearly-Gaussian distributions, with the depth-to-width aspect ratio of the network controlling the deviations from the infinite-width Gaussian description. We explain how these effectively-deep networks learn nontrivial representations from training and more broadly analyze the mechanism of representation learning for nonlinear models. From a nearly-kernel-methods perspective, we find that the dependence of such models' predictions on the underlying learning algorithm can be expressed in a simple and universal way. To obtain these results, we develop the notion of representation group flow (RG flow) to characterize the propagation of signals through the network. By tuning networks to criticality, we give a practical solution to the exploding and vanishing gradient problem. We further explain how RG flow leads to near-universal behavior and lets us categorize networks built from different activation functions into universality classes. Altogether, we show that the depth-to-width ratio governs the effective model complexity of the ensemble of trained networks. By using information-theoretic techniques, we estimate the optimal aspect ratio at which we expect the network to be practically most useful and show how residual connections can be used to push this scale to arbitrary depths. With these tools, we can learn in detail about the inductive bias of architectures, hyperparameters, and optimizers. △ Less

Submitted 24 August, 2021; v1 submitted 18 June, 2021; originally announced June 2021.

Comments: 471 pages, to be published by Cambridge University Press; v2: hyperlinks fixed, index added

Report number: MIT-CTP/5306

Journal ref: Cambridge University Press (2022)

arXiv:2102.10492 [pdf, other]

Deep ReLU Networks Preserve Expected Length

Authors: Boris Hanin, Ryan Jeong, David Rolnick

Abstract: Assessing the complexity of functions computed by a neural network helps us understand how the network will learn and generalize. One natural measure of complexity is how the network distorts length - if the network takes a unit-length curve as input, what is the length of the resulting curve of outputs? It has been widely believed that this length grows exponentially in network depth. We prove th… ▽ More Assessing the complexity of functions computed by a neural network helps us understand how the network will learn and generalize. One natural measure of complexity is how the network distorts length - if the network takes a unit-length curve as input, what is the length of the resulting curve of outputs? It has been widely believed that this length grows exponentially in network depth. We prove that in fact this is not the case: the expected length distortion does not grow with depth, and indeed shrinks slightly, for ReLU networks with standard random initialization. We also generalize this result by proving upper bounds both for higher moments of the length distortion and for the distortion of higher-dimensional volumes. These theoretical results are corroborated by our experiments. △ Less

Submitted 22 June, 2021; v1 submitted 20 February, 2021; originally announced February 2021.

Comments: 18 pages, 4 figures

arXiv:2012.14501 [pdf, other]

Neural Network Approximation

Authors: Ronald DeVore, Boris Hanin, Guergana Petrova

Abstract: Neural Networks (NNs) are the method of choice for building learning algorithms. Their popularity stems from their empirical success on several challenging learning problems. However, most scholars agree that a convincing theoretical explanation for this success is still lacking. This article surveys the known approximation properties of the outputs of NNs with the aim of uncovering the properti… ▽ More Neural Networks (NNs) are the method of choice for building learning algorithms. Their popularity stems from their empirical success on several challenging learning problems. However, most scholars agree that a convincing theoretical explanation for this success is still lacking. This article surveys the known approximation properties of the outputs of NNs with the aim of uncovering the properties that are not present in the more traditional methods of approximation used in numerical analysis. Comparisons are made with traditional approximation methods from the viewpoint of rate distortion. Another major component in the analysis of numerical approximation is the computational time needed to construct the approximation and this in turn is intimately connected with the stability of the approximation algorithm. So the stability of numerical approximation using NNs is a large part of the analysis put forward. The survey, for the most part, is concerned with NNs using the popular ReLU activation function. In this case, the outputs of the NNs are piecewise linear functions on rather complicated partitions of the domain of $f$ into cells that are convex polytopes. When the architecture of the NN is fixed and the parameters are allowed to vary, the set of output functions of the NN is a parameterized nonlinear manifold. It is shown that this manifold has certain space filling properties leading to an increased ability to approximate (better rate distortion) but at the expense of numerical stability. The space filling creates a challenge to the numerical method in finding best or good parameter choices when trying to approximate. △ Less

Submitted 28 December, 2020; originally announced December 2020.

arXiv:2010.11171 [pdf, other]

How Data Augmentation affects Optimization for Linear Regression

Authors: Boris Hanin, Yi Sun

Abstract: Though data augmentation has rapidly emerged as a key tool for optimization in modern machine learning, a clear picture of how augmentation schedules affect optimization and interact with optimization hyperparameters such as learning rate is nascent. In the spirit of classical convex optimization and recent work on implicit bias, the present work analyzes the effect of augmentation on optimization… ▽ More Though data augmentation has rapidly emerged as a key tool for optimization in modern machine learning, a clear picture of how augmentation schedules affect optimization and interact with optimization hyperparameters such as learning rate is nascent. In the spirit of classical convex optimization and recent work on implicit bias, the present work analyzes the effect of augmentation on optimization in the simple convex setting of linear regression with MSE loss. We find joint schedules for learning rate and data augmentation scheme under which augmented gradient descent provably converges and characterize the resulting minimum. Our results apply to arbitrary augmentation schemes, revealing complex interactions between learning rates and augmentations even in the convex setting. Our approach interprets augmented (S)GD as a stochastic optimization method for a time-varying sequence of proxy losses. This gives a unified way to analyze learning rate, batch size, and augmentations ranging from additive noise to random projections. From this perspective, our results, which also give rates of convergence, can be viewed as Monro-Robbins type conditions for augmented (S)GD. △ Less

Submitted 26 October, 2021; v1 submitted 21 October, 2020; originally announced October 2020.

Comments: 31 pages, 3 figures, NeurIPS 2021

arXiv:2005.08899 [pdf, ps, other]

Non-asymptotic Results for Singular Values of Gaussian Matrix Products

Authors: Boris Hanin, Grigoris Paouris

Abstract: This article concerns the non-asymptotic analysis of the singular values (and Lyapunov exponents) of Gaussian matrix products in the regime where $N,$ the number of term in the product, is large and $n,$ the size of the matrices, may be large or small and may depend on $N$. We obtain concentration estimates for sums of Lyapunov exponents, a quantitative rate of convergence of the empirical measure… ▽ More This article concerns the non-asymptotic analysis of the singular values (and Lyapunov exponents) of Gaussian matrix products in the regime where $N,$ the number of term in the product, is large and $n,$ the size of the matrices, may be large or small and may depend on $N$. We obtain concentration estimates for sums of Lyapunov exponents, a quantitative rate of convergence of the empirical measure of the normalized squared singular values to the uniform distribution on $[0,1]$, and results on the joint normality of Lyapunov exponents when $N$ is sufficiently large as a function of $n.$ Our technique consists of non-asymptotic versions of the ergodic theory approach at $N=\infty$ due originally to Furstenberg and Kesten in the 1960's, which were then further developed by Newman and Isopi-Newman as well as by a number of other authors in the 1980's. Our key technical idea is that small ball probabilities for volumes of random projections give a way to quantify convergence in the multiplicative ergodic theorem for random matrices. △ Less

Submitted 23 March, 2021; v1 submitted 18 May, 2020; originally announced May 2020.

Comments: 52 pages, to appear in GAFA

arXiv:1909.05989 [pdf, other]

Finite Depth and Width Corrections to the Neural Tangent Kernel

Authors: Boris Hanin, Mihai Nica

Abstract: We prove the precise scaling, at finite depth and width, for the mean and variance of the neural tangent kernel (NTK) in a randomly initialized ReLU network. The standard deviation is exponential in the ratio of network depth to width. Thus, even in the limit of infinite overparameterization, the NTK is not deterministic if depth and width simultaneously tend to infinity. Moreover, we prove that f… ▽ More We prove the precise scaling, at finite depth and width, for the mean and variance of the neural tangent kernel (NTK) in a randomly initialized ReLU network. The standard deviation is exponential in the ratio of network depth to width. Thus, even in the limit of infinite overparameterization, the NTK is not deterministic if depth and width simultaneously tend to infinity. Moreover, we prove that for such deep and wide networks, the NTK has a non-trivial evolution during training by showing that the mean of its first SGD update is also exponential in the ratio of network depth to width. This is sharp contrast to the regime where depth is fixed and network width is very large. Our results suggest that, unlike relatively shallow and wide networks, deep and wide ReLU networks are capable of learning data-dependent features even in the so-called lazy training regime. △ Less

Submitted 12 September, 2019; originally announced September 2019.

Comments: 27 pages, 2 figures, comments welcome

arXiv:1906.00904 [pdf, other]

Deep ReLU Networks Have Surprisingly Few Activation Patterns

Authors: Boris Hanin, David Rolnick

Abstract: The success of deep networks has been attributed in part to their expressivity: per parameter, deep networks can approximate a richer class of functions than shallow networks. In ReLU networks, the number of activation patterns is one measure of expressivity; and the maximum number of patterns grows exponentially with the depth. However, recent work has showed that the practical expressivity of de… ▽ More The success of deep networks has been attributed in part to their expressivity: per parameter, deep networks can approximate a richer class of functions than shallow networks. In ReLU networks, the number of activation patterns is one measure of expressivity; and the maximum number of patterns grows exponentially with the depth. However, recent work has showed that the practical expressivity of deep networks - the functions they can learn rather than express - is often far from the theoretical maximum. In this paper, we show that the average number of activation patterns for ReLU networks at initialization is bounded by the total number of neurons raised to the input dimension. We show empirically that this bound, which is independent of the depth, is tight both at initialization and during training, even on memorization tasks that should maximize the number of activation patterns. Our work suggests that realizing the full expressivity of deep networks may not be possible in practice, at least with current methods. △ Less

Submitted 20 October, 2019; v1 submitted 3 June, 2019; originally announced June 2019.

Comments: 18 page, 7 figures

Journal ref: NeurIPS 2019

arXiv:1905.02199 [pdf, other]

Nonlinear Approximation and (Deep) ReLU Networks

Authors: I. Daubechies, R. DeVore, S. Foucart, B. Hanin, G. Petrova

Abstract: This article is concerned with the approximation and expressive powers of deep neural networks. This is an active research area currently producing many interesting papers. The results most commonly found in the literature prove that neural networks approximate functions with classical smoothness to the same accuracy as classical linear methods of approximation, e.g. approximation by polynomials o… ▽ More This article is concerned with the approximation and expressive powers of deep neural networks. This is an active research area currently producing many interesting papers. The results most commonly found in the literature prove that neural networks approximate functions with classical smoothness to the same accuracy as classical linear methods of approximation, e.g. approximation by polynomials or by piecewise polynomials on prescribed partitions. However, approximation by neural networks depending on n parameters is a form of nonlinear approximation and as such should be compared with other nonlinear methods such as variable knot splines or n-term approximation from dictionaries. The performance of neural networks in targeted applications such as machine learning indicate that they actually possess even greater approximation power than these traditional methods of nonlinear approximation. The main results of this article prove that this is indeed the case. This is done by exhibiting large classes of functions which can be efficiently captured by neural networks where classical nonlinear methods fall short of the task. The present article purposefully limits itself to studying the approximation of univariate functions by ReLU networks. Many generalizations to functions of several variables and other activation functions can be envisioned. However, even in this simplest of settings considered here, a theory that completely quantifies the approximation power of neural networks is still lacking. △ Less

Submitted 5 May, 2019; originally announced May 2019.

MSC Class: 41A25; 41A30; 41A46; 68T99; 82C32; 92B20;

arXiv:1903.12524 [pdf, other]

Interface Asymptotics of Wigner-Weyl Distributions for the Harmonic Oscillator

Authors: Boris Hanin, Steve Zelditch

Abstract: We prove several types of scaling results for Wigner distributions of spectral projections of the isotropic Harmonic oscillator on $\mathbb R^d$. In prior work, we studied Wigner distributions $W_{\hbar, E_N(\hbar)}(x, ξ)$ of individual eigenspace projections. In this continuation, we study Weyl sums of such Wigner distributions as the eigenvalue $E_N(\hbar)$ ranges over spectral intervals… ▽ More We prove several types of scaling results for Wigner distributions of spectral projections of the isotropic Harmonic oscillator on $\mathbb R^d$. In prior work, we studied Wigner distributions $W_{\hbar, E_N(\hbar)}(x, ξ)$ of individual eigenspace projections. In this continuation, we study Weyl sums of such Wigner distributions as the eigenvalue $E_N(\hbar)$ ranges over spectral intervals $[E - δ(\hbar), E + δ(\hbar)]$ of various widths $δ(\hbar)$ and as $(x, ξ) \in T^*\mathbb R^d$ ranges over tubes of various widths around the classical energy surface $Σ_E \subset T^*\mathbb R^d$. The main results pertain to interface Airy scaling asymptotics around $Σ_E$, which divides phase space into an allowed and a forbidden region. The first result pertains to $δ(\hbar) = \hbar$ widths and generalizes our earlier results on Wigner distributions of individual eigenspace projections. Our second result pertains to $δ(\hbar) = \hbar^{2/3}$ spectral widths and Airy asymptotics of the Wigner distributions in $\hbar^{2/3}$-tubes around $Σ_E$. Our third result pertains to bulk spectral intervals of fixed width and the behavior of the Wigner distributions inside the energy surface, outside the energy surface and in a thin neighborhood of the energy surface. △ Less

Submitted 29 March, 2019; originally announced March 2019.

Comments: v1: 24p., 2 figs

arXiv:1901.09021 [pdf, other]

Complexity of Linear Regions in Deep Networks

Authors: Boris Hanin, David Rolnick

Abstract: It is well-known that the expressivity of a neural network depends on its architecture, with deeper networks expressing more complex functions. In the case of networks that compute piecewise linear functions, such as those with ReLU activation, the number of distinct linear regions is a natural measure of expressivity. It is possible to construct networks with merely a single region, or for which… ▽ More It is well-known that the expressivity of a neural network depends on its architecture, with deeper networks expressing more complex functions. In the case of networks that compute piecewise linear functions, such as those with ReLU activation, the number of distinct linear regions is a natural measure of expressivity. It is possible to construct networks with merely a single region, or for which the number of linear regions grows exponentially with depth; it is not clear where within this range most networks fall in practice, either before or after training. In this paper, we provide a mathematical framework to count the number of linear regions of a piecewise linear network and measure the volume of the boundaries between these regions. In particular, we prove that for networks at initialization, the average number of regions along any one-dimensional subspace grows linearly in the total number of neurons, far below the exponential upper bound. We also find that the average distance to the nearest region boundary at initialization scales like the inverse of the number of neurons. Our theory suggests that, even after training, the number of linear regions is far below exponential, an intuition that matches our empirical observations. We conclude that the practical expressivity of neural networks is likely far below that of the theoretical maximum, and that this gap can be quantified. △ Less

Submitted 11 June, 2019; v1 submitted 25 January, 2019; originally announced January 2019.

Comments: ICML 2019

arXiv:1901.06438 [pdf, other]

Interface Asymptotics of Eigenspace Wigner distributions for the Harmonic Oscillator

Authors: Boris Hanin, Steve Zelditch

Abstract: Eigenspaces of the quantum isotropic Harmonic Oscillator $\hat{H}_{\hbar} : = - \frac{\hbar^2}{2} Δ+ \frac{||x||^2}{2}$ on $\mathbb{R}^d$ have extremally high multiplicites and the eigenspace projections $Π_{\hbar, E_N(\hbar)} $ have special asymptotic properties. This article gives a detailed study of their Wigner distributions $W_{\hbar, E_N(\hbar)}(x, ξ)$. Heuristically, if $E_N(\hbar) = E$,… ▽ More Eigenspaces of the quantum isotropic Harmonic Oscillator $\hat{H}_{\hbar} : = - \frac{\hbar^2}{2} Δ+ \frac{||x||^2}{2}$ on $\mathbb{R}^d$ have extremally high multiplicites and the eigenspace projections $Π_{\hbar, E_N(\hbar)} $ have special asymptotic properties. This article gives a detailed study of their Wigner distributions $W_{\hbar, E_N(\hbar)}(x, ξ)$. Heuristically, if $E_N(\hbar) = E$, $W_{\hbar, E_N(\hbar)}(x, ξ)$ is the `quantization' of the energy surface $Σ_E$, and should be like the delta-function $δ_{Σ_E}$ on $Σ_E$; rigorously, $W_{\hbar, E_N(\hbar)}(x, ξ)$ tends in a weak* sense to $δ_{Σ_E}$. But its pointwise asymptotics and scaling asymptotics have more structure. The main results give Bessel asymptotics of $W_{\hbar, E_N(\hbar)}(x, ξ)$ in the interior $H(x, ξ) < E$ of $Σ_E$; interface Airy scaling asymptotics in tubes of radius $\hbar^{2/3}$ around $Σ_E$, with $(x, ξ)$ either in the interior or exterior of the energy ball; and exponential decay rates in the exterior of the energy surface. △ Less

Submitted 3 February, 2019; v1 submitted 18 January, 2019; originally announced January 2019.

Comments: 28 pages, 3 figures, v2. Statement of Theorem 1.5 slightly modified. Several references added

arXiv:1812.05994 [pdf, ps, other]

doi 10.1007/s00220-019-03624-z

Products of Many Large Random Matrices and Gradients in Deep Neural Networks

Authors: Boris Hanin, Mihai Nica

Abstract: We study products of random matrices in the regime where the number of terms and the size of the matrices simultaneously tend to infinity. Our main theorem is that the logarithm of the $\ell_2$ norm of such a product applied to any fixed vector is asymptotically Gaussian. The fluctuations we find can be thought of as a finite temperature correction to the limit in which first the size and then the… ▽ More We study products of random matrices in the regime where the number of terms and the size of the matrices simultaneously tend to infinity. Our main theorem is that the logarithm of the $\ell_2$ norm of such a product applied to any fixed vector is asymptotically Gaussian. The fluctuations we find can be thought of as a finite temperature correction to the limit in which first the size and then the number of matrices tend to infinity. Depending on the scaling limit considered, the mean and variance of the limiting Gaussian depend only on either the first two or the first four moments of the measure from which matrix entries are drawn. We also obtain explicit error bounds on the moments of the norm and the Kolmogorov-Smirnov distance to a Gaussian. Finally, we apply our result to obtain precise information about the stability of gradients in randomly initialized deep neural networks with ReLU activations. This provides a quantitative measure of the extent to which the exploding and vanishing gradient problem occurs in a fully connected neural network with ReLU activations and a given architecture. △ Less

Submitted 14 December, 2018; originally announced December 2018.

Comments: v1. 26p. Comments Welcome

arXiv:1806.00521 [pdf, other]

The lemniscate tree of a random polynomial

Authors: Michael Epstein, Boris Hanin, Erik Lundberg

Abstract: To each generic complex polynomial $p(z)$ there is associated a labeled binary tree (here referred to as a "lemniscate tree") that encodes the topological type of the graph of $|p(z)|$. The branching structure of the lemniscate tree is determined by the configuration (i.e., arrangement in the plane) of the singular components of those level sets $|p(z)|=t$ passing through a critical point. In th… ▽ More To each generic complex polynomial $p(z)$ there is associated a labeled binary tree (here referred to as a "lemniscate tree") that encodes the topological type of the graph of $|p(z)|$. The branching structure of the lemniscate tree is determined by the configuration (i.e., arrangement in the plane) of the singular components of those level sets $|p(z)|=t$ passing through a critical point. In this paper, we address the question "How many branches appear in a typical lemniscate tree?" We answer this question first for a lemniscate tree sampled uniformly from the combinatorial class and second for the lemniscate tree arising from a random polynomial generated by i.i.d. zeros. From a more general perspective, these results take a first step toward a probabilistic treatment (within a specialized setting) of Arnold's program of enumerating algebraic Morse functions. △ Less

Submitted 1 June, 2018; originally announced June 2018.

Comments: 18 pages, 6 figures

MSC Class: 30C15; 60G60; 31A15; 14P25; 05A15; 60C05; 60F05

arXiv:1803.01719 [pdf, other]

How to Start Training: The Effect of Initialization and Architecture

Authors: Boris Hanin, David Rolnick

Abstract: We identify and study two common failure modes for early training in deep ReLU nets. For each we give a rigorous proof of when it occurs and how to avoid it, for fully connected and residual architectures. The first failure mode, exploding/vanishing mean activation length, can be avoided by initializing weights from a symmetric distribution with variance 2/fan-in and, for ResNets, by correctly wei… ▽ More We identify and study two common failure modes for early training in deep ReLU nets. For each we give a rigorous proof of when it occurs and how to avoid it, for fully connected and residual architectures. The first failure mode, exploding/vanishing mean activation length, can be avoided by initializing weights from a symmetric distribution with variance 2/fan-in and, for ResNets, by correctly weighting the residual modules. We prove that the second failure mode, exponentially large variance of activation length, never occurs in residual nets once the first failure mode is avoided. In contrast, for fully connected nets, we prove that this failure mode can happen and is avoided by keeping constant the sum of the reciprocals of layer widths. We demonstrate empirically the effectiveness of our theoretical results in predicting when networks are able to start training. In particular, we note that many popular initializations fail our criteria, whereas correct initialization and architecture allows much deeper networks to be trained. △ Less

Submitted 13 November, 2018; v1 submitted 5 March, 2018; originally announced March 2018.

Comments: Final Version, 16p, Accepted NIPS 2018

arXiv:1801.03744 [pdf, other]

Which Neural Net Architectures Give Rise To Exploding and Vanishing Gradients?

Authors: Boris Hanin

Abstract: We give a rigorous analysis of the statistical behavior of gradients in a randomly initialized fully connected network N with ReLU activations. Our results show that the empirical variance of the squares of the entries in the input-output Jacobian of N is exponential in a simple architecture-dependent constant beta, given by the sum of the reciprocals of the hidden layer widths. When beta is large… ▽ More We give a rigorous analysis of the statistical behavior of gradients in a randomly initialized fully connected network N with ReLU activations. Our results show that the empirical variance of the squares of the entries in the input-output Jacobian of N is exponential in a simple architecture-dependent constant beta, given by the sum of the reciprocals of the hidden layer widths. When beta is large, the gradients computed by N at initialization vary wildly. Our approach complements the mean field theory analysis of random networks. From this point of view, we rigorously compute finite width corrections to the statistics of gradients at the edge of chaos. △ Less

Submitted 26 October, 2018; v1 submitted 11 January, 2018; originally announced January 2018.

Comments: v3. 18p. 1 fig. Accepted at NIPS 2018

arXiv:1710.11278 [pdf, other]

Approximating Continuous Functions by ReLU Nets of Minimal Width

Authors: Boris Hanin, Mark Sellke

Abstract: This article concerns the expressive power of depth in deep feed-forward neural nets with ReLU activations. Specifically, we answer the following question: for a fixed $d_{in}\geq 1,$ what is the minimal width $w$ so that neural nets with ReLU activations, input dimension $d_{in}$, hidden layer widths at most $w,$ and arbitrary depth can approximate any continuous, real-valued function of… ▽ More This article concerns the expressive power of depth in deep feed-forward neural nets with ReLU activations. Specifically, we answer the following question: for a fixed $d_{in}\geq 1,$ what is the minimal width $w$ so that neural nets with ReLU activations, input dimension $d_{in}$, hidden layer widths at most $w,$ and arbitrary depth can approximate any continuous, real-valued function of $d_{in}$ variables arbitrarily well? It turns out that this minimal width is exactly equal to $d_{in}+1.$ That is, if all the hidden layer widths are bounded by $d_{in}$, then even in the infinite depth limit, ReLU nets can only express a very limited class of functions, and, on the other hand, any continuous function on the $d_{in}$-dimensional unit cube can be approximated to arbitrary precision by ReLU nets in which all hidden layers have width exactly $d_{in}+1.$ Our construction in fact shows that any continuous function $f:[0,1]^{d_{in}}\to\mathbb R^{d_{out}}$ can be approximated by a net of width $d_{in}+d_{out}$. We obtain quantitative depth estimates for such an approximation in terms of the modulus of continuity of $f$. △ Less

Submitted 10 March, 2018; v1 submitted 30 October, 2017; originally announced October 2017.

Comments: v2. 13p. Extended main result to higher dimensional output. Comments welcome

arXiv:1708.06434 [pdf, other]

Level Spacings and Nodal Sets at Infinity for Radial Perturbations of the Harmonic Oscillator

Authors: Thomas Beck, Boris Hanin

Abstract: We study properties of the nodal sets of high frequency eigenfunctions and quasimodes for radial perturbations of the Harmonic Oscillator. In particular, we consider nodal sets on spheres of large radius (in the classically forbidden region) for quasimodes with energies lying in intervals around a fixed energy $E$. For well chosen intervals we show that these nodal sets exhibit quantitatively diff… ▽ More We study properties of the nodal sets of high frequency eigenfunctions and quasimodes for radial perturbations of the Harmonic Oscillator. In particular, we consider nodal sets on spheres of large radius (in the classically forbidden region) for quasimodes with energies lying in intervals around a fixed energy $E$. For well chosen intervals we show that these nodal sets exhibit quantitatively different behavior compared to those of the unperturbed Harmonic Oscillator. These energy intervals are defined via a careful analysis of the eigenvalue spacings for the perturbed operator, based on analytic perturbation theory and linearization formulas for Laguerre polynomials. △ Less

Submitted 21 August, 2017; originally announced August 2017.

Comments: 22 pages. v1. comments welcome

arXiv:1708.02691 [pdf, ps, other]

doi 10.3390/math7100992

Universal Function Approximation by Deep Neural Nets with Bounded Width and ReLU Activations

Authors: Boris Hanin

Abstract: This article concerns the expressive power of depth in neural nets with ReLU activations and bounded width. We are particularly interested in the following questions: what is the minimal width $w_{\text{min}}(d)$ so that ReLU nets of width $w_{\text{min}}(d)$ (and arbitrary depth) can approximate any continuous function on the unit cube $[0,1]^d$ aribitrarily well? For ReLU nets near this minimal… ▽ More This article concerns the expressive power of depth in neural nets with ReLU activations and bounded width. We are particularly interested in the following questions: what is the minimal width $w_{\text{min}}(d)$ so that ReLU nets of width $w_{\text{min}}(d)$ (and arbitrary depth) can approximate any continuous function on the unit cube $[0,1]^d$ aribitrarily well? For ReLU nets near this minimal width, what can one say about the depth necessary to approximate a given function? Our approach to this paper is based on the observation that, due to the convexity of the ReLU activation, ReLU nets are particularly well-suited for representing convex functions. In particular, we prove that ReLU nets with width $d+1$ can approximate any continuous convex function of $d$ variables arbitrarily well. These results then give quantitative depth estimates for the rate of approximation of any continuous scalar function on the $d$-dimensional cube $[0,1]^d$ by ReLU nets with width $d+3.$ △ Less

Submitted 20 December, 2017; v1 submitted 8 August, 2017; originally announced August 2017.

Comments: v3. Theorem 3 removed. Comments Welcome. 9p

Journal ref: Mathematics 2019, 7(10), 992

arXiv:1610.09438 [pdf, ps, other]

Local Universality for Zeros and Critical Points of Monochromatic Random Waves

Authors: Yaiza Canzani, Boris Hanin

Abstract: This paper concerns the asymptotic behavior of zeros and critical points for monochromatic random waves $φ_λ$ of frequency $λ$ on a compact, smooth, Riemannian manifold $(M,g)$ as $λ\rightarrow \infty$. We prove that the measure of integration over the zero set of $φ_λ$ restricted to balls of radius $\approx λ^{-1}$ converges in distribution to the measure of integration over the zero set of a fre… ▽ More This paper concerns the asymptotic behavior of zeros and critical points for monochromatic random waves $φ_λ$ of frequency $λ$ on a compact, smooth, Riemannian manifold $(M,g)$ as $λ\rightarrow \infty$. We prove that the measure of integration over the zero set of $φ_λ$ restricted to balls of radius $\approx λ^{-1}$ converges in distribution to the measure of integration over the zero set of a frequency $1$ random wave on $\mathbb R^n$, where $n$ is the dimension of $M$. We also prove convergence of finite moments for the counting measure of the critical points of φλ, again restricted to balls of radius $\approx λ^{-1}$, to the corresponding moments for frequency $1$ random waves. We then patch together these local results to obtain new global variance estimates on the volume of the zero set and numbers of critical points of $φ_λ$ on all of $M.$ Our local results hold under conditions about the structure of geodesics on $M$ that are generic in the space of all metrics on $M$, while our global results hold whenever $(M,g)$ has no conjugate points (e.g is negatively curved). △ Less

Submitted 9 May, 2020; v1 submitted 28 October, 2016; originally announced October 2016.

Comments: v3. Accepted Comm. Math. Phys

arXiv:1604.04307 [pdf, ps, other]

Nodal Sets of Smooth Functions with Finite Vanishing Order and p-Sweepouts

Authors: Thomas Beck, Spencer T. Becker-Kahn, Boris Hanin

Abstract: We show that on a compact Riemmanian manifold $(M,g)$, nodal sets of linear combinations of any $p+1$ smooth functions form an admissible $p-$sweepout provided these linear combinations have uniformly bounded vanishing order. This applies in particular to finite linear combinations of Laplace eigenfunctions. As a result, we obtain a new proof of the Gromov, Guth, Marques--Neves upper bounds on the… ▽ More We show that on a compact Riemmanian manifold $(M,g)$, nodal sets of linear combinations of any $p+1$ smooth functions form an admissible $p-$sweepout provided these linear combinations have uniformly bounded vanishing order. This applies in particular to finite linear combinations of Laplace eigenfunctions. As a result, we obtain a new proof of the Gromov, Guth, Marques--Neves upper bounds on the min-max $p$-widths of $M.$ We also prove that close to a point at which a smooth function on $\mathbb{R}^{n+1}$ vanishes to order $k$, its nodal set is contained in the union of $k$ $W^{1,p}$ graphs for some $p > 1$. This implies that the nodal set is locally countably $n$-rectifiable and has locally finite $\mathcal{H}^n$ measure, facts which also follow from a previous result of Bär. Finally, we prove the continuity of the Hausdorff measure of nodal sets under heat flow. △ Less

Submitted 17 October, 2016; v1 submitted 14 April, 2016; originally announced April 2016.

Comments: 13 pages. comments welcome. v3

arXiv:1602.06848 [pdf, other]

doi 10.1007/s00220-016-2807-4

Scaling of Harmonic Oscillator Eigenfunctions and Their Nodal Sets Around the Caustic

Authors: Boris Hanin, Steve Zelditch, Peng Zhou

Abstract: We study the scaling asymptotics of the eigenspace projection kernels $Π_{\hbar, E}(x,y)$ of the isotropic Harmonic Oscillator $- \hbar ^2 Δ+ |x|^2$ of eigenvalue $E = \hbar(N + \frac{d}{2})$ in the semi-classical limit $\hbar \to 0$. The principal result is an explicit formula for the scaling asymptotics of $Π_{\hbar, E}(x,y)$ for $x,y$ in a $\hbar^{2/3}$ neighborhood of the caustic… ▽ More We study the scaling asymptotics of the eigenspace projection kernels $Π_{\hbar, E}(x,y)$ of the isotropic Harmonic Oscillator $- \hbar ^2 Δ+ |x|^2$ of eigenvalue $E = \hbar(N + \frac{d}{2})$ in the semi-classical limit $\hbar \to 0$. The principal result is an explicit formula for the scaling asymptotics of $Π_{\hbar, E}(x,y)$ for $x,y$ in a $\hbar^{2/3}$ neighborhood of the caustic $\mathcal C_E$ as $\hbar \to 0.$ The scaling asymptotics are applied to the distribution of nodal sets of Gaussian random eigenfunctions around the caustic as $\hbar \to 0$. In previous work we proved that the density of zeros of Gaussian random eigenfunctions of $\hat{H}_{\hbar}$ have different orders in the Planck constant $\hbar$ in the allowed and forbidden regions: In the allowed region the density is of order $\hbar^{-1}$ while it is $\hbar^{-1/2}$ in the forbidden region. Our main result on nodal sets is that the density of zeros is of order $\hbar^{-\frac{2}{3}}$ in an $\hbar^{\frac{2}{3}}$-tube around the caustic. This tube radius is the `critical radius'. For annuli of larger inner and outer radii $\hbar^α$ with $0< α< \frac{2}{3}$ we obtain density results which interpolate between this critical radius result and our prior ones in the allowed and forbidden region. We also show that the Hausdorff $(d-2)$-dimensional measure of the intersection of the nodal set with the caustic is of order $\hbar^{- \frac{2}{3}}$. △ Less

Submitted 13 November, 2016; v1 submitted 22 February, 2016; originally announced February 2016.

Comments: v3. Accepted to Communications in Mathematical Physics

arXiv:1602.00730 [pdf, ps, other]

C-infinity Scaling Asymptotics for the Spectral Function of the Laplacian

Authors: Yaiza Canzani, Boris Hanin

Abstract: This article concerns new off-diagonal estimates on the remainder and its derivatives in the pointwise Weyl law on a compact n-dimensional Riemannian manifold. As an application, we prove that near any non self-focal point, the scaling limit of the spectral projector of the Laplacian onto frequency windows of constant size is a normalized Bessel function depending only on n. This article concerns new off-diagonal estimates on the remainder and its derivatives in the pointwise Weyl law on a compact n-dimensional Riemannian manifold. As an application, we prove that near any non self-focal point, the scaling limit of the spectral projector of the Laplacian onto frequency windows of constant size is a normalized Bessel function depending only on n. △ Less

Submitted 1 February, 2016; originally announced February 2016.

arXiv:1601.06417 [pdf, other]

Pairing of Zeros and Critical Points for Random Polynomials

Authors: Boris Hanin

Abstract: Let p_N be a random degree N polynomial in one complex variable whose zeros are chosen independently from a fixed probability measure mu on the Riemann sphere S^2. This article proves that if we condition p_N to have a zero at some fixed point xi in , then, with high probability, there will be a critical point w_xi a distance 1/N away from xi. This 1/N distance is much smaller than the one over ro… ▽ More Let p_N be a random degree N polynomial in one complex variable whose zeros are chosen independently from a fixed probability measure mu on the Riemann sphere S^2. This article proves that if we condition p_N to have a zero at some fixed point xi in , then, with high probability, there will be a critical point w_xi a distance 1/N away from xi. This 1/N distance is much smaller than the one over root N typical spacing between nearest neighbors for N i.i.d. points on S^2. Moreover, with the same high probability, the argument of w_xi relative to xi is a deterministic function of mu plus fluctuations on the order of 1/N. △ Less

Submitted 24 January, 2016; originally announced January 2016.

Comments: v1 comments welcome

arXiv:1411.0658 [pdf, ps, other]

doi 10.2140/apde.2015.8.1707

Scaling Limit for the Kernel of the Spectral Projector and Remainder Estimates in the Pointwise Weyl Law

Authors: Yaiza Canzani, Boris Hanin

Abstract: Let (M, g) be a compact smooth Riemannian manifold. We obtain new off-diagonal estimates as λ tend to infinity for the remainder in the pointwise Weyl Law for the kernel of the spectral projector of the Laplacian onto functions with frequency at most λ. A corollary is that, when rescaled around a non self-focal point, the kernel of the spectral projector onto the frequency interval (λ, λ+ 1] has a… ▽ More Let (M, g) be a compact smooth Riemannian manifold. We obtain new off-diagonal estimates as λ tend to infinity for the remainder in the pointwise Weyl Law for the kernel of the spectral projector of the Laplacian onto functions with frequency at most λ. A corollary is that, when rescaled around a non self-focal point, the kernel of the spectral projector onto the frequency interval (λ, λ+ 1] has a universal scaling limit as λ goes to infinity (depending only on the dimension of M). Our results also imply that if M has no conjugate points, then immersions of M into Euclidean space by an orthonormal basis of eigenfunctions with frequencies in (λ, λ+ 1] are embeddings for all λ sufficiently large. △ Less

Submitted 27 December, 2015; v1 submitted 3 November, 2014; originally announced November 2014.

Comments: Published version. Modified parametrix construction in Section 3. References added and typos corrected

Journal ref: Analysis and PDE Vol 8 (2015) No 7 1707-1731

arXiv:1406.2309 [pdf, ps, other]

doi 10.3934/era.2015.22.76

High Frequency Eigenfunction Immersions and Supremum Norms of Random Waves

Authors: Yaiza Canzani, Boris Hanin

Abstract: A compact Riemannian manifold may be immersed into Euclidean space by using high frequency Laplace eigenfunctions. We study the geometry of the manifold viewed as a metric space endowed with the distance function from the ambient Euclidean space. As an application we give a new proof of a result of Burq-Lebeau and others on upper bounds for the sup-norms of random linear combinations of high frequ… ▽ More A compact Riemannian manifold may be immersed into Euclidean space by using high frequency Laplace eigenfunctions. We study the geometry of the manifold viewed as a metric space endowed with the distance function from the ambient Euclidean space. As an application we give a new proof of a result of Burq-Lebeau and others on upper bounds for the sup-norms of random linear combinations of high frequency eigenfunctions. △ Less

Submitted 7 June, 2014; originally announced June 2014.

Comments: This article supersedes arXiv:1310.1361, which has now been withdrawn

MSC Class: 35P20; 58J51; 58J37; 58J40; 58J50

Journal ref: ERA - MS 22 no 0 January 2015 76 - 86

arXiv:1310.4532 [pdf, other]

Nodal Sets of Random Eigenfunctions for the Isotropic Harmonic Oscillator

Authors: Boris Hanin, Steve Zelditch, Peng Zhou

Abstract: We consider Gaussian random eigenfunctions (Hermite functions) of fixed energy level of the isotropic semi-classical Harmonic Oscillator on ${\bf R}^n$. We calculate the expected density of zeros of a random eigenfunction in the semi-classical limit $h \to 0.$ In the allowed region the density is of order $h^{-1},$ while in the forbidden region the density is of order $h^{-\frac{1}{2}}$. The compu… ▽ More We consider Gaussian random eigenfunctions (Hermite functions) of fixed energy level of the isotropic semi-classical Harmonic Oscillator on ${\bf R}^n$. We calculate the expected density of zeros of a random eigenfunction in the semi-classical limit $h \to 0.$ In the allowed region the density is of order $h^{-1},$ while in the forbidden region the density is of order $h^{-\frac{1}{2}}$. The computer graphics due to E.J. Heller illustrate this difference in "frequency" between the allowed and forbidden nodal sets. △ Less

Submitted 12 November, 2013; v1 submitted 16 October, 2013; originally announced October 2013.

Comments: 3 figures, 2 due to E. J. Heller. Corrected the calculation in the forbidden region

Journal ref: IMRN Vol. 2015, No. 13, pp. 4813-4839, (2015)

arXiv:1310.1361

Mean of the $L^\infty$-norm for $L^2$-normalized random waves on compact aperiodic Riemannian manifolds

Authors: Yaiza Canzani, Boris Hanin

Abstract: This article concerns upper bounds for $L^\infty$-norms of random approximate eigenfunctions of the Laplace operator on a compact aperiodic Riemannian manifold $(M,g).$ We study $f_λ$ chosen uniformly at random from the space of $L^2$-normalized linear combinations of Laplace eigenfunctions with eigenvalues in the interval $(λ^2, \lr{λ+1}^2].$ Our main result is that the expected value of… ▽ More This article concerns upper bounds for $L^\infty$-norms of random approximate eigenfunctions of the Laplace operator on a compact aperiodic Riemannian manifold $(M,g).$ We study $f_λ$ chosen uniformly at random from the space of $L^2$-normalized linear combinations of Laplace eigenfunctions with eigenvalues in the interval $(λ^2, \lr{λ+1}^2].$ Our main result is that the expected value of $\norm{f_λ}_\infty$ grows at most like $C \sqrt{\log λ}$ as $λ\to \infty$, where $C$ is an explicit constant depending only on the dimension and volume of $(M,g).$ In addition, we obtain concentration of the $L^\infty$-norm around its mean and median and study the analogous problems for Gaussian random waves on $(M,g).$ △ Less

Submitted 7 June, 2014; v1 submitted 4 October, 2013; originally announced October 2013.

Comments: Withdrawn due to significant overlap with the work of Burq-Lebeau (arXiv:1111.7310). The authors have uploaded a new version of the withdrawn paper, "Fixed Frequency Eigenfunction Immersions and Supremum Norms of Random Waves," which acknowledges the preceding work and gives a significantly simper argument that is more geometric and substantively different from the technique of Burq-Lebeau

MSC Class: 35P20 (Primary); 58J51 (Secondary); 58J37; 58J40; 58J50

arXiv:1305.6105 [pdf, other]

doi 10.4310/MRL.2015.v22.n1.a7

Pairing of Zeros and Critical Points for Random Meromorphic Functions on Riemann Surfaces

Authors: Boris Hanin

Abstract: We prove that zeros and critical points of a random polynomial $p_N$ of degree $N$ in one complex variable appear in pairs. More precisely, if $p_N$ is conditioned to have $p_N(ξ)=0$ for a fixed $ξ\in \C\backslash\set{0},$ we prove that there is a unique critical point z in the annulus $N^{-1-\ep}<\abs{z-ξ}< N^{-1+\ep}}$ and no critical points closer to $ξ$ with probability at least… ▽ More We prove that zeros and critical points of a random polynomial $p_N$ of degree $N$ in one complex variable appear in pairs. More precisely, if $p_N$ is conditioned to have $p_N(ξ)=0$ for a fixed $ξ\in \C\backslash\set{0},$ we prove that there is a unique critical point z in the annulus $N^{-1-\ep}<\abs{z-ξ}< N^{-1+\ep}}$ and no critical points closer to $ξ$ with probability at least $1-O(N^{-3/2+3\ep}).$ We also prove an analogous statement in the more general setting of random meromorphic functions on a closed Riemann surface. △ Less

Submitted 27 May, 2013; originally announced May 2013.

Comments: 20 pages, 2 figures

Journal ref: Math Res Letters Volume 22 (2015) Number 1 p111-140

arXiv:1207.4734 [pdf, other]

doi 10.1093/imrn/rnt192

Correlations and Pairing Between Zeros and Critical Points of Gaussian Random Polynomials

Authors: Boris Hanin

Abstract: We study the asymptotics of correlations and nearest neighbor spacings between zeros and holomorphic critical points of $p_N$, a degree N Hermitian Gaussian random polynomial in the sense of Shiffman and Zeldtich, as N goes to infinity. By holomorphic critical point we mean a solution to the equation $\frac{d}{dz}p_N(z)=0.$ Our principal result is an explicit asymptotic formula for the local scali… ▽ More We study the asymptotics of correlations and nearest neighbor spacings between zeros and holomorphic critical points of $p_N$, a degree N Hermitian Gaussian random polynomial in the sense of Shiffman and Zeldtich, as N goes to infinity. By holomorphic critical point we mean a solution to the equation $\frac{d}{dz}p_N(z)=0.$ Our principal result is an explicit asymptotic formula for the local scaling limit of $\E{Z_{p_N}\wedge C_{p_N}},$ the expected joint intensity of zeros and critical points, around any point on the Riemann sphere. Here $Z_{p_N}$ and $C_{p_N}$ are the currents of integration (i.e. counting measures) over the zeros and critical points of $p_N$, respectively. We prove that correlations between zeros and critical points are short range, decaying like $e^{-N\abs{z-w}^2}.$ With $\abs{z-w}$ on the order of $N^{-1/2},$ however, $\E{Z_{p_N}\wedge C_{p_N}}(z,w)$ is sharply peaked near $z=w,$ causing zeros and critical points to appear in rigid pairs. We compute tight bounds on the expected distance and angular dependence between a critical point and its paired zero. △ Less

Submitted 29 August, 2012; v1 submitted 19 July, 2012; originally announced July 2012.

Comments: 35 pages, 3 figures. Some typos corrected and Introduction revised

Journal ref: Int Math Res Notices (2015) 2015 (2): 381-421

Showing 1–46 of 46 results for author: Hanin, B