subscribe to arXiv mailings

Optimal Quantization for Matrix Multiplication

Abstract: Recent work in machine learning community proposed multiple methods for performing lossy compression (quantization) of large matrices. This quantization is important for accelerating matrix multiplication (main component of large language models), which is often bottlenecked by the speed of loading these matrices from memory. Unlike classical vector quantization and rate-distortion theory, the goa… ▽ More Recent work in machine learning community proposed multiple methods for performing lossy compression (quantization) of large matrices. This quantization is important for accelerating matrix multiplication (main component of large language models), which is often bottlenecked by the speed of loading these matrices from memory. Unlike classical vector quantization and rate-distortion theory, the goal of these new compression algorithms is to be able to approximate not the matrices themselves, but their matrix product. Specifically, given a pair of real matrices $A,B$ an encoder (compressor) is applied to each of them independently producing descriptions with $R$ bits per entry. These representations subsequently are used by the decoder to estimate matrix product $A^\top B$. In this work, we provide a non-asymptotic lower bound on the mean squared error of this approximation (as a function of rate $R$) for the case of matrices $A,B$ with iid Gaussian entries. Algorithmically, we construct a universal quantizer based on nested lattices with an explicit guarantee of approximation error for any (non-random) pair of matrices $A$, $B$ in terms of only Frobenius norms $\|A\|_F, \|B\|_F$ and $\|A^\top B\|_F$. For iid Gaussian matrices our quantizer achieves the lower bound and is, thus, asymptotically optimal. A practical low-complexity version of our quantizer achieves performance quite close to optimal. In information-theoretic terms we derive rate-distortion function for matrix multiplication of iid Gaussian matrices. △ Less

Submitted 17 October, 2024; originally announced October 2024.

arXiv:2410.06833 [pdf, other]

Dynamic metastability in the self-attention model

Authors: Borjan Geshkovski, Hugo Koubbi, Yury Polyanskiy, Philippe Rigollet

Abstract: We consider the self-attention model - an interacting particle system on the unit sphere, which serves as a toy model for Transformers, the deep neural network architecture behind the recent successes of large language models. We prove the appearance of dynamic metastability conjectured in [GLPR23] - although particles collapse to a single cluster in infinite time, they remain trapped near a confi… ▽ More We consider the self-attention model - an interacting particle system on the unit sphere, which serves as a toy model for Transformers, the deep neural network architecture behind the recent successes of large language models. We prove the appearance of dynamic metastability conjectured in [GLPR23] - although particles collapse to a single cluster in infinite time, they remain trapped near a configuration of several clusters for an exponentially long period of time. By leveraging a gradient flow interpretation of the system, we also connect our result to an overarching framework of slow motion of gradient flows proposed by Otto and Reznikoff [OR07] in the context of coarsening and the Allen-Cahn equation. We finally probe the dynamics beyond the exponentially long period of metastability, and illustrate that, under an appropriate time-rescaling, the energy reaches its global maximum in finite time and has a staircase profile, with trajectories manifesting saddle-to-saddle-like behavior, reminiscent of recent works in the analysis of training dynamics via gradient descent for two-layer neural networks. △ Less

Submitted 9 October, 2024; originally announced October 2024.

arXiv:2409.08839 [pdf, other]

RF Challenge: The Data-Driven Radio Frequency Signal Separation Challenge

Authors: Alejandro Lancho, Amir Weiss, Gary C. F. Lee, Tejas Jayashankar, Binoy Kurien, Yury Polyanskiy, Gregory W. Wornell

Abstract: This paper addresses the critical problem of interference rejection in radio-frequency (RF) signals using a novel, data-driven approach that leverages state-of-the-art AI models. Traditionally, interference rejection algorithms are manually tailored to specific types of interference. This work introduces a more scalable data-driven solution and contains the following contributions. First, we prese… ▽ More This paper addresses the critical problem of interference rejection in radio-frequency (RF) signals using a novel, data-driven approach that leverages state-of-the-art AI models. Traditionally, interference rejection algorithms are manually tailored to specific types of interference. This work introduces a more scalable data-driven solution and contains the following contributions. First, we present an insightful signal model that serves as a foundation for developing and analyzing interference rejection algorithms. Second, we introduce the RF Challenge, a publicly available dataset featuring diverse RF signals along with code templates, which facilitates data-driven analysis of RF signal problems. Third, we propose novel AI-based rejection algorithms, specifically architectures like UNet and WaveNet, and evaluate their performance across eight different signal mixture types. These models demonstrate superior performance exceeding traditional methods like matched filtering and linear minimum mean square error estimation by up to two orders of magnitude in bit-error rate. Fourth, we summarize the results from an open competition hosted at 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024) based on the RF Challenge, highlighting the significant potential for continued advancements in this area. Our findings underscore the promise of deep learning algorithms in mitigating interference, offering a strong foundation for future research. △ Less

Submitted 13 September, 2024; originally announced September 2024.

Comments: 14 pages, 12 figures, submitted to the IEEE Open Journal of the Communications Society

arXiv:2409.07689 [pdf, ps, other]

Entropy Contractions in Markov Chains: Half-Step, Full-Step and Continuous-Time

Authors: Pietro Caputo, Zongchen Chen, Yuzhou Gu, Yury Polyanskiy

Abstract: This paper considers the speed of convergence (mixing) of a finite Markov kernel $P$ with respect to the Kullback-Leibler divergence (entropy). Given a Markov kernel one defines either a discrete-time Markov chain (with the $n$-step transition kernel given by the matrix power $P^n$) or a continuous-time Markov process (with the time-$t$ transition kernel given by $e^{t(P-\mathrm{Id})}$). The contr… ▽ More This paper considers the speed of convergence (mixing) of a finite Markov kernel $P$ with respect to the Kullback-Leibler divergence (entropy). Given a Markov kernel one defines either a discrete-time Markov chain (with the $n$-step transition kernel given by the matrix power $P^n$) or a continuous-time Markov process (with the time-$t$ transition kernel given by $e^{t(P-\mathrm{Id})}$). The contraction of entropy for $n=1$ or $t=0+$ are characterized by the famous functional inequalities, the strong data processing inequality (SDPI) and the modified log-Sobolev inequality (MLSI), respectively. When $P=KK^*$ is written as the product of a kernel and its adjoint, one could also consider the ``half-step'' contraction, which is the SDPI for $K$, while the ``full-step'' contraction refers to the SDPI for $P$. The work [DMLM03] claimed that these contraction coefficients (half-step, full-step, and continuous-time) are generally within a constant factor of each other. We disprove this and related conjectures by working out a number of different counterexamples. In particular, we construct (a) a continuous-time Markov process that contracts arbitrarily faster than its discrete-time counterpart; and (b) a kernel $P$ such that $P^{m+1}$ contracts arbitrarily better than $P^m$. Hence, our main conclusion is that the four standard inequalities comparing five common notions of entropy and variance contraction are generally not improvable. In the process of analyzing the counterexamples, we survey and sharpen the tools for bounding the contraction coefficients and characterize properties of extremizers of the respective functional inequalities. As our examples range from Bernoulli-Laplace model, random walks on graphs, to birth-death chains, the paper is also intended as a tutorial on computing MLSI, SDPI and other constants for these types of commonly occurring Markov chains. △ Less

Submitted 11 September, 2024; originally announced September 2024.

arXiv:2408.09874 [pdf, other]

Unsourced Multiple Access: A Coding Paradigm for Massive Random Access

Authors: Gianluigi Liva, Yury Polyanskiy

Abstract: This paper is a tutorial introduction to the field of unsourced multiple access (UMAC) protocols. We first provide a historical survey of the evolution of random access protocols, focusing specifically on the case in which uncoordinated users share a wireless broadcasting medium. Next, we highlight the change of perspective originated by the UMAC model, in which the physical and medium access laye… ▽ More This paper is a tutorial introduction to the field of unsourced multiple access (UMAC) protocols. We first provide a historical survey of the evolution of random access protocols, focusing specifically on the case in which uncoordinated users share a wireless broadcasting medium. Next, we highlight the change of perspective originated by the UMAC model, in which the physical and medium access layer's protocols cooperate, thus reframing random access as a novel coding-theoretic problem. By now, a large variety of UMAC protocols (codes) emerged, necessitating a certain classification that we indeed propose here. Although some random access schemes require a radical change of the physical layer, others can be implemented with minimal changes to existing industry standards. As an example, we discuss a simple modification to the 5GNR Release 16 random access channel that builds on the UMAC theory and that dramatically improves energy efficiency for systems with even moderate number of simultaneous users (e.g., $5-10$ dB gain for $10-50$ users), and also enables handling of high number of users, something completely out of reach of the state-of-the-art. △ Less

Submitted 19 August, 2024; originally announced August 2024.

Comments: Proceedings of the IEEE, accepted for publication

arXiv:2405.03348 [pdf, other]

Evolution of the 5G New Radio Two-Step Random Access towards 6G Unsourced MAC

Authors: Patrick Agostini, Jean-Francois Chamberland, Federico Clazzer, Johannes Dommel, Gianluigi Liva, Andrea Munari, Krishna Narayanan, Yury Polyanskiy, Slawomir Stanczak, Zoran Utkovski

Abstract: This report summarizes some considerations on possible evolutions of grant-free random access in the next generation of the 3GPP wireless cellular standard. The analysis is carried out by mapping the problem to the recently-introduced unsourced multiple access channel (UMAC) setup. By doing so, the performance of existing solutions can be benchmarked with information-theoretic bounds, assessing th… ▽ More This report summarizes some considerations on possible evolutions of grant-free random access in the next generation of the 3GPP wireless cellular standard. The analysis is carried out by mapping the problem to the recently-introduced unsourced multiple access channel (UMAC) setup. By doing so, the performance of existing solutions can be benchmarked with information-theoretic bounds, assessing the potential gains that can be achieved over legacy 3GPP schemes. The study focuses on the two-step random access (2SRA) protocol introduced by Release 16 of the 5G New Radio standard, investigating its applicability to support large MTC / IoT terminal populations in a grant-free fashion. The analysis shows that the existing 2SRA scheme may not succeed in providing energy-efficient support to large user populations. Modifications to the protocol are proposed that enable remarkable gains in both energy and spectral efficiency while retaining a strong resemblance to the legacy protocol. △ Less

Submitted 6 May, 2024; originally announced May 2024.

Comments: Version 1.0 of the report

arXiv:2312.17701 [pdf, ps, other]

Density estimation using the perceptron

Authors: Patrik Róbert Gerber, Tianze Jiang, Yury Polyanskiy, Rui Sun

Abstract: We propose a new density estimation algorithm. Given $n$ i.i.d. samples from a distribution belonging to a class of densities on $\mathbb{R}^d$, our estimator outputs any density in the class whose ''perceptron discrepancy'' with the empirical distribution is at most $O(\sqrt{d/n})$. The perceptron discrepancy between two distributions is defined as the largest difference in mass that they place o… ▽ More We propose a new density estimation algorithm. Given $n$ i.i.d. samples from a distribution belonging to a class of densities on $\mathbb{R}^d$, our estimator outputs any density in the class whose ''perceptron discrepancy'' with the empirical distribution is at most $O(\sqrt{d/n})$. The perceptron discrepancy between two distributions is defined as the largest difference in mass that they place on any halfspace of $\mathbb{R}^d$. It is shown that this estimator achieves expected total variation distance to the truth that is almost minimax optimal over the class of densities with bounded Sobolev norm and Gaussian mixtures. This suggests that regularity of the prior distribution could be an explanation for the efficiency of the ubiquitous step in machine learning that replaces optimization over large function spaces with simpler parametric classes (e.g. in the discriminators of GANs). We generalize the above to show that replacing the ''perceptron discrepancy'' with the generalized energy distance of Székeley-Rizzo further improves total variation loss. The generalized energy distance between empirical distributions is easily computable and differentiable, thus making it especially useful for fitting generative models. To the best of our knowledge, it is the first example of a distance with such properties for which there are minimax statistical guarantees. △ Less

Submitted 20 February, 2024; v1 submitted 29 December, 2023; originally announced December 2023.

Comments: 47 pages

MSC Class: 62G07

arXiv:2312.10794 [pdf, other]

A mathematical perspective on Transformers

Authors: Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, Philippe Rigollet

Abstract: Transformers play a central role in the inner workings of large language models. We develop a mathematical framework for analyzing Transformers based on their interpretation as interacting particle systems, which reveals that clusters emerge in long time. Our study explores the underlying theory and offers new perspectives for mathematicians as well as computer scientists. Transformers play a central role in the inner workings of large language models. We develop a mathematical framework for analyzing Transformers based on their interpretation as interacting particle systems, which reveals that clusters emerge in long time. Our study explores the underlying theory and offers new perspectives for mathematicians as well as computer scientists. △ Less

Submitted 12 August, 2024; v1 submitted 17 December, 2023; originally announced December 2023.

arXiv:2308.09043 [pdf, other]

Kernel-Based Tests for Likelihood-Free Hypothesis Testing

Authors: Patrik Róbert Gerber, Tianze Jiang, Yury Polyanskiy, Rui Sun

Abstract: Given $n$ observations from two balanced classes, consider the task of labeling an additional $m$ inputs that are known to all belong to \emph{one} of the two classes. Special cases of this problem are well-known: with complete knowledge of class distributions ($n=\infty$) the problem is solved optimally by the likelihood-ratio test; when $m=1$ it corresponds to binary classification; and when… ▽ More Given $n$ observations from two balanced classes, consider the task of labeling an additional $m$ inputs that are known to all belong to \emph{one} of the two classes. Special cases of this problem are well-known: with complete knowledge of class distributions ($n=\infty$) the problem is solved optimally by the likelihood-ratio test; when $m=1$ it corresponds to binary classification; and when $m\approx n$ it is equivalent to two-sample testing. The intermediate settings occur in the field of likelihood-free inference, where labeled samples are obtained by running forward simulations and the unlabeled sample is collected experimentally. In recent work it was discovered that there is a fundamental trade-off between $m$ and $n$: increasing the data sample $m$ reduces the amount $n$ of training/simulation data needed. In this work we (a) introduce a generalization where unlabeled samples come from a mixture of the two classes -- a case often encountered in practice; (b) study the minimax sample complexity for non-parametric classes of densities under \textit{maximum mean discrepancy} (MMD) separation; and (c) investigate the empirical performance of kernels parameterized by neural networks on two tasks: detection of the Higgs boson and detection of planted DDPM generated images amidst CIFAR-10 images. For both problems we confirm the existence of the theoretically predicted asymmetric $m$ vs $n$ trade-off. △ Less

Submitted 23 November, 2023; v1 submitted 17 August, 2023; originally announced August 2023.

Comments: 36 pages, 6 figures

arXiv:2307.02070 [pdf, other]

Empirical Bayes via ERM and Rademacher complexities: the Poisson model

Authors: Soham Jana, Yury Polyanskiy, Anzo Teh, Yihong Wu

Abstract: We consider the problem of empirical Bayes estimation for (multivariate) Poisson means. Existing solutions that have been shown theoretically optimal for minimizing the regret (excess risk over the Bayesian oracle that knows the prior) have several shortcomings. For example, the classical Robbins estimator does not retain the monotonicity property of the Bayes estimator and performs poorly under m… ▽ More We consider the problem of empirical Bayes estimation for (multivariate) Poisson means. Existing solutions that have been shown theoretically optimal for minimizing the regret (excess risk over the Bayesian oracle that knows the prior) have several shortcomings. For example, the classical Robbins estimator does not retain the monotonicity property of the Bayes estimator and performs poorly under moderate sample size. Estimators based on the minimum distance and non-parametric maximum likelihood (NPMLE) methods correct these issues, but are computationally expensive with complexity growing exponentially with dimension. Extending the approach of Barbehenn and Zhao (2022), in this work we construct monotone estimators based on empirical risk minimization (ERM) that retain similar theoretical guarantees and can be computed much more efficiently. Adapting the idea of offset Rademacher complexity Liang et al. (2015) to the non-standard loss and function class in empirical Bayes, we show that the shape-constrained ERM estimator attains the minimax regret within constant factors in one dimension and within logarithmic factors in multiple dimensions. △ Less

Submitted 5 July, 2023; originally announced July 2023.

Comments: 34 pages, 1 Figure, to appear in the 2023 Conference of Learning Theory (COLT)

arXiv:2307.01095 [pdf, other]

Coded Orthogonal Modulation for the Multi-Antenna Multiple-Access Channel

Authors: Alexander Fengler, Alejandro Lancho, Yury Polyanskiy

Abstract: This study focuses on (traditional and unsourced) multiple-access communication over a single transmit and multiple ($M$) receive antennas. We assume full or partial channel state information (CSI) at the receiver. It is known that to fully achieve the fundamental limits (even asymptotically) the decoder needs to jointly estimate all user codewords, doing which directly is computationally infeasib… ▽ More This study focuses on (traditional and unsourced) multiple-access communication over a single transmit and multiple ($M$) receive antennas. We assume full or partial channel state information (CSI) at the receiver. It is known that to fully achieve the fundamental limits (even asymptotically) the decoder needs to jointly estimate all user codewords, doing which directly is computationally infeasible. We propose a low-complexity solution, termed coded orthogonal modulation multiple-access (COMMA), in which users first encode their messages via a long (multi-user interference aware) outer code operating over a $q$-ary alphabet. These symbols are modulated onto $q$ orthogonal waveforms. At the decoder a multiple-measurement vector approximate message passing (MMV-AMP) algorithm estimates several candidates (out of $q$) for each user, with the remaining uncertainty resolved by the single-user outer decoders. Numerically, we show that COMMA outperforms a standard solution based on linear multiuser detection (MUD) with Gaussian signaling. Theoretically, we derive bounds and scaling laws for $M$, the number of users $K_a$, SNR, and $q$, allowing to quantify the trade-off between receive antennas and spectral efficiency. The orthogonal signaling scheme is applicable to unsourced random access and, with chirp sequences as basis, allows for low-complexity fast Fourier transform (FFT) based receivers that are resilient to frequency and timing offsets. △ Less

Submitted 3 July, 2023; originally announced July 2023.

arXiv:2306.16735 [pdf, ps, other]

Comparing Poisson and Gaussian channels (extended)

Authors: Anzo Teh, Yury Polyanskiy

Abstract: Consider a pair of input distributions which after passing through a Poisson channel become $ε$-close in total variation. We show that they must necessarily then be $ε^{0.5+o(1)}$-close after passing through a Gaussian channel as well. In the opposite direction, we show that distributions inducing $ε$-close outputs over the Gaussian channel must induce $ε^{1+o(1)}$-close outputs over the Poisson.… ▽ More Consider a pair of input distributions which after passing through a Poisson channel become $ε$-close in total variation. We show that they must necessarily then be $ε^{0.5+o(1)}$-close after passing through a Gaussian channel as well. In the opposite direction, we show that distributions inducing $ε$-close outputs over the Gaussian channel must induce $ε^{1+o(1)}$-close outputs over the Poisson. This quantifies a well-known intuition that ''smoothing'' induced by Poissonization and Gaussian convolution are similar. As an application, we improve a recent upper bound of Han-Miao-Shen'2021 for estimating mixing distribution of a Poisson mixture in Gaussian optimal transport distance from $n^{-0.1 + o(1)}$ to $n^{-0.25 + o(1)}$. △ Less

Submitted 29 June, 2023; originally announced June 2023.

Comments: 7 pages, 2 figures, to appear in the 2023 IEEE International Symposium on Information Theory (ISIT)

arXiv:2306.14411 [pdf, other]

Score-based Source Separation with Applications to Digital Communication Signals

Authors: Tejas Jayashankar, Gary C. F. Lee, Alejandro Lancho, Amir Weiss, Yury Polyanskiy, Gregory W. Wornell

Abstract: We propose a new method for separating superimposed sources using diffusion-based generative models. Our method relies only on separately trained statistical priors of independent sources to establish a new objective function guided by maximum a posteriori estimation with an $α$-posterior, across multiple levels of Gaussian smoothing. Motivated by applications in radio-frequency (RF) systems, we a… ▽ More We propose a new method for separating superimposed sources using diffusion-based generative models. Our method relies only on separately trained statistical priors of independent sources to establish a new objective function guided by maximum a posteriori estimation with an $α$-posterior, across multiple levels of Gaussian smoothing. Motivated by applications in radio-frequency (RF) systems, we are interested in sources with underlying discrete nature and the recovery of encoded bits from a signal of interest, as measured by the bit error rate (BER). Experimental results with RF mixtures demonstrate that our method results in a BER reduction of 95% over classical and existing learning-based methods. Our analysis demonstrates that our proposed method yields solutions that asymptotically approach the modes of an underlying discrete distribution. Furthermore, our method can be viewed as a multi-source extension to the recently proposed score distillation sampling scheme, shedding additional light on its use beyond conditional sampling. The project webpage is available at https://alpha-rgs.github.io △ Less

Submitted 17 January, 2024; v1 submitted 26 June, 2023; originally announced June 2023.

Comments: 34 pages, 18 figures, for associated project webpage see https://alpha-rgs.github.io

arXiv:2306.12308 [pdf, ps, other]

Entropic characterization of optimal rates for learning Gaussian mixtures

Authors: Zeyu Jia, Yury Polyanskiy, Yihong Wu

Abstract: We consider the question of estimating multi-dimensional Gaussian mixtures (GM) with compactly supported or subgaussian mixing distributions. Minimax estimation rate for this class (under Hellinger, TV and KL divergences) is a long-standing open question, even in one dimension. In this paper we characterize this rate (for all constant dimensions) in terms of the metric entropy of the class. Such c… ▽ More We consider the question of estimating multi-dimensional Gaussian mixtures (GM) with compactly supported or subgaussian mixing distributions. Minimax estimation rate for this class (under Hellinger, TV and KL divergences) is a long-standing open question, even in one dimension. In this paper we characterize this rate (for all constant dimensions) in terms of the metric entropy of the class. Such characterizations originate from seminal works of Le Cam (1973); Birge (1983); Haussler and Opper (1997); Yang and Barron (1999). However, for GMs a key ingredient missing from earlier work (and widely sought-after) is a comparison result showing that the KL and the squared Hellinger distance are within a constant multiple of each other uniformly over the class. Our main technical contribution is in showing this fact, from which we derive entropy characterization for estimation rate under Hellinger and KL. Interestingly, the sequential (online learning) estimation rate is characterized by the global entropy, while the single-step (batch) rate corresponds to local entropy, paralleling a similar result for the Gaussian sequence model recently discovered by Neykov (2022) and Mourtada (2023). Additionally, since Hellinger is a proper metric, our comparison shows that GMs under KL satisfy the triangle inequality within multiplicative constants, implying that proper and improper estimation rates coincide. △ Less

Submitted 27 June, 2023; v1 submitted 21 June, 2023; originally announced June 2023.

arXiv:2306.11085 [pdf, ps, other]

Minimax optimal testing by classification

Authors: Patrik Róbert Gerber, Yanjun Han, Yury Polyanskiy

Abstract: This paper considers an ML inspired approach to hypothesis testing known as classifier/classification-accuracy testing ($\mathsf{CAT}$). In $\mathsf{CAT}$, one first trains a classifier by feeding it labeled synthetic samples generated by the null and alternative distributions, which is then used to predict labels of the actual data samples. This method is widely used in practice when the null and… ▽ More This paper considers an ML inspired approach to hypothesis testing known as classifier/classification-accuracy testing ($\mathsf{CAT}$). In $\mathsf{CAT}$, one first trains a classifier by feeding it labeled synthetic samples generated by the null and alternative distributions, which is then used to predict labels of the actual data samples. This method is widely used in practice when the null and alternative are only specified via simulators (as in many scientific experiments). We study goodness-of-fit, two-sample ($\mathsf{TS}$) and likelihood-free hypothesis testing ($\mathsf{LFHT}$), and show that $\mathsf{CAT}$ achieves (near-)minimax optimal sample complexity in both the dependence on the total-variation ($\mathsf{TV}$) separation $ε$ and the probability of error $δ$ in a variety of non-parametric settings, including discrete distributions, $d$-dimensional distributions with a smooth density, and the Gaussian sequence model. In particular, we close the high probability sample complexity of $\mathsf{LFHT}$ for each class. As another highlight, we recover the minimax optimal complexity of $\mathsf{TS}$ over discrete distributions, which was recently established by Diakonikolas et al. (2021). The corresponding $\mathsf{CAT}$ simply compares empirical frequencies in the first half of the data, and rejects the null when the classification accuracy on the second half is better than random. △ Less

Submitted 19 June, 2023; originally announced June 2023.

arXiv:2305.09995 [pdf, ps, other]

Algorithmic Decorrelation and Planted Clique in Dependent Random Graphs: The Case of Extra Triangles

Authors: Guy Bresler, Chenghao Guo, Yury Polyanskiy

Abstract: We aim to understand the extent to which the noise distribution in a planted signal-plus-noise problem impacts its computational complexity. To that end, we consider the planted clique and planted dense subgraph problems, but in a different ambient graph. Instead of Erdős-Rényi $G(n,p)$, which has independent edges, we take the ambient graph to be the random graph with triangles (RGT) obtained by… ▽ More We aim to understand the extent to which the noise distribution in a planted signal-plus-noise problem impacts its computational complexity. To that end, we consider the planted clique and planted dense subgraph problems, but in a different ambient graph. Instead of Erdős-Rényi $G(n,p)$, which has independent edges, we take the ambient graph to be the random graph with triangles (RGT) obtained by adding triangles to $G(n,p)$. We show that the RGT can be efficiently mapped to the corresponding $G(n,p)$, and moreover, that the planted clique (or dense subgraph) is approximately preserved under this mapping. This constitutes the first average-case reduction transforming dependent noise to independent noise. Together with the easier direction of mapping the ambient graph from Erdős-Rényi to RGT, our results yield a strong equivalence between models. In order to prove our results, we develop a new general framework for reasoning about the validity of average-case reductions based on low sensitivity to perturbations. △ Less

Submitted 27 June, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

Comments: 71 pages

arXiv:2305.06985 [pdf, other]

On the Advantages of Asynchrony in the Unsourced MAC

Authors: Alexander Fengler, Alejandro Lancho, Krishna Narayanan, Yury Polyanskiy

Abstract: In this work we demonstrate how a lack of synchronization can in fact be advantageous in the problem of random access. Specifically, we consider a multiple-access problem over a frame-asynchronous 2-user binary-input adder channel in the unsourced setup (2-UBAC). Previous work has shown that under perfect synchronization the per-user rates achievable with linear codes over the 2-UBAC are limited b… ▽ More In this work we demonstrate how a lack of synchronization can in fact be advantageous in the problem of random access. Specifically, we consider a multiple-access problem over a frame-asynchronous 2-user binary-input adder channel in the unsourced setup (2-UBAC). Previous work has shown that under perfect synchronization the per-user rates achievable with linear codes over the 2-UBAC are limited by 0.5 bit per channel use (compared to the capacity of 0.75). In this paper, we first demonstrate that arbitrary small (even single-bit) shift between the user's frames enables (random) linear codes to attain full capacity of 0.75 bit/user. Furthermore, we derive density evolution equations for irregular LDPC codes, and prove (via concentration arguments) that they correctly track the asymptotic bit-error rate of a BP decoder. Optimizing the degree distributions we construct LDPC codes achieving per-user rates of 0.73 bit per channel use. △ Less

Submitted 11 May, 2023; originally announced May 2023.

Comments: Accepted for presentation at IEEE ISIT 2023

arXiv:2305.05465 [pdf, other]

The emergence of clusters in self-attention dynamics

Authors: Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, Philippe Rigollet

Abstract: Viewing Transformers as interacting particle systems, we describe the geometry of learned representations when the weights are not time dependent. We show that particles, representing tokens, tend to cluster toward particular limiting objects as time tends to infinity. Cluster locations are determined by the initial tokens, confirming context-awareness of representations learned by Transformers. U… ▽ More Viewing Transformers as interacting particle systems, we describe the geometry of learned representations when the weights are not time dependent. We show that particles, representing tokens, tend to cluster toward particular limiting objects as time tends to infinity. Cluster locations are determined by the initial tokens, confirming context-awareness of representations learned by Transformers. Using techniques from dynamical systems and partial differential equations, we show that the type of limiting object that emerges depends on the spectrum of the value matrix. Additionally, in the one-dimensional case we prove that the self-attention matrix converges to a low-rank Boolean matrix. The combination of these results mathematically confirms the empirical observation made by Vaswani et al. [VSP'17] that leaders appear in a sequence of tokens when processed by Transformers. △ Less

Submitted 12 February, 2024; v1 submitted 9 May, 2023; originally announced May 2023.

arXiv:2303.14689 [pdf, other]

Weak Recovery Threshold for the Hypergraph Stochastic Block Model

Authors: Yuzhou Gu, Yury Polyanskiy

Abstract: We study the weak recovery problem on the $r$-uniform hypergraph stochastic block model ($r$-HSBM) with two balanced communities. In HSBM a random graph is constructed by placing hyperedges with higher density if all vertices of a hyperedge share the same binary label, and weak recovery asks to recover a non-trivial fraction of the labels. We introduce a multi-terminal version of strong data proce… ▽ More We study the weak recovery problem on the $r$-uniform hypergraph stochastic block model ($r$-HSBM) with two balanced communities. In HSBM a random graph is constructed by placing hyperedges with higher density if all vertices of a hyperedge share the same binary label, and weak recovery asks to recover a non-trivial fraction of the labels. We introduce a multi-terminal version of strong data processing inequalities (SDPIs), which we call the multi-terminal SDPI, and use it to prove a variety of impossibility results for weak recovery. In particular, we prove that weak recovery is impossible below the Kesten-Stigum (KS) threshold if $r=3,4$, or a strength parameter $λ$ is at least $\frac 15$. Prior work Pal and Zhu (2021) established that weak recovery in HSBM is always possible above the KS threshold. Consequently, there is no information-computation gap for these cases, which (partially) resolves a conjecture of Angelini et al. (2015). To our knowledge this is the first impossibility result for HSBM weak recovery. As usual, we reduce the study of non-recovery of HSBM to the study of non-reconstruction in a related broadcasting on hypertrees (BOHT) model. While we show that BOHT's reconstruction threshold coincides with KS for $r=3,4$, surprisingly, we demonstrate that for $r\ge 7$ reconstruction is possible also below KS. This shows an interesting phase transition in the parameter $r$, and suggests that for $r\ge 7$, there might be an information-computation gap for the HSBM. For $r=5,6$ and large degree we propose an approach for showing non-reconstruction below KS, suggesting that $r=7$ is the correct threshold for onset of the new phase. △ Less

Submitted 27 June, 2023; v1 submitted 26 March, 2023; originally announced March 2023.

arXiv:2303.14688 [pdf, ps, other]

Uniqueness of BP fixed point for the Potts model and applications to community detection

Authors: Yuzhou Gu, Yury Polyanskiy

Abstract: In the study of sparse stochastic block models (SBMs) one often needs to analyze a distributional recursion, known as the belief propagation (BP) recursion. Uniqueness of the fixed point of this recursion implies several results about the SBM, including optimal recovery algorithms for SBM (Mossel et al. (2016)) and SBM with side information (Mossel and Xu (2016)), and a formula for SBM mutual info… ▽ More In the study of sparse stochastic block models (SBMs) one often needs to analyze a distributional recursion, known as the belief propagation (BP) recursion. Uniqueness of the fixed point of this recursion implies several results about the SBM, including optimal recovery algorithms for SBM (Mossel et al. (2016)) and SBM with side information (Mossel and Xu (2016)), and a formula for SBM mutual information (Abbe et al. (2021)). The 2-community case corresponds to an Ising model, for which Yu and Polyanskiy (2022) established uniqueness for all cases. In this paper we analyze the $q$-ary Potts model, i.e., broadcasting of $q$-ary spins on a Galton-Watson tree with expected offspring degree $d$ through Potts channels with second-largest eigenvalue $λ$. We allow the intermediate vertices to be observed through noisy channels (side information). We prove that BP uniqueness holds with and without side information when $dλ^2 \ge 1 + C \max\{λ, q^{-1}\}\log q$ for some absolute constant $C>0$ independent of $q,λ,d$. For large $q$ and $λ= o(1/\log q)$, this is asymptotically achieving the Kesten-Stigum threshold $dλ^2=1$. These results imply mutual information formulas and optimal recovery algorithms for the $q$-community SBM in the corresponding ranges. For $q\ge 4$, Sly (2011); Mossel et al. (2022) showed that there exist choices of $q,λ,d$ below Kesten-Stigum (i.e. $dλ^2 < 1$) but reconstruction is possible. Somewhat surprisingly, we show that in such regimes BP uniqueness does not hold at least in the presence of weak side information. Our technical tool is a theory of $q$-ary symmetric channels, that we initiate here, generalizing the classical and widely-utilized information-theoretic characterization of BMS (binary memoryless symmetric) channels. △ Less

Submitted 27 June, 2023; v1 submitted 26 March, 2023; originally announced March 2023.

arXiv:2303.06438 [pdf, ps, other]

doi 10.1109/ICASSP49357.2023.10096702

On Neural Architectures for Deep Learning-based Source Separation of Co-Channel OFDM Signals

Authors: Gary C. F. Lee, Amir Weiss, Alejandro Lancho, Yury Polyanskiy, Gregory W. Wornell

Abstract: We study the single-channel source separation problem involving orthogonal frequency-division multiplexing (OFDM) signals, which are ubiquitous in many modern-day digital communication systems. Related efforts have been pursued in monaural source separation, where state-of-the-art neural architectures have been adopted to train an end-to-end separator for audio signals (as 1-dimensional time serie… ▽ More We study the single-channel source separation problem involving orthogonal frequency-division multiplexing (OFDM) signals, which are ubiquitous in many modern-day digital communication systems. Related efforts have been pursued in monaural source separation, where state-of-the-art neural architectures have been adopted to train an end-to-end separator for audio signals (as 1-dimensional time series). In this work, through a prototype problem based on the OFDM source model, we assess -- and question -- the efficacy of using audio-oriented neural architectures in separating signals based on features pertinent to communication waveforms. Perhaps surprisingly, we demonstrate that in some configurations, where perfect separation is theoretically attainable, these audio-oriented neural architectures perform poorly in separating co-channel OFDM waveforms. Yet, we propose critical domain-informed modifications to the network parameterization, based on insights from OFDM structures, that can confer about 30 dB improvement in performance. △ Less

Submitted 15 March, 2023; v1 submitted 11 March, 2023; originally announced March 2023.

arXiv:2302.04658 [pdf, ps, other]

The Sample Complexity of Approximate Rejection Sampling with Applications to Smoothed Online Learning

Authors: Adam Block, Yury Polyanskiy

Abstract: Suppose we are given access to $n$ independent samples from distribution $μ$ and we wish to output one of them with the goal of making the output distributed as close as possible to a target distribution $ν$. In this work we show that the optimal total variation distance as a function of $n$ is given by $\tildeΘ(\frac{D}{f'(n)})$ over the class of all pairs $ν,μ$ with a bounded $f$-divergence… ▽ More Suppose we are given access to $n$ independent samples from distribution $μ$ and we wish to output one of them with the goal of making the output distributed as close as possible to a target distribution $ν$. In this work we show that the optimal total variation distance as a function of $n$ is given by $\tildeΘ(\frac{D}{f'(n)})$ over the class of all pairs $ν,μ$ with a bounded $f$-divergence $D_f(ν\|μ)\leq D$. Previously, this question was studied only for the case when the Radon-Nikodym derivative of $ν$ with respect to $μ$ is uniformly bounded. We then consider an application in the seemingly very different field of smoothed online learning, where we show that recent results on the minimax regret and the regret of oracle-efficient algorithms still hold even under relaxed constraints on the adversary (to have bounded $f$-divergence, as opposed to bounded Radon-Nikodym derivative). Finally, we also study efficacy of importance sampling for mean estimates uniform over a function class and compare importance sampling with rejection sampling. △ Less

Submitted 23 February, 2024; v1 submitted 9 February, 2023; originally announced February 2023.

Comments: Corrected mistake in proof of Lemma 27 from the COLT 2023 version of this paper

arXiv:2211.15242 [pdf, other]

Ising Model on Locally Tree-like Graphs: Uniqueness of Solutions to Cavity Equations

Authors: Qian Yu, Yury Polyanskiy

Abstract: In the study of Ising models on large locally tree-like graphs, in both rigorous and non-rigorous methods one is often led to understanding the so-called belief propagation distributional recursions and its fixed points. We prove that there is at most one non-trivial fixed point for Ising models with zero or certain random external fields. Previously this was only known for sufficiently ``low-temp… ▽ More In the study of Ising models on large locally tree-like graphs, in both rigorous and non-rigorous methods one is often led to understanding the so-called belief propagation distributional recursions and its fixed points. We prove that there is at most one non-trivial fixed point for Ising models with zero or certain random external fields. Previously this was only known for sufficiently ``low-temperature'' models. Our main innovation is in applying information-theoretic ideas of channel comparison leading to a new metric (degradation index) between binary-input-symmetric (BMS) channels under which the Belief Propagation (BP) operator is a strict contraction (albeit non-multiplicative). A key ingredient of our proof is a strengthening of the classical stringy tree lemma of (Evans-Kenyon-Peres-Schulman'00). Our result simultaneously closes the following 6 conjectures in the literature: 1) independence of robust reconstruction accuracy to leaf noise in broadcasting on trees (Mossel-Neeman-Sly'16); 2) uselessness of global information for a labeled 2-community stochastic block model, or 2-SBM (Kanade-Mossel-Schramm'16); 3) optimality of local algorithms for 2-SBM under noisy side information (Mossel-Xu'16); 4) uniqueness of BP fixed point in broadcasting on trees in the Gaussian (large degree) limit (ibid); 5) boundary irrelevance in broadcasting on trees (Abbe-Cornacchia-Gu-Polyanskiy'21); 6) characterization of entropy (and mutual information) of community labels given the graph in 2-SBM (ibid). △ Less

Submitted 31 July, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

arXiv:2211.01126 [pdf, other]

Likelihood-free hypothesis testing

Authors: Patrik Róbert Gerber, Yury Polyanskiy

Abstract: Consider the problem of binary hypothesis testing. Given $Z$ coming from either $\mathbb P^{\otimes m}$ or $\mathbb Q^{\otimes m}$, to decide between the two with small probability of error it is sufficient, and in many cases necessary, to have $m\asymp1/ε^2$, where $ε$ measures the separation between $\mathbb P$ and $\mathbb Q$ in total variation ($\mathsf{TV}$). Achieving this, however, requires… ▽ More Consider the problem of binary hypothesis testing. Given $Z$ coming from either $\mathbb P^{\otimes m}$ or $\mathbb Q^{\otimes m}$, to decide between the two with small probability of error it is sufficient, and in many cases necessary, to have $m\asymp1/ε^2$, where $ε$ measures the separation between $\mathbb P$ and $\mathbb Q$ in total variation ($\mathsf{TV}$). Achieving this, however, requires complete knowledge of the distributions and can be done, for example, using the Neyman-Pearson test. In this paper we consider a variation of the problem which we call likelihood-free hypothesis testing, where access to $\mathbb P$ and $\mathbb Q$ is given through $n$ i.i.d. observations from each. In the case when $\mathbb P$ and $\mathbb Q$ are assumed to belong to a non-parametric family, we demonstrate the existence of a fundamental trade-off between $n$ and $m$ given by $nm\asymp n_\sf{GoF}^2(ε)$, where $n_\sf{GoF}(ε)$ is the minimax sample complexity of testing between the hypotheses $H_0:\, \mathbb P=\mathbb Q$ vs $H_1:\, \mathsf{TV}(\mathbb P,\mathbb Q)\geqε$. We show this for three families of distributions, in addition to the family of all discrete distributions for which we obtain a more complicated trade-off exhibiting an additional phase-transition. Our results demonstrate the possibility of testing without fully estimating $\mathbb P$ and $\mathbb Q$, provided $m \gg 1/ε^2$. △ Less

Submitted 7 March, 2024; v1 submitted 2 November, 2022; originally announced November 2022.

Comments: 58 pages, 1 figure

arXiv:2210.01951 [pdf, other]

Finite-Blocklength Results for the A-channel: Applications to Unsourced Random Access and Group Testing

Authors: Alejandro Lancho, Alexander Fengler, Yury Polyanskiy

Abstract: We present finite-blocklength achievability bounds for the unsourced A-channel. In this multiple-access channel, users noiselessly transmit codewords picked from a common codebook with entries generated from a $q$-ary alphabet. At each channel use, the receiver observes the set of different transmitted symbols but not their multiplicity. We show that the A-channel finds applications in unsourced r… ▽ More We present finite-blocklength achievability bounds for the unsourced A-channel. In this multiple-access channel, users noiselessly transmit codewords picked from a common codebook with entries generated from a $q$-ary alphabet. At each channel use, the receiver observes the set of different transmitted symbols but not their multiplicity. We show that the A-channel finds applications in unsourced random-access (URA) and group testing. Leveraging the insights provided by the finite-blocklength bounds and the connection between URA and non-adaptive group testing through the A-channel, we propose improved decoding methods for state-of-the-art A-channel codes and we showcase how A-channel codes provide a new class of structured group testing matrices. The developed bounds allow to evaluate the achievable error probabilities of group testing matrices based on random A-channel codes for arbitrary numbers of tests, items and defectives. We show that such a construction asymptotically achieves the optimal number of tests. In addition, every efficiently decodable A-channel code can be used to construct a group testing matrix with sub-linear recovery time. △ Less

Submitted 4 October, 2022; originally announced October 2022.

Comments: 11 pages, 4 figures, extended version of the paper presented at the 58th Annual Allerton Conference on Communication, Control, and Computing (2022)

arXiv:2209.04871 [pdf, other]

doi 10.1109/GLOBECOM48099.2022.10001513

Data-Driven Blind Synchronization and Interference Rejection for Digital Communication Signals

Authors: Alejandro Lancho, Amir Weiss, Gary C. F. Lee, Jennifer Tang, Yuheng Bu, Yury Polyanskiy, Gregory W. Wornell

Abstract: We study the potential of data-driven deep learning methods for separation of two communication signals from an observation of their mixture. In particular, we assume knowledge on the generation process of one of the signals, dubbed signal of interest (SOI), and no knowledge on the generation process of the second signal, referred to as interference. This form of the single-channel source separati… ▽ More We study the potential of data-driven deep learning methods for separation of two communication signals from an observation of their mixture. In particular, we assume knowledge on the generation process of one of the signals, dubbed signal of interest (SOI), and no knowledge on the generation process of the second signal, referred to as interference. This form of the single-channel source separation problem is also referred to as interference rejection. We show that capturing high-resolution temporal structures (nonstationarities), which enables accurate synchronization to both the SOI and the interference, leads to substantial performance gains. With this key insight, we propose a domain-informed neural network (NN) design that is able to improve upon both "off-the-shelf" NNs and classical detection and interference rejection methods, as demonstrated in our simulations. Our findings highlight the key role communication-specific domain knowledge plays in the development of data-driven approaches that hold the promise of unprecedented gains. △ Less

Submitted 11 September, 2022; originally announced September 2022.

Comments: 9 pages, 6 figures, accepted at IEEE GLOBECOM 2022 (this version contains extended proofs)

arXiv:2209.01328 [pdf, other]

Optimal empirical Bayes estimation for the Poisson model via minimum-distance methods

Authors: Soham Jana, Yury Polyanskiy, Yihong Wu

Abstract: The Robbins estimator is the most iconic and widely used procedure in the empirical Bayes literature for the Poisson model. On one hand, this method has been recently shown to be minimax optimal in terms of the regret (excess risk over the Bayesian oracle that knows the true prior) for various nonparametric classes of priors. On the other hand, it has been long recognized in practice that Robbins… ▽ More The Robbins estimator is the most iconic and widely used procedure in the empirical Bayes literature for the Poisson model. On one hand, this method has been recently shown to be minimax optimal in terms of the regret (excess risk over the Bayesian oracle that knows the true prior) for various nonparametric classes of priors. On the other hand, it has been long recognized in practice that Robbins estimator lacks the desired smoothness and monotonicity of Bayes estimators and can be easily derailed by those data points that were rarely observed before. Based on the minimum-distance distance method, we propose a suite of empirical Bayes estimators, including the classical nonparametric maximum likelihood, that outperform the Robbins method in a variety of synthetic and real data sets and retain its optimality in terms of minimax regret. △ Less

Submitted 7 May, 2024; v1 submitted 3 September, 2022; originally announced September 2022.

Comments: 28 pages, 7 figures, 3 tables. Added an extension to a multivariate setup and a comparison with the worst case prior

MSC Class: 62G05; 62G07

arXiv:2208.10325 [pdf, other]

doi 10.1109/MLSP55214.2022.9943311

Exploiting Temporal Structures of Cyclostationary Signals for Data-Driven Single-Channel Source Separation

Authors: Gary C. F. Lee, Amir Weiss, Alejandro Lancho, Jennifer Tang, Yuheng Bu, Yury Polyanskiy, Gregory W. Wornell

Abstract: We study the problem of single-channel source separation (SCSS), and focus on cyclostationary signals, which are particularly suitable in a variety of application domains. Unlike classical SCSS approaches, we consider a setting where only examples of the sources are available rather than their models, inspiring a data-driven approach. For source models with underlying cyclostationary Gaussian cons… ▽ More We study the problem of single-channel source separation (SCSS), and focus on cyclostationary signals, which are particularly suitable in a variety of application domains. Unlike classical SCSS approaches, we consider a setting where only examples of the sources are available rather than their models, inspiring a data-driven approach. For source models with underlying cyclostationary Gaussian constituents, we establish a lower bound on the attainable mean squared error (MSE) for any separation method, model-based or data-driven. Our analysis further reveals the operation for optimal separation and the associated implementation challenges. As a computationally attractive alternative, we propose a deep learning approach using a U-Net architecture, which is competitive with the minimum MSE estimator. We demonstrate in simulation that, with suitable domain-informed architectural choices, our U-Net method can approach the optimal performance with substantially reduced computational burden. △ Less

Submitted 22 August, 2022; originally announced August 2022.

arXiv:2205.03752 [pdf, other]

Efficient Representation of Large-Alphabet Probability Distributions

Authors: Aviv Adler, Jennifer Tang, Yury Polyanskiy

Abstract: A number of engineering and scientific problems require representing and manipulating probability distributions over large alphabets, which we may think of as long vectors of reals summing to $1$. In some cases it is required to represent such a vector with only $b$ bits per entry. A natural choice is to partition the interval $[0,1]$ into $2^b$ uniform bins and quantize entries to each bin indepe… ▽ More A number of engineering and scientific problems require representing and manipulating probability distributions over large alphabets, which we may think of as long vectors of reals summing to $1$. In some cases it is required to represent such a vector with only $b$ bits per entry. A natural choice is to partition the interval $[0,1]$ into $2^b$ uniform bins and quantize entries to each bin independently. We show that a minor modification of this procedure -- applying an entrywise non-linear function (compander) $f(x)$ prior to quantization -- yields an extremely effective quantization method. For example, for $b=8 (16)$ and $10^5$-sized alphabets, the quality of representation improves from a loss (under KL divergence) of $0.5 (0.1)$ bits/entry to $10^{-4} (10^{-9})$ bits/entry. Compared to floating point representations, our compander method improves the loss from $10^{-1}(10^{-6})$ to $10^{-4}(10^{-9})$ bits/entry. These numbers hold for both real-world data (word frequencies in books and DNA $k$-mer counts) and for synthetic randomly generated distributions. Theoretically, we set up a minimax optimality criterion and show that the compander $f(x) ~\propto~ \mathrm{ArcSinh}(\sqrt{(1/2) (K \log K) x})$ achieves near-optimal performance, attaining a KL-quantization loss of $\asymp 2^{-2b} \log^2 K$ for a $K$-letter alphabet and $b\to \infty$. Interestingly, a similar minimax criterion for the quadratic loss on the hypercube shows optimality of the standard uniform quantizer. This suggests that the $\mathrm{ArcSinh}$ quantizer is as fundamental for KL-distortion as the uniform quantizer for quadratic distortion. △ Less

Submitted 26 October, 2023; v1 submitted 7 May, 2022; originally announced May 2022.

Comments: corrected typos

arXiv:2205.02128 [pdf, ps, other]

Rate of convergence of the smoothed empirical Wasserstein distance

Authors: Adam Block, Zeyu Jia, Yury Polyanskiy, Alexander Rakhlin

Abstract: Consider an empirical measure $\mathbb{P}_n$ induced by $n$ iid samples from a $d$-dimensional $K$-subgaussian distribution $\mathbb{P}$ and let $γ= N(0,σ^2 I_d)$ be the isotropic Gaussian measure. We study the speed of convergence of the smoothed Wasserstein distance $W_2(\mathbb{P}_n * γ, \mathbb{P}*γ) = n^{-α+ o(1)}$ with $*$ being the convolution of measures. For $K<σ$ and in any dimension… ▽ More Consider an empirical measure $\mathbb{P}_n$ induced by $n$ iid samples from a $d$-dimensional $K$-subgaussian distribution $\mathbb{P}$ and let $γ= N(0,σ^2 I_d)$ be the isotropic Gaussian measure. We study the speed of convergence of the smoothed Wasserstein distance $W_2(\mathbb{P}_n * γ, \mathbb{P}*γ) = n^{-α+ o(1)}$ with $*$ being the convolution of measures. For $K<σ$ and in any dimension $d\ge 1$ we show that $α= {1\over2}$. For $K>σ$ in dimension $d=1$ we show that the rate is slower and is given by $α= {(σ^2 + K^2)^2\over 4 (σ^4 + K^4)} < 1/2$. This resolves several open problems in [GGNWP20], and in particular precisely identifies the amount of smoothing $σ$ needed to obtain a parametric rate. In addition, for any $d$-dimensional $K$-subgaussian distribution $\mathbb{P}$, we also establish that $D_{KL}(\mathbb{P}_n * γ\|\mathbb{P}*γ)$ has rate $O(1/n)$ for $K<σ$ but only slows down to $O({(\log n)^{d+1}\over n})$ for $K>σ$. The surprising difference of the behavior of $W_2^2$ and KL implies the failure of $T_{2}$-transportation inequality when $σ< K$. Consequently, it follows that for $K>σ$ the log-Sobolev inequality (LSI) for the Gaussian mixture $\mathbb{P} * N(0, σ^{2})$ cannot hold. This closes an open problem in [WW+16], who established the LSI under the condition $K<σ$ and asked if their bound can be improved. △ Less

Submitted 15 August, 2024; v1 submitted 4 May, 2022; originally announced May 2022.

arXiv:2204.06357 [pdf, other]

Linear Programs with Polynomial Coefficients and Applications to 1D Cellular Automata

Authors: Guy Bresler, Chenghao Guo, Yury Polyanskiy

Abstract: Given a matrix $A$ and vector $b$ with polynomial entries in $d$ real variables $δ=(δ_1,\ldots,δ_d)$ we consider the following notion of feasibility: the pair $(A,b)$ is locally feasible if there exists an open neighborhood $U$ of $0$ such that for every $δ\in U$ there exists $x$ satisfying $A(δ)x\ge b(δ)$ entry-wise. For $d=1$ we construct a polynomial time algorithm for deciding local feasibilit… ▽ More Given a matrix $A$ and vector $b$ with polynomial entries in $d$ real variables $δ=(δ_1,\ldots,δ_d)$ we consider the following notion of feasibility: the pair $(A,b)$ is locally feasible if there exists an open neighborhood $U$ of $0$ such that for every $δ\in U$ there exists $x$ satisfying $A(δ)x\ge b(δ)$ entry-wise. For $d=1$ we construct a polynomial time algorithm for deciding local feasibility. For $d \ge 2$ we show local feasibility is NP-hard. This also gives the first polynomial-time algorithm for the asymptotic linear program problem introduced by Jeroslow in 1973. As an application (which was the primary motivation for this work) we give a computer-assisted proof of ergodicity of the following elementary 1D cellular automaton: given the current state $η_t \in \{0,1\}^{\mathbb{Z}}$ the next state $η_{t+1}(n)$ at each vertex $n\in \mathbb{Z}$ is obtained by $η_{t+1}(n)= \text{NAND}\big(\text{BSC}_δ(η_t(n-1)), \text{BSC}_δ(η_t(n))\big)$. Here the binary symmetric channel $\text{BSC}_δ$ takes a bit as input and flips it with probability $δ$ (and leaves it unchanged with probability $1-δ$). It is shown that there exists $δ_0>0$ such that for all $0<δ<δ_0$ the distribution of $η_t$ converges to a unique stationary measure irrespective of the initial condition $η_0$. We also consider the problem of broadcasting information on the 2D-grid of noisy binary-symmetric channels $\text{BSC}_δ$, where each node may apply an arbitrary processing function to its input bits. We prove that there exists $δ_0'>0$ such that for all noise levels $0<δ<δ_0'$ it is impossible to broadcast information for any processing function, as conjectured by Makur, Mossel and Polyanskiy. △ Less

Submitted 9 May, 2023; v1 submitted 13 April, 2022; originally announced April 2022.

arXiv:2111.00559 [pdf, other]

doi 10.1109/TIT.2023.3247812

Capacity of Noisy Permutation Channels

Authors: Jennifer Tang, Yury Polyanskiy

Abstract: We establish the capacity of a class of communication channels introduced in [1]. The $n$-letter input from a finite alphabet is passed through a discrete memoryless channel $P_{Z|X}$ and then the output $n$-letter sequence is uniformly permuted. We show that the maximal communication rate (normalized by $\log n$) equals $1/2 (rank(P_{Z|X})-1)$ whenever $P_{Z|X}$ is strictly positive. This is done… ▽ More We establish the capacity of a class of communication channels introduced in [1]. The $n$-letter input from a finite alphabet is passed through a discrete memoryless channel $P_{Z|X}$ and then the output $n$-letter sequence is uniformly permuted. We show that the maximal communication rate (normalized by $\log n$) equals $1/2 (rank(P_{Z|X})-1)$ whenever $P_{Z|X}$ is strictly positive. This is done by establishing a converse bound matching the achievability of [1]. The two main ingredients of our proof are (1) a sharp bound on the entropy of a uniformly sampled vector from a type class and observed through a DMC; and (2) the covering $ε$-net of a probability simplex with Kullback-Leibler divergence as a metric. In addition to strictly positive DMC we also find the noisy permutation capacity for $q$-ary erasure channels, the Z-channel and others. △ Less

Submitted 26 October, 2023; v1 submitted 31 October, 2021; originally announced November 2021.

Comments: draft version updated to newer version

Journal ref: in IEEE Transactions on Information Theory, vol. 69, no. 7, pp. 4145-4162, July 2023

arXiv:2109.03943 [pdf, ps, other]

Sharp regret bounds for empirical Bayes and compound decision problems

Authors: Yury Polyanskiy, Yihong Wu

Abstract: We consider the classical problems of estimating the mean of an $n$-dimensional normally (with identity covariance matrix) or Poisson distributed vector under the squared loss. In a Bayesian setting the optimal estimator is given by the prior-dependent conditional mean. In a frequentist setting various shrinkage methods were developed over the last century. The framework of empirical Bayes, put fo… ▽ More We consider the classical problems of estimating the mean of an $n$-dimensional normally (with identity covariance matrix) or Poisson distributed vector under the squared loss. In a Bayesian setting the optimal estimator is given by the prior-dependent conditional mean. In a frequentist setting various shrinkage methods were developed over the last century. The framework of empirical Bayes, put forth by Robbins (1956), combines Bayesian and frequentist mindsets by postulating that the parameters are independent but with an unknown prior and aims to use a fully data-driven estimator to compete with the Bayesian oracle that knows the true prior. The central figure of merit is the regret, namely, the total excess risk over the Bayes risk in the worst case (over the priors). Although this paradigm was introduced more than 60 years ago, little is known about the asymptotic scaling of the optimal regret in the nonparametric setting. We show that for the Poisson model with compactly supported and subexponential priors, the optimal regret scales as $Θ((\frac{\log n}{\log\log n})^2)$ and $Θ(\log^3 n)$, respectively, both attained by the original estimator of Robbins. For the normal mean model, the regret is shown to be at least $Ω((\frac{\log n}{\log\log n})^2)$ and $Ω(\log^2 n)$ for compactly supported and subgaussian priors, respectively, the former of which resolves the conjecture of Singh (1979) on the impossibility of achieving bounded regret; before this work, the best regret lower bound was $Ω(1)$. In addition to the empirical Bayes setting, these results are shown to hold in the compound setting where the parameters are deterministic. As a side application, the construction in this paper also leads to improved or new lower bounds for density estimation of Gaussian and Poisson mixtures. △ Less

Submitted 10 September, 2021; v1 submitted 8 September, 2021; originally announced September 2021.

arXiv:2106.04018 [pdf, other]

Intrinsic Dimension Estimation Using Wasserstein Distances

Authors: Adam Block, Zeyu Jia, Yury Polyanskiy, Alexander Rakhlin

Abstract: It has long been thought that high-dimensional data encountered in many practical machine learning tasks have low-dimensional structure, i.e., the manifold hypothesis holds. A natural question, thus, is to estimate the intrinsic dimension of a given population distribution from a finite sample. We introduce a new estimator of the intrinsic dimension and provide finite sample, non-asymptotic guaran… ▽ More It has long been thought that high-dimensional data encountered in many practical machine learning tasks have low-dimensional structure, i.e., the manifold hypothesis holds. A natural question, thus, is to estimate the intrinsic dimension of a given population distribution from a finite sample. We introduce a new estimator of the intrinsic dimension and provide finite sample, non-asymptotic guarantees. We then apply our techniques to get new sample complexity bounds for Generative Adversarial Networks (GANs) depending only on the intrinsic dimension of the data. △ Less

Submitted 31 May, 2022; v1 submitted 7 June, 2021; originally announced June 2021.

arXiv:2102.00050 [pdf, ps, other]

Sequential prediction under log-loss and misspecification

Authors: Meir Feder, Yury Polyanskiy

Abstract: We consider the question of sequential prediction under the log-loss in terms of cumulative regret. Namely, given a hypothesis class of distributions, learner sequentially predicts the (distribution of the) next letter in sequence and its performance is compared to the baseline of the best constant predictor from the hypothesis class. The well-specified case corresponds to an additional assumption… ▽ More We consider the question of sequential prediction under the log-loss in terms of cumulative regret. Namely, given a hypothesis class of distributions, learner sequentially predicts the (distribution of the) next letter in sequence and its performance is compared to the baseline of the best constant predictor from the hypothesis class. The well-specified case corresponds to an additional assumption that the data-generating distribution belongs to the hypothesis class as well. Here we present results in the more general misspecified case. Due to special properties of the log-loss, the same problem arises in the context of competitive-optimality in density estimation, and model selection. For the $d$-dimensional Gaussian location hypothesis class, we show that cumulative regrets in the well-specified and misspecified cases asymptotically coincide. In other words, we provide an $o(1)$ characterization of the distribution-free (or PAC) regret in this case -- the first such result as far as we know. We recall that the worst-case (or individual-sequence) regret in this case is larger by an additive constant ${d\over 2} + o(1)$. Surprisingly, neither the traditional Bayesian estimators, nor the Shtarkov's normalized maximum likelihood achieve the PAC regret and our estimator requires special "robustification" against heavy-tailed data. In addition, we show two general results for misspecified regret: the existence and uniqueness of the optimal estimator, and the bound sandwiching the misspecified regret between well-specified regrets with (asymptotically) close hypotheses classes. △ Less

Submitted 15 September, 2021; v1 submitted 29 January, 2021; originally announced February 2021.

arXiv:2101.12601 [pdf, other]

Stochastic block model entropy and broadcasting on trees with survey

Authors: Emmanuel Abbe, Elisabetta Cornacchia, Yuzhou Gu, Yury Polyanskiy

Abstract: The limit of the entropy in the stochastic block model (SBM) has been characterized in the sparse regime for the special case of disassortative communities [COKPZ17] and for the classical case of assortative communities but in the dense regime [DAM16]. The problem has not been closed in the classical sparse and assortative case. This paper establishes the result in this case for any SNR besides fo… ▽ More The limit of the entropy in the stochastic block model (SBM) has been characterized in the sparse regime for the special case of disassortative communities [COKPZ17] and for the classical case of assortative communities but in the dense regime [DAM16]. The problem has not been closed in the classical sparse and assortative case. This paper establishes the result in this case for any SNR besides for the interval (1, 3.513). It further gives an approximation to the limit in this window. The result is obtained by expressing the global SBM entropy as an integral of local tree entropies in a broadcasting on tree model with erasure side-information. The main technical advancement then relies on showing the irrelevance of the boundary in such a model, also studied with variants in [KMS16], [MNS16] and [MX15]. In particular, we establish the uniqueness of the BP fixed point in the survey model for any SNR above 3.513 or below 1. This only leaves a narrow region in the plane between SNR and survey strength where the uniqueness of BP conjectured in these papers remains unproved. △ Less

Submitted 29 January, 2021; originally announced January 2021.

arXiv:2010.01987 [pdf, ps, other]

Strong data processing constant is achieved by binary inputs

Authors: Or Ordentlich, Yury Polyanskiy

Abstract: For any channel $P_{Y|X}$ the strong data processing constant is defined as the smallest number $η_{KL}\in[0,1]$ such that $I(U;Y)\le η_{KL} I(U;X)$ holds for any Markov chain $U-X-Y$. It is shown that the value of $η_{KL}$ is given by that of the best binary-input subchannel of $P_{Y|X}$. The same result holds for any $f$-divergence, verifying a conjecture of Cohen, Kemperman and Zbaganu (1998). For any channel $P_{Y|X}$ the strong data processing constant is defined as the smallest number $η_{KL}\in[0,1]$ such that $I(U;Y)\le η_{KL} I(U;X)$ holds for any Markov chain $U-X-Y$. It is shown that the value of $η_{KL}$ is given by that of the best binary-input subchannel of $P_{Y|X}$. The same result holds for any $f$-divergence, verifying a conjecture of Cohen, Kemperman and Zbaganu (1998). △ Less

Submitted 3 June, 2021; v1 submitted 15 September, 2020; originally announced October 2020.

Comments: 1 page

arXiv:2010.01390 [pdf, other]

doi 10.1109/TIT.2022.3177667

Broadcasting on Two-Dimensional Regular Grids

Authors: Anuran Makur, Elchanan Mossel, Yury Polyanskiy

Abstract: We study a specialization of the problem of broadcasting on directed acyclic graphs, namely, broadcasting on 2D regular grids. Consider a 2D regular grid with source vertex $X$ at layer $0$ and $k+1$ vertices at layer $k\geq 1$, which are at distance $k$ from $X$. Every vertex of the 2D regular grid has outdegree $2$, the vertices at the boundary have indegree $1$, and all other vertices have inde… ▽ More We study a specialization of the problem of broadcasting on directed acyclic graphs, namely, broadcasting on 2D regular grids. Consider a 2D regular grid with source vertex $X$ at layer $0$ and $k+1$ vertices at layer $k\geq 1$, which are at distance $k$ from $X$. Every vertex of the 2D regular grid has outdegree $2$, the vertices at the boundary have indegree $1$, and all other vertices have indegree $2$. At time $0$, $X$ is given a random bit. At time $k\geq 1$, each vertex in layer $k$ receives transmitted bits from its parents in layer $k-1$, where the bits pass through binary symmetric channels with noise level $δ\in(0,1/2)$. Then, each vertex combines its received bits using a common Boolean processing function to produce an output bit. The objective is to recover $X$ with probability of error better than $1/2$ from all vertices at layer $k$ as $k \rightarrow \infty$. Besides their natural interpretation in communication networks, such broadcasting processes can be construed as 1D probabilistic cellular automata (PCA) with boundary conditions that limit the number of sites at each time $k$ to $k+1$. We conjecture that it is impossible to propagate information in a 2D regular grid regardless of the noise level and the choice of processing function. In this paper, we make progress towards establishing this conjecture, and prove using ideas from percolation and coding theory that recovery of $X$ is impossible for any $δ$ provided that all vertices use either AND or XOR processing functions. Furthermore, we propose a martingale-based approach that establishes the impossibility of recovering $X$ for any $δ$ when all NAND processing functions are used if certain supermartingales can be rigorously constructed. We also provide numerical evidence for the existence of these supermartingales by computing explicit examples for different values of $δ$ via linear programming. △ Less

Submitted 16 September, 2022; v1 submitted 3 October, 2020; originally announced October 2020.

Comments: 38 pages, double column format, 2 figures

Journal ref: IEEE Transactions on Information Theory, vol. 68, no. 10, Oct. 2022

arXiv:2008.13372 [pdf, other]

Note on approximating the Laplace transform of a Gaussian on a complex disk

Authors: Yury Polyanskiy, Yihong Wu

Abstract: In this short note we study how well a Gaussian distribution can be approximated by distributions supported on $[-a,a]$. Perhaps, the natural conjecture is that for large $a$ the almost optimal choice is given by truncating the Gaussian to $[-a,a]$. Indeed, such approximation achieves the optimal rate of $e^{-Θ(a^2)}$ in terms of the $L_\infty$-distance between characteristic functions. However, i… ▽ More In this short note we study how well a Gaussian distribution can be approximated by distributions supported on $[-a,a]$. Perhaps, the natural conjecture is that for large $a$ the almost optimal choice is given by truncating the Gaussian to $[-a,a]$. Indeed, such approximation achieves the optimal rate of $e^{-Θ(a^2)}$ in terms of the $L_\infty$-distance between characteristic functions. However, if we consider the $L_\infty$-distance between Laplace transforms on a complex disk, the optimal rate is $e^{-Θ(a^2 \log a)}$, while truncation still only attains $e^{-Θ(a^2)}$. The optimal rate can be attained by the Gauss-Hermite quadrature. As corollary, we also construct a ``super-flat'' Gaussian mixture of $Θ(a^2)$ components with means in $[-a,a]$ and whose density has all derivatives bounded by $e^{-Ω(a^2 \log(a))}$ in the $O(1)$-neighborhood of the origin. △ Less

Submitted 31 August, 2020; originally announced August 2020.

arXiv:2008.08244 [pdf, ps, other]

Self-regularizing Property of Nonparametric Maximum Likelihood Estimator in Mixture Models

Authors: Yury Polyanskiy, Yihong Wu

Abstract: Introduced by Kiefer and Wolfowitz \cite{KW56}, the nonparametric maximum likelihood estimator (NPMLE) is a widely used methodology for learning mixture odels and empirical Bayes estimation. Sidestepping the non-convexity in mixture likelihood, the NPMLE estimates the mixing distribution by maximizing the total likelihood over the space of probability measures, which can be viewed as an extreme fo… ▽ More Introduced by Kiefer and Wolfowitz \cite{KW56}, the nonparametric maximum likelihood estimator (NPMLE) is a widely used methodology for learning mixture odels and empirical Bayes estimation. Sidestepping the non-convexity in mixture likelihood, the NPMLE estimates the mixing distribution by maximizing the total likelihood over the space of probability measures, which can be viewed as an extreme form of overparameterization. In this paper we discover a surprising property of the NPMLE solution. Consider, for example, a Gaussian mixture model on the real line with a subgaussian mixing distribution. Leveraging complex-analytic techniques, we show that with high probability the NPMLE based on a sample of size $n$ has $O(\log n)$ atoms (mass points), significantly improving the deterministic upper bound of $n$ due to Lindsay \cite{lindsay1983geometry1}. Notably, any such Gaussian mixture is statistically indistinguishable from a finite one with $O(\log n)$ components (and this is tight for certain mixtures). Thus, absent any explicit form of model selection, NPMLE automatically chooses the right model complexity, a property we term \emph{self-regularization}. Extensions to other exponential families are given. As a statistical application, we show that this structural property can be harnessed to bootstrap existing Hellinger risk bound of the (parametric) MLE for finite Gaussian mixtures to the NPMLE for general Gaussian mixtures, recovering a result of Zhang \cite{zhang2009generalized}. △ Less

Submitted 6 September, 2020; v1 submitted 18 August, 2020; originally announced August 2020.

Comments: Refer conjecture of [DYPS20] in Sec 5.4 to the arxiv version arXiv:1901.03264v4

arXiv:2005.10561 [pdf, ps, other]

Extrapolating the profile of a finite population

Authors: Soham Jana, Yury Polyanskiy, Yihong Wu

Abstract: We study a prototypical problem in empirical Bayes. Namely, consider a population consisting of $k$ individuals each belonging to one of $k$ types (some types can be empty). Without any structural restrictions, it is impossible to learn the composition of the full population having observed only a small (random) subsample of size $m = o(k)$. Nevertheless, we show that in the sublinear regime of… ▽ More We study a prototypical problem in empirical Bayes. Namely, consider a population consisting of $k$ individuals each belonging to one of $k$ types (some types can be empty). Without any structural restrictions, it is impossible to learn the composition of the full population having observed only a small (random) subsample of size $m = o(k)$. Nevertheless, we show that in the sublinear regime of $m =ω(k/\log k)$, it is possible to consistently estimate in total variation the \emph{profile} of the population, defined as the empirical distribution of the sizes of each type, which determines many symmetric properties of the population. We also prove that in the linear regime of $m=c k$ for any constant $c$ the optimal rate is $Θ(1/\log k)$. Our estimator is based on Wolfowitz's minimum distance method, which entails solving a linear program (LP) of size $k$. We show that there is a single infinite-dimensional LP whose value simultaneously characterizes the risk of the minimum distance estimator and certifies its minimax optimality. The sharp convergence rate is obtained by evaluating this LP using complex-analytic techniques. △ Less

Submitted 21 May, 2020; originally announced May 2020.

arXiv:2005.07801 [pdf, ps, other]

Broadcasting on trees near criticality

Authors: Yuzhou Gu, Hajir Roozbehani, Yury Polyanskiy

Abstract: We revisit the problem of broadcasting on $d$-ary trees: starting from a Bernoulli$(1/2)$ random variable $X_0$ at a root vertex, each vertex forwards its value across binary symmetric channels $\mathrm{BSC}_δ$ to $d$ descendants. The goal is to reconstruct $X_0$ given the vector $X_{L_h}$ of values of all variables at depth $h$. It is well known that reconstruction (better than a random guess) is… ▽ More We revisit the problem of broadcasting on $d$-ary trees: starting from a Bernoulli$(1/2)$ random variable $X_0$ at a root vertex, each vertex forwards its value across binary symmetric channels $\mathrm{BSC}_δ$ to $d$ descendants. The goal is to reconstruct $X_0$ given the vector $X_{L_h}$ of values of all variables at depth $h$. It is well known that reconstruction (better than a random guess) is possible as $h\to \infty$ if and only if $δ< δ_c(d)$. In this paper, we study the behavior of the mutual information and the probability of error when $δ$ is slightly subcritical. The innovation of our work is application of the recently introduced "less-noisy" channel comparison techniques. For example, we are able to derive the positive part of the phase transition (reconstructability when $δ<δ_c$) using purely information-theoretic ideas. This is in contrast with previous derivations, which explicitly analyze distribution of the Hamming weight of $X_{L_h}$ (a so-called Kesten-Stigum bound). △ Less

Submitted 15 May, 2020; originally announced May 2020.

Comments: Accepted to 2020 IEEE International Symposium on Information Theory (ISIT 2020)

arXiv:2005.05444 [pdf, other]

Non-linear Log-Sobolev inequalities for the Potts semigroup and applications to reconstruction problems

Authors: Yuzhou Gu, Yury Polyanskiy

Abstract: Consider the semigroup of random walk on a complete graph, which we call the Potts semigroup. Diaconis and Saloff-Coste computed the maximum of the ratio of the relative entropy and the Dirichlet form obtaining the constant $α_2$ in the $2$-log-Sobolev inequality ($2$-LSI). In this paper, we obtain the best possible non-linear inequality relating entropy and the Dirichlet form (i.e., $p$-NLSI,… ▽ More Consider the semigroup of random walk on a complete graph, which we call the Potts semigroup. Diaconis and Saloff-Coste computed the maximum of the ratio of the relative entropy and the Dirichlet form obtaining the constant $α_2$ in the $2$-log-Sobolev inequality ($2$-LSI). In this paper, we obtain the best possible non-linear inequality relating entropy and the Dirichlet form (i.e., $p$-NLSI, $p\ge1$). As an example, we show $α_1 = 1+\frac{1+o(1)}{\log k}$. By integrating the $1$-NLSI we obtain the new strong data processing inequality (SDPI), which in turn allows us to improve results of Mossel and Peres on reconstruction thresholds for Potts models on trees. A special case is the problem of reconstructing color of the root of a $k$-colored tree given knowledge of colors of all the leaves. We show that to have a non-trivial reconstruction probability the branching number of the tree should be at least $$\frac{\log k}{\log k - \log(k-1)} = (1-o(1))k\log k.$$ This recovers previous results (of Sly and Bhatnagar et al.) in (slightly) more generality, but more importantly avoids the need for any coloring-specialized arguments. Similarly, we improve the state-of-the-art on the weak recovery threshold for the stochastic block model with $k$ balanced groups, for all $k\ge 3$. To further show the power of our method, we prove optimal non-reconstruction results for a broadcasting on trees model with Gaussian kernels, closing a gap left open by Eldan et al. These improvements advocate information-theoretic methods as a useful complement to the conventional techniques originating from the statistical physics. △ Less

Submitted 2 December, 2022; v1 submitted 11 May, 2020; originally announced May 2020.

arXiv:2004.14941 [pdf, other]

The Information Bottleneck Problem and Its Applications in Machine Learning

Authors: Ziv Goldfeld, Yury Polyanskiy

Abstract: Inference capabilities of machine learning (ML) systems skyrocketed in recent years, now playing a pivotal role in various aspect of society. The goal in statistical learning is to use data to obtain simple algorithms for predicting a random variable $Y$ from a correlated observation $X$. Since the dimension of $X$ is typically huge, computationally feasible solutions should summarize it into a lo… ▽ More Inference capabilities of machine learning (ML) systems skyrocketed in recent years, now playing a pivotal role in various aspect of society. The goal in statistical learning is to use data to obtain simple algorithms for predicting a random variable $Y$ from a correlated observation $X$. Since the dimension of $X$ is typically huge, computationally feasible solutions should summarize it into a lower-dimensional feature vector $T$, from which $Y$ is predicted. The algorithm will successfully make the prediction if $T$ is a good proxy of $Y$, despite the said dimensionality-reduction. A myriad of ML algorithms (mostly employing deep learning (DL)) for finding such representations $T$ based on real-world data are now available. While these methods are often effective in practice, their success is hindered by the lack of a comprehensive theory to explain it. The information bottleneck (IB) theory recently emerged as a bold information-theoretic paradigm for analyzing DL systems. Adopting mutual information as the figure of merit, it suggests that the best representation $T$ should be maximally informative about $Y$ while minimizing the mutual information with $X$. In this tutorial we survey the information-theoretic origins of this abstract principle, and its recent impact on DL. For the latter, we cover implications of the IB problem on DL theory, as well as practical algorithms inspired by it. Our goal is to provide a unified and cohesive description. A clear view of current knowledge is particularly important for further leveraging IB and other information-theoretic ideas to study DL models. △ Less

Submitted 1 May, 2020; v1 submitted 30 April, 2020; originally announced April 2020.

arXiv:1911.12263 [pdf, ps, other]

Low density majority codes and the problem of graceful degradation

Authors: Hajir Roozbehani, Yury Polyanskiy

Abstract: We study a problem of constructing codes that transform a channel with high bit error rate (BER) into one with low BER (at the expense of rate). Our focus is on obtaining codes with smooth ("graceful'') input-output BER curves (as opposed to threshold-like curves typical for long error-correcting codes). This paper restricts attention to binary erasure channels (BEC) and contains three contribut… ▽ More We study a problem of constructing codes that transform a channel with high bit error rate (BER) into one with low BER (at the expense of rate). Our focus is on obtaining codes with smooth ("graceful'') input-output BER curves (as opposed to threshold-like curves typical for long error-correcting codes). This paper restricts attention to binary erasure channels (BEC) and contains three contributions. First, we introduce the notion of Low Density Majority Codes (LDMCs). These codes are non-linear sparse-graph codes, which output majority function evaluated on randomly chosen small subsets of the data bits. This is similar to Low Density Generator Matrix codes (LDGMs), except that the XOR function is replaced with the majority. We show that even with a few iterations of belief propagation (BP) the attained input-output curves provably improve upon performance of any linear systematic code. The effect of non-linearity bootstraping the initial iterations of BP, suggests that LDMCs should improve performance in various applications, where LDGMs have been used traditionally. Second, we establish several \textit{two-point converse bounds} that lower bound the BER achievable at one erasure probability as a function of BER achieved at another one. The novel nature of our bounds is that they are specific to subclasses of codes (linear systematic and non-linear systematic) and outperform similar bounds implied by the area theorem for the EXIT function. △ Less

Submitted 25 November, 2019; originally announced November 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1906.09812

arXiv:1910.12678 [pdf, ps, other]

Massive Access for Future Wireless Communication Systems

Authors: Yongpeng Wu, Xiqi Gao, Shidong Zhou, Wei Yang, Yury Polyanskiy, Giuseppe Caire

Abstract: Multiple access technology played an important role in wireless communication in the last decades: it increases the capacity of the channel and allows different users to access the system simultaneously. However, the conventional multiple access technology, as originally designed for current human-centric wireless networks, is not scalable for future machine-centric wireless networks. Massive ac… ▽ More Multiple access technology played an important role in wireless communication in the last decades: it increases the capacity of the channel and allows different users to access the system simultaneously. However, the conventional multiple access technology, as originally designed for current human-centric wireless networks, is not scalable for future machine-centric wireless networks. Massive access (studied in the literature under such names as massive-device multiple access, unsourced massive random access, massive connectivity, massive machine-type communication, and many-access channels) exhibits a clean break with current networks by potentially supporting millions of devices in each cellular network. The tremendous growth in the number of connected devices requires a fundamental rethinking of the conventional multiple access technologies in favor of new schemes suited for massive random access. Among the many new challenges arising in this setting, the most relevant are: the fundamental limits of communication from a massive number of bursty devices transmitting simultaneously with short packets, the design of low complexity and energy-efficient massive access coding and communication schemes, efficient methods for the detection of a relatively small number of active users among a large number of potential user devices with sporadic transmission pattern, and the integration of massive access with massive MIMO and other important wireless communication technologies. This paper presents an overview of the concept of massive access wireless communication and of the contemporary research on this important topic. △ Less

Submitted 25 February, 2020; v1 submitted 28 October, 2019; originally announced October 2019.

Comments: A short version has been accepted by IEEE Wireless Communications

arXiv:1909.01221 [pdf, ps, other]

A Note on the Probability of Rectangles for Correlated Binary Strings

Authors: Or Ordentlich, Yury Polyanskiy, Ofer Shayevitz

Abstract: Consider two sequences of $n$ independent and identically distributed fair coin tosses, $X=(X_1,\ldots,X_n)$ and $Y=(Y_1,\ldots,Y_n)$, which are $ρ$-correlated for each $j$, i.e. $\mathbb{P}[X_j=Y_j] = {1+ρ\over 2}$. We study the question of how large (small) the probability $\mathbb{P}[X \in A, Y\in B]$ can be among all sets $A,B\subset\{0,1\}^n$ of a given cardinality. For sets… ▽ More Consider two sequences of $n$ independent and identically distributed fair coin tosses, $X=(X_1,\ldots,X_n)$ and $Y=(Y_1,\ldots,Y_n)$, which are $ρ$-correlated for each $j$, i.e. $\mathbb{P}[X_j=Y_j] = {1+ρ\over 2}$. We study the question of how large (small) the probability $\mathbb{P}[X \in A, Y\in B]$ can be among all sets $A,B\subset\{0,1\}^n$ of a given cardinality. For sets $|A|,|B| = Θ(2^n)$ it is well known that the largest (smallest) probability is approximately attained by concentric (anti-concentric) Hamming balls, and this can be proved via the hypercontractive inequality (reverse hypercontractivity). Here we consider the case of $|A|,|B| = 2^{Θ(n)}$. By applying a recent extension of the hypercontractive inequality of Polyanskiy-Samorodnitsky (J. Functional Analysis, 2019), we show that Hamming balls of the same size approximately maximize $\mathbb{P}[X \in A, Y\in B]$ in the regime of $ρ\to 1$. We also prove a similar tight lower bound, i.e. show that for $ρ\to 0$ the pair of opposite Hamming balls approximately minimizes the probability $\mathbb{P}[X \in A, Y\in B]$. △ Less

Submitted 18 August, 2020; v1 submitted 3 September, 2019; originally announced September 2019.

arXiv:1907.09448 [pdf, ps, other]

Energy efficient coded random access for the wireless uplink

Authors: Suhas S Kowshik, Kirill Andreev, Alexey Frolov, Yury Polyanskiy

Abstract: We discuss the problem of designing channel access architectures for enabling fast, low-latency, grant-free and uncoordinated uplink for densely packed wireless nodes. Specifically, we study random-access codes, previously introduced for the AWGN multiple-access channel (MAC) by Polyanskiy'2017, in the practically more relevant case of users subject to Rayleigh fading, when channel gains are unkno… ▽ More We discuss the problem of designing channel access architectures for enabling fast, low-latency, grant-free and uncoordinated uplink for densely packed wireless nodes. Specifically, we study random-access codes, previously introduced for the AWGN multiple-access channel (MAC) by Polyanskiy'2017, in the practically more relevant case of users subject to Rayleigh fading, when channel gains are unknown to the decoder. We propose a random coding achievability bound, which we analyze both non-asymptotically (at finite blocklength) and asymptotically. As a candidate practical solution, we propose an explicit sparse-graph based coding scheme together with an alternating belief-propagation decoder. The latter's performance is found to be surprisingly close to the finite-blocklength bounds. Our main findings are twofold. First, just like in the AWGN MAC we see that jointly decoding large number of users leads to a surprising phase transition effect, where at spectral efficiencies below a critical threshold (5-15 bps/Hz depending on reliability) a perfect multi-user interference cancellation is possible. Second, while the presence of Rayleigh fading significantly increases the minimal required energy-per-bit $E_b/N_0$ (from about 0-2 dB to about 8-11 dB), the inherent randomization introduced by the channel makes it much easier to attain the optimal performance via iterative schemes. In all, it is hoped that a principled definition of the random-access model together with our information-theoretic analysis will open the road towards unified benchmarking and comparison performance of various random-access solutions, such as the currently discussed candidates (MUSA, SCMA, RSMA) for the 5G/6G. △ Less

Submitted 22 July, 2019; originally announced July 2019.

Comments: 26 pages

arXiv:1906.09812

On Low Density Majority Codes

Authors: Hajir Roozbehani, Yury Polyanskiy

Abstract: We study a problem of constructing codes that transform a channel with high bit error rate (BER) into one with low BER (at the expense of rate). Our focus is on obtaining codes with smooth ("graceful") input-output BER curves (as opposed to threshold-like curves typical for long error-correcting codes). To that end we introduce the notion of Low Density Majority Codes (LDMCs). These codes are non-… ▽ More We study a problem of constructing codes that transform a channel with high bit error rate (BER) into one with low BER (at the expense of rate). Our focus is on obtaining codes with smooth ("graceful") input-output BER curves (as opposed to threshold-like curves typical for long error-correcting codes). To that end we introduce the notion of Low Density Majority Codes (LDMCs). These codes are non-linear sparse-graph codes, which output majority function evaluated on randomly chosen small subsets of the data bits. This is similar to Low Density Generator Matrix codes (LDGMs), except that the XOR function is replaced with the majority. We show that even with a few iterations of belief propagation (BP) the attained input-output curves provably improve upon performance of any linear systematic code. The effect of non-linearity bootstraping the initial iterations of BP, suggests that LDMCs should improve performance in various applications, where LDGMs have been used traditionally (e.g., pre-coding for optics, tornado raptor codes, protograph constructions). As a side result of separate interest we establish a lower (impossibility) bound for the achievable BER of a systematic linear code at one value of erasure noise given its BER at another value. We show that this new bound is superior to the results inferred from the area theorem for EXIT functions. △ Less

Submitted 1 December, 2019; v1 submitted 24 June, 2019; originally announced June 2019.

Comments: This article has been replaced by the updated version arXiv:1911.12263

arXiv:1905.13576 [pdf, other]

Convergence of Smoothed Empirical Measures with Applications to Entropy Estimation

Authors: Ziv Goldfeld, Kristjan Greenewald, Yury Polyanskiy, Jonathan Weed

Abstract: This paper studies convergence of empirical measures smoothed by a Gaussian kernel. Specifically, consider approximating $P\ast\mathcal{N}_σ$, for $\mathcal{N}_σ\triangleq\mathcal{N}(0,σ^2 \mathrm{I}_d)$, by $\hat{P}_n\ast\mathcal{N}_σ$, where $\hat{P}_n$ is the empirical measure, under different statistical distances. The convergence is examined in terms of the Wasserstein distance, total variati… ▽ More This paper studies convergence of empirical measures smoothed by a Gaussian kernel. Specifically, consider approximating $P\ast\mathcal{N}_σ$, for $\mathcal{N}_σ\triangleq\mathcal{N}(0,σ^2 \mathrm{I}_d)$, by $\hat{P}_n\ast\mathcal{N}_σ$, where $\hat{P}_n$ is the empirical measure, under different statistical distances. The convergence is examined in terms of the Wasserstein distance, total variation (TV), Kullback-Leibler (KL) divergence, and $χ^2$-divergence. We show that the approximation error under the TV distance and 1-Wasserstein distance ($\mathsf{W}_1$) converges at rate $e^{O(d)}n^{-\frac{1}{2}}$ in remarkable contrast to a typical $n^{-\frac{1}{d}}$ rate for unsmoothed $\mathsf{W}_1$ (and $d\ge 3$). For the KL divergence, squared 2-Wasserstein distance ($\mathsf{W}_2^2$), and $χ^2$-divergence, the convergence rate is $e^{O(d)}n^{-1}$, but only if $P$ achieves finite input-output $χ^2$ mutual information across the additive white Gaussian noise channel. If the latter condition is not met, the rate changes to $ω(n^{-1})$ for the KL divergence and $\mathsf{W}_2^2$, while the $χ^2$-divergence becomes infinite - a curious dichotomy. As a main application we consider estimating the differential entropy $h(P\ast\mathcal{N}_σ)$ in the high-dimensional regime. The distribution $P$ is unknown but $n$ i.i.d samples from it are available. We first show that any good estimator of $h(P\ast\mathcal{N}_σ)$ must have sample complexity that is exponential in $d$. Using the empirical approximation results we then show that the absolute-error risk of the plug-in estimator converges at the parametric rate $e^{O(d)}n^{-\frac{1}{2}}$, thus establishing the minimax rate-optimality of the plug-in. Numerical results that demonstrate a significant empirical superiority of the plug-in approach to general-purpose differential entropy estimators are provided. △ Less

Submitted 1 May, 2020; v1 submitted 30 May, 2019; originally announced May 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1810.11589

Showing 1–50 of 98 results for author: Polyanskiy, Y