Gradient Flows and Riemannian Structure in the Gromov-Wasserstein Geometry

Authors: Zhengxin Zhang, Ziv Goldfeld, Kristjan Greenewald, Youssef Mroueh, Bharath K. Sriperumbudur

Abstract: The Wasserstein space of probability measures is known for its intricate Riemannian structure, which underpins the Wasserstein geometry and enables gradient flow algorithms. However, the Wasserstein geometry may not be suitable for certain tasks or data modalities. Motivated by scenarios where the global structure of the data needs to be preserved, this work initiates the study of gradient flows and Riemannian structure in the Gromov-Wasserstein (GW) geometry, which is particularly suited for such purposes. We focus on the inner product GW (IGW) distance between distributions on $\mathbb{R}^d$. Given a functional $\mathsf{F}:\mathcal{P}_2(\mathbb{R}^d)\to\mathbb{R}$ to optimize, we present an implicit IGW minimizing movement scheme that generates a sequence of distributions $\{\rho_i\}_{i=0}^n$, which are close in IGW and aligned in the 2-Wasserstein sense. Taking the time step to zero, we prove that the discrete solution converges to an IGW generalized minimizing movement (GMM) $(\rho_t)_t$ that follows the continuity equation with a velocity field $v_t\in L^2(\rho_t;\mathbb{R}^d)$, specified by a global transformation of the Wasserstein gradient of $\mathsf{F}$. The transformation is given by a mobility operator that modifies the Wasserstein gradient to encode not only local information, but also global structure. Our gradient flow analysis leads us to identify the Riemannian structure that gives rise to the intrinsic IGW geometry, using which we establish a Benamou-Brenier-like formula for IGW. We conclude with a formal derivation, akin to the Otto calculus, of the IGW gradient as the inverse mobility acting on the Wasserstein gradient. Numerical experiments validating our theory and demonstrating the global nature of IGW interpolations are provided.

Submitted 16 July, 2024; originally announced July 2024.

Comments: 73 pages

arXiv:2406.06509 [pdf, ps, other]

Robust Distribution Learning with Local and Global Adversarial Corruptions

Authors: Sloan Nietert, Ziv Goldfeld, Soroosh Shafiee

Abstract: We consider learning in an adversarial environment, where an $\varepsilon$-fraction of samples from a distribution $P$ are arbitrarily modified (global corruptions) and the remaining perturbations have average magnitude bounded by $\rho$ (local corruptions). Given access to $n$ such corrupted samples, we seek a computationally efficient estimator $\hat{P}_n$ that minimizes the Wasserstein distance $\mathsf{W}_1(\hat{P}_n,P)$. In fact, we attack the fine-grained task of minimizing $\mathsf{W}_1(\Pi_\# \hat{P}_n, \Pi_\# P)$ for all orthogonal projections $\Pi\in \mathbb{R}^{d \times d}$, with performance scaling with $\mathrm{rank}(\Pi) = k$. This allows us to account simultaneously for mean estimation ($k=1$), distribution estimation ($k=d$), as well as the settings interpolating between these two extremes. We characterize the optimal population-limit risk for this task and then develop an efficient finite-sample algorithm with error bounded by $\sqrt{\varepsilon k} + \rho + \tilde{O}(d\sqrt{k}n^{-1/(k \lor 2)})$ when $P$ has bounded covariance. This guarantee holds uniformly in $k$ and is minimax optimal up to the sub-optimality of the plug-in estimator when $\rho = \varepsilon = 0$. Our efficient procedure relies on a novel trace norm approximation of an ideal yet intractable 2-Wasserstein projection estimator. We apply this algorithm to robust stochastic optimization, and, in the process, uncover a new method for overcoming the curse of dimensionality in Wasserstein distributionally robust optimization.

Submitted 24 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted for presentation at the Conference on Learning Theory (COLT) 2024

arXiv:2405.06734 [pdf, other]

Neural Estimation Of Entropic Optimal Transport

Authors: Tao Wang, Ziv Goldfeld

Abstract: Optimal transport (OT) serves as a natural framework for comparing probability measures, with applications in statistics, machine learning, and applied mathematics. Alas, statistical estimation and exact computation of the OT distances suffer from the curse of dimensionality. To circumvent these issues, entropic regularization has emerged as a remedy that enables parametric estimation rates via plug-in and efficient computation using Sinkhorn iterations. Motivated by further scaling up entropic OT (EOT) to data dimensions and sample sizes that appear in modern machine learning applications, we propose a novel neural estimation approach. Our estimator parametrizes a semi-dual representation of the EOT distance by a neural network, approximates expectations by sample means, and optimizes the resulting empirical objective over parameter space. We establish non-asymptotic error bounds on the EOT neural estimator of the cost and optimal plan. Our bounds characterize the effective error in terms of neural network size and the number of samples, revealing optimal scaling laws that guarantee parametric convergence. The bounds hold for compactly supported distributions and imply that the proposed estimator is minimax-rate optimal over that class. Numerical experiments validating our theory are also provided.

Submitted 10 May, 2024; originally announced May 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2312.07397
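
Illustration (not from the paper): the abstract above refers to Sinkhorn iterations as the standard way to compute entropic OT. The following is a minimal pure-Python sketch of that baseline for two small discrete distributions, not the neural semi-dual estimator the paper proposes. The function name `sinkhorn_plan`, the regularization value `eps=0.5`, the iteration count, and the toy data are all assumptions chosen for illustration.

```python
import math

def sinkhorn_plan(a, b, cost, eps=0.5, n_iters=500):
    """Entropic OT plan between discrete weights a and b via Sinkhorn iterations."""
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / eps)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale so the column sums match b, then the row sums match a
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
    # Plan P_ij = u_i * K_ij * v_j
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy example (hypothetical data): three source points, two target points on the line
xs = [0.0, 0.5, 1.0]
ys = [0.2, 0.9]
a = [1 / 3, 1 / 3, 1 / 3]
b = [0.5, 0.5]
cost = [[(x - y) ** 2 for y in ys] for x in xs]

plan = sinkhorn_plan(a, b, cost)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(len(xs))) for j in range(len(ys))]
print(row_sums, col_sums)
```

After enough iterations the row and column sums of the returned plan should agree with `a` and `b` up to numerical tolerance. For small `eps` a log-domain implementation would be the safer choice, since the kernel entries can underflow.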

arXiv:2404.03176 [pdf, other]

Information-Theoretic Generalization Bounds for Deep Neural Networks

Authors: Haiyun He, Christina Lee Yu, Ziv Goldfeld

Abstract: Deep neural networks (DNNs) exhibit an exceptional capacity for generalization in practical applications. This work aims to capture the effect and benefits of depth for supervised learning via information-theoretic generalization bounds. We first derive two hierarchical bounds on the generalization error in terms of the Kullback-Leibler (KL) divergence or the 1-Wasserstein distance between the train and test distributions of the network internal representations. The KL divergence bound shrinks as the layer index increases, while the Wasserstein bound implies the existence of a layer that serves as a generalization funnel, which attains a minimal 1-Wasserstein distance. Analytic expressions for both bounds are derived under the setting of binary Gaussian classification with linear DNNs. To quantify the contraction of the relevant information measures when moving deeper into the network, we analyze the strong data processing inequality (SDPI) coefficient between consecutive layers of three regularized DNN models: Dropout, DropConnect, and Gaussian noise injection. This enables refining our generalization bounds to capture the contraction as a function of the network architecture parameters. Specializing our results to DNNs with a finite parameter space and the Gibbs algorithm reveals that deeper yet narrower network architectures generalize better in those examples, although how broadly this statement applies remains a question.

Submitted 3 April, 2024; originally announced April 2024.

Comments: 25 pages, 5 figures

arXiv:2312.07397 [pdf, other]

Neural Entropic Gromov-Wasserstein Alignment

Authors: Tao Wang, Ziv Goldfeld

Abstract: The Gromov-Wasserstein (GW) distance, rooted in optimal transport (OT) theory, provides a natural framework for aligning heterogeneous datasets. Alas, statistical estimation of the GW distance suffers from the curse of dimensionality and its exact computation is NP-hard. To circumvent these issues, entropic regularization has emerged as a remedy that enables parametric estimation rates via plug-in and efficient computation using Sinkhorn iterations. Motivated by further scaling up entropic GW (EGW) alignment methods to data dimensions and sample sizes that appear in modern machine learning applications, we propose a novel neural estimation approach. Our estimator parametrizes a minimax semi-dual representation of the EGW distance by a neural network, approximates expectations by sample means, and optimizes the resulting empirical objective over parameter space. We establish non-asymptotic error bounds on the EGW neural estimator of the alignment cost and optimal plan. Our bounds characterize the effective error in terms of neural network (NN) size and the number of samples, revealing optimal scaling laws that guarantee parametric convergence. The bounds hold for compactly supported distributions, and imply that the proposed estimator is minimax-rate optimal over that class. Numerical experiments validating our theory are also provided.

Submitted 12 December, 2023; originally announced December 2023.

arXiv:2311.05573 [pdf, other]

Outlier-Robust Wasserstein DRO

Authors: Sloan Nietert, Ziv Goldfeld, Soroosh Shafiee

Abstract: Distributionally robust optimization (DRO) is an effective approach for data-driven decision-making in the presence of uncertainty. Geometric uncertainty due to sampling or localized perturbations of data points is captured by Wasserstein DRO (WDRO), which seeks to learn a model that performs uniformly well over a Wasserstein ball centered around the observed data distribution. However, WDRO fails to account for non-geometric perturbations such as adversarial outliers, which can greatly distort the Wasserstein distance measurement and impede the learned model. We address this gap by proposing a novel outlier-robust WDRO framework for decision-making under both geometric (Wasserstein) perturbations and non-geometric (total variation (TV)) contamination that allows an $\varepsilon$-fraction of data to be arbitrarily corrupted. We design an uncertainty set using a certain robust Wasserstein ball that accounts for both perturbation types and derive minimax optimal excess risk bounds for this procedure that explicitly capture the Wasserstein and TV risks. We prove a strong duality result that enables tractable convex reformulations and efficient computation of our outlier-robust WDRO problem. When the loss function depends only on low-dimensional features of the data, we eliminate certain dimension dependencies from the risk bounds that are unavoidable in the general setting. Finally, we present experiments validating our theory on standard regression and classification tasks.

Submitted 9 November, 2023; originally announced November 2023.

Comments: Appearing at NeurIPS 2023

arXiv:2309.16200 [pdf, other]

Max-Sliced Mutual Information

Authors: Dor Tsur, Ziv Goldfeld, Kristjan Greenewald

Abstract: Quantifying the dependence between high-dimensional random variables is central to statistical learning and inference. Two classical methods are canonical correlation analysis (CCA), which identifies maximally correlated projected versions of the original variables, and Shannon's mutual information, which is a universal dependence measure that also captures high-order dependencies. However, CCA only accounts for linear dependence, which may be insufficient for certain applications, while mutual information is often infeasible to compute/estimate in high dimensions. This work proposes a middle ground in the form of a scalable information-theoretic generalization of CCA, termed max-sliced mutual information (mSMI). mSMI equals the maximal mutual information between low-dimensional projections of the high-dimensional variables, which reduces back to CCA in the Gaussian case. It enjoys the best of both worlds: capturing intricate dependencies in the data while being amenable to fast computation and scalable estimation from samples. We show that mSMI retains favorable structural properties of Shannon's mutual information, like variational forms and identification of independence. We then study statistical estimation of mSMI, propose an efficiently computable neural estimator, and couple it with formal non-asymptotic error bounds. We present experiments that demonstrate the utility of mSMI for several tasks, encompassing independence testing, multi-view representation learning, algorithmic fairness, and generative modeling. We observe that mSMI consistently outperforms competing methods with little-to-no computational overhead.

Submitted 28 September, 2023; originally announced September 2023.

Comments: Accepted at NeurIPS 2023

arXiv:2307.01171 [pdf, other]

doi 10.1103/PhysRevA.109.032431

Quantum Neural Estimation of Entropies

Authors: Ziv Goldfeld, Dhrumil Patel, Sreejith Sreekumar, Mark M. Wilde

Abstract: Entropy measures quantify the amount of information and correlation present in a quantum system. In practice, when the quantum state is unknown and only copies thereof are available, one must resort to the estimation of such entropy measures. Here we propose a variational quantum algorithm for estimating the von Neumann and Rényi entropies, as well as the measured relative entropy and measured Rényi relative entropy. Our approach first parameterizes a variational formula for the measure of interest by a quantum circuit and a classical neural network, and then optimizes the resulting objective over parameter space. Numerical simulations of our quantum algorithm are provided, using a noiseless quantum simulator. The algorithm provides accurate estimates of the various entropy measures for the examples tested, making it a promising approach for downstream tasks.

Submitted 5 February, 2024; v1 submitted 3 July, 2023; originally announced July 2023.

Comments: 14 pages, 2 figures; see also independent works of Shin, Lee, and Jeong at arXiv:2306.14566v1 and Lee, Kwon, and Lee at arXiv:2307.13511v2

Journal ref: Physical Review A, vol. 109, no. 3, page 032431, March 2024

arXiv:2306.13054 [pdf, other]

doi 10.1109/TIT.2024.3404927

Quantum Pufferfish Privacy: A Flexible Privacy Framework for Quantum Systems

Authors: Theshani Nuradha, Ziv Goldfeld, Mark M. Wilde

Abstract: We propose a versatile privacy framework for quantum systems, termed quantum pufferfish privacy (QPP). Inspired by classical pufferfish privacy, our formulation generalizes and addresses limitations of quantum differential privacy by offering flexibility in specifying private information, feasible measurements, and domain knowledge. We show that QPP can be equivalently formulated in terms of the Datta-Leditzky information spectrum divergence, thus providing the first operational interpretation thereof. We reformulate this divergence as a semi-definite program and derive several properties of it, which are then used to prove convexity, composability, and post-processing of QPP mechanisms. Parameters that guarantee QPP of the depolarization mechanism are also derived. We analyze the privacy-utility tradeoff of general QPP mechanisms and, again, study the depolarization mechanism as an explicit instance. The QPP framework is then applied to privacy auditing for identifying privacy violations via a hypothesis testing pipeline that leverages quantum algorithms. Connections to quantum fairness and other quantum divergences are also explored and several variants of QPP are examined.

Submitted 28 May, 2024; v1 submitted 22 June, 2023; originally announced June 2023.

Comments: v2: 33 pages, 9 figures, accepted to IEEE Transactions on Information Theory

Journal ref: IEEE Transactions on Information Theory, vol. 70, no. 8, pp. 5731-5762, Aug. 2024

arXiv:2306.00182 [pdf, other]

Entropic Gromov-Wasserstein Distances: Stability and Algorithms

Authors: Gabriel Rioux, Ziv Goldfeld, Kengo Kato

Abstract: The Gromov-Wasserstein (GW) distance quantifies discrepancy between metric measure spaces and provides a natural framework for aligning heterogeneous datasets. Alas, as exact computation of GW alignment is NP-hard, entropic regularization provides an avenue towards a computationally tractable proxy. Leveraging a recently derived variational representation for the quadratic entropic GW (EGW) distance, this work derives the first efficient algorithms for solving the EGW problem subject to formal, non-asymptotic convergence guarantees. To that end, we derive smoothness and convexity properties of the objective in this variational problem, which enables its resolution by the accelerated gradient method. Our algorithms employ Sinkhorn's fixed point iterations to compute an approximate gradient, which we model as an inexact oracle. We furnish convergence rates towards local and even global solutions (the latter holds under a precise quantitative condition on the regularization parameter), characterize the effects of gradient inexactness, and prove that stationary points of the EGW problem converge towards a stationary point of the unregularized GW problem, in the limit of vanishing regularization. We provide numerical experiments that validate our theory and demonstrate the state-of-the-art empirical performance of our algorithms.

Submitted 9 January, 2024; v1 submitted 31 May, 2023; originally announced June 2023.

Comments: Version 3 of this arXiv report has been split into two parts. Version 4 of the arXiv report contains the algorithmic results of the original submission. The statistical results will appear as a separate arXiv submission.

arXiv:2303.10155 [pdf, other]

Stability and statistical inference for semidiscrete optimal transport maps

Authors: Ritwik Sadhu, Ziv Goldfeld, Kengo Kato

Abstract: We study statistical inference for the optimal transport (OT) map (also known as the Brenier map) from a known absolutely continuous reference distribution onto an unknown finitely discrete target distribution. We derive limit distributions for the $L^p$-error with arbitrary $p \in [1,\infty)$ and for linear functionals of the empirical OT map, together with their moment convergence. The former has a non-Gaussian limit, whose explicit density is derived, while the latter attains asymptotic normality. For both cases, we also establish consistency of the nonparametric bootstrap. The derivation of our limit theorems relies on new stability estimates of functionals of the OT map with respect to the dual potential vector, which may be of independent interest. We also discuss applications of our limit theorems to the construction of confidence sets for the OT map and inference for a maximum tail correlation. Finally, we show that, while the empirical OT map does not possess nontrivial weak limits in the $L^2$ space, it satisfies a central limit theorem in a dual Hölder space, and the Gaussian limit law attains the asymptotic efficiency bound.

Submitted 20 May, 2024; v1 submitted 17 March, 2023; originally announced March 2023.

Comments: 43 pages

arXiv:2302.01237 [pdf, other]

Robust Estimation under the Wasserstein Distance

Authors: Sloan Nietert, Rachel Cummings, Ziv Goldfeld

Abstract: We study the problem of robust distribution estimation under the Wasserstein distance, a popular discrepancy measure between probability distributions rooted in optimal transport (OT) theory. Given $n$ samples from an unknown distribution $\mu$, of which $\varepsilon n$ are adversarially corrupted, we seek an estimate for $\mu$ with minimal Wasserstein error. To address this task, we draw upon two frameworks from OT and robust statistics: partial OT (POT) and minimum distance estimation (MDE). We prove new structural properties for POT and use them to show that MDE under a partial Wasserstein distance achieves the minimax-optimal robust estimation risk in many settings. Along the way, we derive a novel dual form for POT that adds a sup-norm penalty to the classic Kantorovich dual for standard OT. Since the popular Wasserstein generative adversarial network (WGAN) framework implements Wasserstein MDE via Kantorovich duality, our penalized dual enables large-scale generative modeling with contaminated datasets via an elementary modification to WGAN. Numerical experiments demonstrating the efficacy of our approach in mitigating the impact of adversarial corruptions are provided.

Submitted 24 September, 2024; v1 submitted 2 February, 2023; originally announced February 2023.

arXiv:2301.00621 [pdf, ps, other]

Data-Driven Optimization of Directed Information over Discrete Alphabets

Authors: Dor Tsur, Ziv Aharoni, Ziv Goldfeld, Haim Permuter

Abstract: Directed information (DI) is a fundamental measure for the study and analysis of sequential stochastic models. In particular, when optimized over input distributions it characterizes the capacity of general communication channels. However, analytic computation of DI is typically intractable and existing optimization techniques over discrete input alphabets require knowledge of the channel model, which renders them inapplicable when only samples are available. To overcome these limitations, we propose a novel estimation-optimization framework for DI over discrete input spaces. We formulate DI optimization as a Markov decision process and leverage reinforcement learning techniques to optimize a deep generative model of the input process probability mass function (PMF). Combining this optimizer with the recently developed DI neural estimator, we obtain an end-to-end estimation-optimization algorithm which is applied to estimating the (feedforward and feedback) capacity of various discrete channels with memory. Furthermore, we demonstrate how to use the optimized PMF model to (i) obtain theoretical bounds on the feedback capacity of unifilar finite-state channels; and (ii) perform probabilistic shaping of constellations in the peak power-constrained additive white Gaussian noise channel.

Submitted 2 January, 2023; originally announced January 2023.

arXiv:2212.12848 [pdf, other]

Gromov-Wasserstein Distances: Entropic Regularization, Duality, and Sample Complexity

Authors: Zhengxin Zhang, Ziv Goldfeld, Youssef Mroueh, Bharath K. Sriperumbudur

Abstract: The Gromov-Wasserstein (GW) distance, rooted in optimal transport (OT) theory, quantifies dissimilarity between metric measure spaces and provides a framework for aligning heterogeneous datasets. While computational aspects of the GW problem have been widely studied, a duality theory and fundamental statistical questions concerning empirical convergence rates remained obscure. This work closes these gaps for the quadratic GW distance over Euclidean spaces of different dimensions $d_x$ and $d_y$. We treat both the standard and the entropically regularized GW distance, and derive dual forms that represent them in terms of the well-understood OT and entropic OT (EOT) problems, respectively. This enables employing proof techniques from statistical OT based on regularity analysis of dual potentials and empirical process theory, using which we establish the first GW empirical convergence rates. The derived two-sample rates are $n^{-2/\max\{\min\{d_x,d_y\},4\}}$ (up to a log factor when $\min\{d_x,d_y\}=4$) for standard GW and $n^{-1/2}$ for EGW, which match the corresponding rates for standard and entropic OT. The parametric rate for EGW is evidently optimal, while for standard GW we provide matching lower bounds, which establish sharpness of the derived rates. We also study stability of EGW in the entropic regularization parameter and prove approximation and continuity results for the cost and optimal couplings. Lastly, the duality is leveraged to shed new light on the open problem of the one-dimensional GW distance between uniform distributions on $n$ points, illuminating why the identity and anti-identity permutations may not be optimal. Our results serve as a first step towards a comprehensive statistical theory as well as computational advancements for GW distances, based on the discovered dual formulations.

Submitted 28 September, 2023; v1 submitted 24 December, 2022; originally announced December 2022.

Comments: 47 pages
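
Illustration (not from the paper): the two-sample rates quoted in the abstract above, $n^{-2/\max\{\min\{d_x,d_y\},4\}}$ for standard GW and $n^{-1/2}$ for EGW, can be evaluated numerically to see how the smaller dimension drives the rate. The sketch below ignores constants and the log factor at $\min\{d_x,d_y\}=4$. The function names and the sample size are hypothetical choices for illustration.

```python
def gw_rate_exponent(d_x, d_y):
    """Exponent alpha in the rate n^(-alpha) for standard GW, as stated in the abstract."""
    return 2.0 / max(min(d_x, d_y), 4)

def egw_rate_exponent():
    """Exponent of the parametric rate n^(-1/2) for entropic GW."""
    return 0.5

n = 10_000  # hypothetical sample size
for d_x, d_y in [(2, 10), (4, 4), (8, 10), (16, 32)]:
    alpha = gw_rate_exponent(d_x, d_y)
    print(f"d_x={d_x:2d}, d_y={d_y:2d}: GW rate ~ n^(-{alpha:.3f}) = {n ** (-alpha):.4f}, "
          f"EGW rate ~ n^(-0.5) = {n ** (-egw_rate_exponent()):.4f}")
```

Note that the exponent depends only on the smaller of the two dimensions, and that it equals the parametric value 1/2 whenever that smaller dimension is at most 4.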

arXiv:2211.11184 [pdf, ps, other]

Limit distribution theory for $f$-Divergences

Authors: Sreejith Sreekumar, Ziv Goldfeld, Kengo Kato

Abstract: $f$-divergences, which quantify discrepancy between probability distributions, are ubiquitous in information theory, machine learning, and statistics. While there are numerous methods for estimating $f$-divergences from data, a limit distribution theory, which quantifies fluctuations of the estimation error, is largely obscure. As limit theorems are pivotal for valid statistical inference, to close this gap, we develop a general methodology for deriving distributional limits for $f$-divergences based on the functional delta method and Hadamard directional differentiability. Focusing on four prominent $f$-divergences -- Kullback-Leibler divergence, $\chi^2$ divergence, squared Hellinger distance, and total variation distance -- we identify sufficient conditions on the population distributions for the existence of distributional limits and characterize the limiting variables. These results are used to derive one- and two-sample limit theorems for Gaussian-smoothed $f$-divergences, both under the null and the alternative. Finally, an application of the limit distribution theory to auditing differential privacy is proposed and analyzed for significance level and power against local alternatives.

Submitted 12 October, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

arXiv:2210.12612 [pdf, ps, other]

doi 10.1109/TIT.2023.3296288

Pufferfish Privacy: An Information-Theoretic Study

Authors: Theshani Nuradha, Ziv Goldfeld

Abstract: Pufferfish privacy (PP) is a generalization of differential privacy (DP) that offers flexibility in specifying sensitive information and integrates domain knowledge into the privacy definition. Inspired by the illuminating formulation of DP in terms of mutual information due to Cuff and Yu, this work explores PP through the lens of information theory. We provide an information-theoretic formulation of PP, termed mutual information PP (MI PP), in terms of the conditional mutual information between the mechanism and the secret, given the public information. We show that MI PP is implied by the regular PP and characterize conditions under which the reverse implication is also true, recovering the relationship between DP and its information-theoretic variant as a special case. We establish convexity, composability, and post-processing properties for MI PP mechanisms and derive noise levels for the Gaussian and Laplace mechanisms. The obtained mechanisms are applicable under relaxed assumptions and provide improved noise levels in some regimes. Lastly, applications to auditing privacy frameworks, statistical inference tasks, and algorithm stability are explored.

Submitted 3 May, 2023; v1 submitted 23 October, 2022; originally announced October 2022.

Journal ref: IEEE Transactions on Information Theory, vol. 69, no. 11, pp. 7336-7356, Nov. 2023

arXiv:2210.09160 [pdf, other]

Statistical, Robustness, and Computational Guarantees for Sliced Wasserstein Distances

Authors: Sloan Nietert, Ritwik Sadhu, Ziv Goldfeld, Kengo Kato

Abstract: Sliced Wasserstein distances preserve properties of classic Wasserstein distances while being more scalable for computation and estimation in high dimensions. The goal of this work is to quantify this scalability from three key aspects: (i) empirical convergence rates; (ii) robustness to data contamination; and (iii) efficient computational methods. For empirical convergence, we derive fast rates with explicit dependence of constants on dimension, subject to log-concavity of the population distributions. For robustness, we characterize minimax optimal, dimension-free robust estimation risks, and show an equivalence between robust sliced 1-Wasserstein estimation and robust mean estimation. This enables lifting statistical and algorithmic guarantees available for the latter to the sliced 1-Wasserstein setting. Moving on to computational aspects, we analyze the Monte Carlo estimator for the average-sliced distance, demonstrating that larger dimension can result in faster convergence of the numerical integration error. For the max-sliced distance, we focus on a subgradient-based local optimization algorithm that is frequently used in practice, albeit without formal guarantees, and establish an $O(ε^{-4})$ computational complexity bound for it. Our theory is validated by numerical experiments, which altogether provide a comprehensive quantitative account of the scalability question.

Submitted 17 October, 2022; originally announced October 2022.

arXiv:2207.08683 [pdf, other]

Limit Theorems for Entropic Optimal Transport Maps and the Sinkhorn Divergence

Authors: Ziv Goldfeld, Kengo Kato, Gabriel Rioux, Ritwik Sadhu

Abstract: We study limit theorems for entropic optimal transport (EOT) maps, dual potentials, and the Sinkhorn divergence. The key technical tool we use is a first and second-order Hadamard differentiability analysis of EOT potentials with respect to the marginal distributions, which may be of independent interest. Given the differentiability results, the functional delta method is used to obtain central limit theorems for empirical EOT potentials and maps. The second-order functional delta method is leveraged to establish the limit distribution of the empirical Sinkhorn divergence under the null. Building on the latter result, we further derive the null limit distribution of the Sinkhorn independence test statistic and characterize the correct order. Since our limit theorems follow from Hadamard differentiability of the relevant maps, as a byproduct, we also obtain bootstrap consistency and asymptotic efficiency of the empirical EOT map, potentials, and Sinkhorn divergence.

Submitted 14 June, 2023; v1 submitted 18 July, 2022; originally announced July 2022.

Comments: 49 pages

arXiv:2206.08526 [pdf, other]

k-Sliced Mutual Information: A Quantitative Study of Scalability with Dimension

Authors: Ziv Goldfeld, Kristjan Greenewald, Theshani Nuradha, Galen Reeves

Abstract: Sliced mutual information (SMI) is defined as an average of mutual information (MI) terms between one-dimensional random projections of the random variables. It serves as a surrogate measure of dependence to classic MI that preserves many of its properties but is more scalable to high dimensions. However, a quantitative characterization of how SMI itself and estimation rates thereof depend on the ambient dimension, which is crucial to the understanding of scalability, remains obscure. This work provides a multifaceted account of the dependence of SMI on dimension, under a broader framework termed $k$-SMI, which considers projections to $k$-dimensional subspaces. Using a new result on the continuity of differential entropy in the 2-Wasserstein metric, we derive sharp bounds on the error of Monte Carlo (MC)-based estimates of $k$-SMI, with explicit dependence on $k$ and the ambient dimension, revealing their interplay with the number of samples. We then combine the MC integrator with the neural estimation framework to provide an end-to-end $k$-SMI estimator, for which optimal convergence rates are established. We also explore asymptotics of the population $k$-SMI as dimension grows, providing Gaussian approximation results with a residual that decays under appropriate moment bounds. All our results trivially apply to SMI by setting $k=1$. Our theory is validated with numerical experiments and is applied to sliced InfoGAN, which altogether provide a comprehensive quantitative account of the scalability question of $k$-SMI, including SMI as a special case when $k=1$.

Submitted 14 October, 2022; v1 submitted 16 June, 2022; originally announced June 2022.

Comments: Accepted at NeurIPS 2022

arXiv:2205.04283 [pdf, ps, other]

Statistical inference with regularized optimal transport

Authors: Ziv Goldfeld, Kengo Kato, Gabriel Rioux, Ritwik Sadhu

Abstract: Optimal transport (OT) is a versatile framework for comparing probability measures, with many applications to statistics, machine learning, and applied mathematics. However, OT distances suffer from computational and statistical scalability issues to high dimensions, which motivated the study of regularized OT methods like slicing, smoothing, and entropic penalty. This work establishes a unified framework for deriving limit distributions of empirical regularized OT distances, semiparametric efficiency of the plug-in empirical estimator, and bootstrap consistency. We apply the unified framework to provide a comprehensive statistical treatment of: (i) average- and max-sliced $p$-Wasserstein distances, for which several gaps in existing literature are closed; (ii) smooth distances with compactly supported kernels, the analysis of which is motivated by computational considerations; and (iii) entropic OT, for which our method generalizes existing limit distribution results and establishes, for the first time, efficiency and bootstrap consistency. While our focus is on these three regularized OT distances as applications, the flexibility of the proposed framework renders it applicable to broad classes of functionals beyond these examples.

Submitted 7 June, 2022; v1 submitted 9 May, 2022; originally announced May 2022.

Comments: 71 pages

arXiv:2203.14743 [pdf, ps, other]

Neural Estimation and Optimization of Directed Information over Continuous Spaces

Authors: Dor Tsur, Ziv Aharoni, Ziv Goldfeld, Haim Permuter

Abstract: This work develops a new method for estimating and optimizing the directed information rate between two jointly stationary and ergodic stochastic processes. Building upon recent advances in machine learning, we propose a recurrent neural network (RNN)-based estimator which is optimized via gradient ascent over the RNN parameters. The estimator does not require prior knowledge of the underlying joint and marginal distributions. The estimator is also readily optimized over continuous input processes realized by a deep generative model. We prove consistency of the proposed estimation and optimization methods and combine them to obtain end-to-end performance guarantees. Applications for channel capacity estimation of continuous channels with memory are explored, and empirical results demonstrating the scalability and accuracy of our method are provided. When the channel is memoryless, we investigate the mapping learned by the optimized input generator.

Submitted 28 March, 2022; originally announced March 2022.

Comments: 38 pages, 6 figures

arXiv:2203.00159 [pdf, ps, other]

Limit distribution theory for smooth $p$-Wasserstein distances

Authors: Ziv Goldfeld, Kengo Kato, Sloan Nietert, Gabriel Rioux

Abstract: The Wasserstein distance is a metric on a space of probability measures that has seen a surge of applications in statistics, machine learning, and applied mathematics. However, statistical aspects of Wasserstein distances are bottlenecked by the curse of dimensionality, whereby the number of data points needed to accurately estimate them grows exponentially with dimension. Gaussian smoothing was recently introduced as a means to alleviate the curse of dimensionality, giving rise to a parametric convergence rate in any dimension, while preserving the Wasserstein metric and topological structure. To facilitate valid statistical inference, in this work, we develop a comprehensive limit distribution theory for the empirical smooth Wasserstein distance. The limit distribution results leverage the functional delta method after embedding the domain of the Wasserstein distance into a certain dual Sobolev space, characterizing its Hadamard directional derivative for the dual Sobolev norm, and establishing weak convergence of the smooth empirical process in the dual space. To estimate the distributional limits, we also establish consistency of the nonparametric bootstrap. Finally, we use the limit distribution theory to study applications to generative modeling via minimum distance estimation with the smooth Wasserstein distance, showing asymptotic normality of optimal solutions for the quadratic cost.

Submitted 28 February, 2022; originally announced March 2022.

arXiv:2111.11328 [pdf, other]

Cycle Consistent Probability Divergences Across Different Spaces

Authors: Zhengxin Zhang, Youssef Mroueh, Ziv Goldfeld, Bharath K. Sriperumbudur

Abstract: Discrepancy measures between probability distributions are at the core of statistical inference and machine learning. In many applications, distributions of interest are supported on different spaces, and yet a meaningful correspondence between data points is desired. Motivated to explicitly encode consistent bidirectional maps into the discrepancy measure, this work proposes a novel unbalanced Monge optimal transport formulation for matching, up to isometries, distributions on different spaces. Our formulation arises as a principled relaxation of the Gromov-Hausdorff distance between metric spaces, and employs two cycle-consistent maps that push forward each distribution onto the other. We study structural properties of the proposed discrepancy and, in particular, show that it captures the popular cycle-consistent generative adversarial network (GAN) framework as a special case, thereby providing the theory to explain it. Motivated by computational efficiency, we then kernelize the discrepancy and restrict the mappings to parametric function classes. The resulting kernelized version is coined the generalized maximum mean discrepancy (GMMD). Convergence rates for empirical estimation of GMMD are studied and experiments to support our theory are provided.

Submitted 22 November, 2021; originally announced November 2021.

Comments: 35 pages

arXiv:2111.01361 [pdf, other]

Outlier-Robust Optimal Transport: Duality, Structure, and Statistical Analysis

Authors: Sloan Nietert, Rachel Cummings, Ziv Goldfeld

Abstract: The Wasserstein distance, rooted in optimal transport (OT) theory, is a popular discrepancy measure between probability distributions with various applications to statistics and machine learning. Despite their rich structure and demonstrated utility, Wasserstein distances are sensitive to outliers in the considered distributions, which hinders applicability in practice. We propose a new outlier-robust Wasserstein distance $\mathsf{W}_p^\varepsilon$ which allows for $\varepsilon$ outlier mass to be removed from each contaminated distribution. Under standard moment assumptions, $\mathsf{W}_p^\varepsilon$ is shown to achieve strong robust estimation guarantees under the Huber $\varepsilon$-contamination model. Our formulation of this robust distance amounts to a highly regular optimization problem that lends itself better for analysis compared to previously considered frameworks. Leveraging this, we conduct a thorough theoretical study of $\mathsf{W}_p^\varepsilon$, encompassing robustness guarantees, characterization of optimal perturbations, regularity, duality, and statistical estimation. In particular, by decoupling the optimization variables, we arrive at a simple dual form for $\mathsf{W}_p^\varepsilon$ that can be implemented via an elementary modification to standard, duality-based OT solvers. We illustrate the virtues of our framework via applications to generative modeling with contaminated datasets.

Submitted 28 February, 2023; v1 submitted 2 November, 2021; originally announced November 2021.

Comments: updated to match AISTATS publication

arXiv:2110.05279 [pdf, ps, other]

Sliced Mutual Information: A Scalable Measure of Statistical Dependence

Authors: Ziv Goldfeld, Kristjan Greenewald

Abstract: Mutual information (MI) is a fundamental measure of statistical dependence, with a myriad of applications to information theory, statistics, and machine learning. While it possesses many desirable structural properties, the estimation of high-dimensional MI from samples suffers from the curse of dimensionality. Motivated by statistical scalability to high dimensions, this paper proposes sliced MI (SMI) as a surrogate measure of dependence. SMI is defined as an average of MI terms between one-dimensional random projections. We show that it preserves many of the structural properties of classic MI, while gaining scalable computation and efficient estimation from samples. Furthermore, and in contrast to classic MI, SMI can grow as a result of deterministic transformations. This enables leveraging SMI for feature extraction by optimizing it over processing functions of raw data to identify useful representations thereof. Our theory is supported by numerical studies of independence testing and feature extraction, which demonstrate the potential gains SMI offers over classic MI for high-dimensional inference.

Submitted 18 October, 2021; v1 submitted 11 October, 2021; originally announced October 2021.

arXiv:2110.03652 [pdf, ps, other]

Neural Estimation of Statistical Divergences

Authors: Sreejith Sreekumar, Ziv Goldfeld

Abstract: Statistical divergences (SDs), which quantify the dissimilarity between probability distributions, are a basic constituent of statistical inference and machine learning. A modern method for estimating those divergences relies on parametrizing an empirical variational form by a neural network (NN) and optimizing over parameter space. Such neural estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. We establish non-asymptotic absolute error bounds for a neural estimator realized by a shallow NN, focusing on four popular $\mathsf{f}$-divergences -- Kullback-Leibler, chi-squared, squared Hellinger, and total variation. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory to bound the two sources of error involved: function approximation and empirical estimation. The bounds characterize the effective error in terms of NN size and the number of samples, and reveal scaling rates that ensure consistency. For compactly supported distributions, we further show that neural estimators of the first three divergences above with appropriate NN growth-rate are minimax rate-optimal, achieving the parametric convergence rate.

Submitted 29 March, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

arXiv:2107.13494 [pdf, ps, other]

Limit Distribution Theory for the Smooth 1-Wasserstein Distance with Applications

Authors: Ritwik Sadhu, Ziv Goldfeld, Kengo Kato

Abstract: The smooth 1-Wasserstein distance (SWD) $W_1^σ$ was recently proposed as a means to mitigate the curse of dimensionality in empirical approximation while preserving the Wasserstein structure. Indeed, SWD exhibits parametric convergence rates and inherits the metric and topological structure of the classic Wasserstein distance. Motivated by the above, this work conducts a thorough statistical study of the SWD, including a high-dimensional limit distribution result for empirical $W_1^σ$, bootstrap consistency, concentration inequalities, and Berry-Esseen type bounds. The derived nondegenerate limit stands in sharp contrast with the classic empirical $W_1$, for which a similar result is known only in the one-dimensional case. We also explore asymptotics and characterize the limit distribution when the smoothing parameter $σ$ is scaled with $n$, converging to $0$ at a sufficiently slow rate. The dimensionality of the sampled distribution enters empirical SWD convergence bounds only through the prefactor (i.e., the constant). We provide a sharp characterization of this prefactor's dependence on the smoothing parameter and the intrinsic dimension. This result is then used to derive new empirical convergence rates for classic $W_1$ in terms of the intrinsic dimension. As applications of the limit distribution theory, we study two-sample testing and minimum distance estimation (MDE) under $W_1^σ$. We establish asymptotic validity of SWD testing, while for MDE, we prove measurability, almost sure convergence, and limit distributions for optimal estimators and their corresponding $W_1^σ$ error. Our results suggest that the SWD is well suited for high-dimensional statistical learning and inference.

Submitted 24 February, 2022; v1 submitted 28 July, 2021; originally announced July 2021.

MSC Class: 62E17; 60F05; 60F17; 62G10; 62F12; 62F40

arXiv:2103.06923 [pdf, other]

Non-Asymptotic Performance Guarantees for Neural Estimation of $\mathsf{f}$-Divergences

Authors: Sreejith Sreekumar, Zhengxin Zhang, Ziv Goldfeld

Abstract: Statistical distances (SDs), which quantify the dissimilarity between probability distributions, are central to machine learning and statistics. A modern method for estimating such distances from data relies on parametrizing a variational form by a neural network (NN) and optimizing it. These estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. In particular, there seems to be a fundamental tradeoff between the two sources of error involved: approximation and estimation. While the former needs the NN class to be rich and expressive, the latter relies on controlling complexity. This paper explores this tradeoff by means of non-asymptotic error bounds, focusing on three popular choices of SDs -- Kullback-Leibler divergence, chi-squared divergence, and squared Hellinger distance. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory. Numerical results validating the theory are also provided.

Submitted 16 March, 2021; v1 submitted 11 March, 2021; originally announced March 2021.

arXiv:2101.04039 [pdf, other]

Smooth $p$-Wasserstein Distance: Structure, Empirical Approximation, and Statistical Applications

Authors: Sloan Nietert, Ziv Goldfeld, Kengo Kato

Abstract: Discrepancy measures between probability distributions, often termed statistical distances, are ubiquitous in probability theory, statistics and machine learning. To combat the curse of dimensionality when estimating these distances from data, recent work has proposed smoothing out local irregularities in the measured distributions via convolution with a Gaussian kernel. Motivated by the scalability of this framework to high dimensions, we investigate the structural and statistical behavior of the Gaussian-smoothed $p$-Wasserstein distance $\mathsf{W}_p^{(σ)}$, for arbitrary $p\geq 1$. After establishing basic metric and topological properties of $\mathsf{W}_p^{(σ)}$, we explore the asymptotic statistical behavior of $\mathsf{W}_p^{(σ)}(\hatμ_n,μ)$, where $\hatμ_n$ is the empirical distribution of $n$ independent observations from $μ$. We prove that $\mathsf{W}_p^{(σ)}$ enjoys a parametric empirical convergence rate of $n^{-1/2}$, which contrasts the $n^{-1/d}$ rate for unsmoothed $\mathsf{W}_p$ when $d \geq 3$. Our proof relies on controlling $\mathsf{W}_p^{(σ)}$ by a $p$th-order smooth Sobolev distance $\mathsf{d}_p^{(σ)}$ and deriving the limit distribution of $\sqrt{n}\,\mathsf{d}_p^{(σ)}(\hatμ_n,μ)$, for all dimensions $d$. As applications, we provide asymptotic guarantees for two-sample testing and minimum distance estimation using $\mathsf{W}_p^{(σ)}$, with experiments for $p=2$ using a maximum mean discrepancy formulation of $\mathsf{d}_2^{(σ)}$.

Submitted 17 December, 2021; v1 submitted 11 January, 2021; originally announced January 2021.

Comments: updated to match ICML 2021 paper

arXiv:2004.14941 [pdf, other]

The Information Bottleneck Problem and Its Applications in Machine Learning

Authors: Ziv Goldfeld, Yury Polyanskiy

Abstract: Inference capabilities of machine learning (ML) systems skyrocketed in recent years, now playing a pivotal role in various aspects of society. The goal in statistical learning is to use data to obtain simple algorithms for predicting a random variable $Y$ from a correlated observation $X$. Since the dimension of $X$ is typically huge, computationally feasible solutions should summarize it into a lower-dimensional feature vector $T$, from which $Y$ is predicted. The algorithm will successfully make the prediction if $T$ is a good proxy of $Y$, despite the said dimensionality-reduction. A myriad of ML algorithms (mostly employing deep learning (DL)) for finding such representations $T$ based on real-world data are now available. While these methods are often effective in practice, their success is hindered by the lack of a comprehensive theory to explain it. The information bottleneck (IB) theory recently emerged as a bold information-theoretic paradigm for analyzing DL systems. Adopting mutual information as the figure of merit, it suggests that the best representation $T$ should be maximally informative about $Y$ while minimizing the mutual information with $X$. In this tutorial we survey the information-theoretic origins of this abstract principle, and its recent impact on DL. For the latter, we cover implications of the IB problem on DL theory, as well as practical algorithms inspired by it. Our goal is to provide a unified and cohesive description. A clear view of current knowledge is particularly important for further leveraging IB and other information-theoretic ideas to study DL models.

Submitted 1 May, 2020; v1 submitted 30 April, 2020; originally announced April 2020.

arXiv:2004.04330 [pdf, other]

The Secrecy Capacity of Cost-Constrained Wiretap Channels

Authors: Sreejith Sreekumar, Alexander Bunin, Ziv Goldfeld, Haim H. Permuter, Shlomo Shamai

Abstract: In many information-theoretic channel coding problems, adding an input cost constraint to the operational setup amounts to restricting the optimization domain in the capacity formula. This paper shows that, in contrast to common belief, such a simple modification does not hold for the cost-constrained (CC) wiretap channel (WTC). The secrecy-capacity of the discrete memoryless (DM) WTC without cost constraints is described by a single auxiliary random variable. For the CC DM-WTC, however, we show that two auxiliaries are necessary to achieve capacity. Specifically, we first derive the secrecy-capacity formula, proving the direct part via superposition coding. Then, we provide an example of a CC DM-WTC whose secrecy-capacity cannot be achieved using a single auxiliary. This establishes the fundamental role of superposition coding over CC WTCs.

Submitted 26 December, 2020; v1 submitted 8 April, 2020; originally announced April 2020.

arXiv:2003.04179 [pdf, ps, other]

Capacity of Continuous Channels with Memory via Directed Information Neural Estimator

Authors: Ziv Aharoni, Dor Tsur, Ziv Goldfeld, Haim Henry Permuter

Abstract: Calculating the capacity (with or without feedback) of channels with memory and continuous alphabets is a challenging task. It requires optimizing the directed information (DI) rate over all channel input distributions. The objective is a multi-letter expression, whose analytic solution is only known for a few specific cases. When no analytic solution is present or the channel model is unknown, there is no unified framework for calculating or even approximating capacity. This work proposes a novel capacity estimation algorithm that treats the channel as a 'black-box', both when feedback is or is not present. The algorithm has two main ingredients: (i) a neural distribution transformer (NDT) model that shapes a noise variable into the channel input distribution, which we are able to sample, and (ii) the DI neural estimator (DINE) that estimates the communication rate of the current NDT model. These models are trained by an alternating maximization procedure to both estimate the channel capacity and obtain an NDT for the optimal input distribution. The method is demonstrated on the moving average additive Gaussian noise channel, where it is shown that both the capacity and feedback capacity are estimated without knowledge of the channel transition kernel. The proposed estimation framework opens the door to a myriad of capacity approximation results for continuous alphabet channels that were inaccessible until now.

Submitted 16 May, 2020; v1 submitted 9 March, 2020; originally announced March 2020.

arXiv:2002.01013 [pdf, other]

Limit Distribution for Smooth Total Variation and $χ^2$-Divergence in High Dimensions

Authors: Ziv Goldfeld, Kengo Kato

Abstract: Statistical divergences are ubiquitous in machine learning as tools for measuring discrepancy between probability distributions. As these applications inherently rely on approximating distributions from samples, we consider empirical approximation under two popular $f$-divergences: the total variation (TV) distance and the $χ^2$-divergence. To circumvent the sensitivity of these divergences to support mismatch, the framework of Gaussian smoothing is adopted. We study the limit distributions of $\sqrt{n}δ_{\mathsf{TV}}(P_n\ast\mathcal{N}_σ,P\ast\mathcal{N}_σ)$ and $nχ^2(P_n\ast\mathcal{N}_σ\|P\ast\mathcal{N}_σ)$, where $P_n$ is the empirical measure based on $n$ independently and identically distributed (i.i.d.) observations from $P$, $\mathcal{N}_σ:=\mathcal{N}(0,σ^2\mathrm{I}_d)$, and $\ast$ stands for convolution. In arbitrary dimension, the limit distributions are characterized in terms of a Gaussian process on $\mathbb{R}^d$ with a covariance operator that depends on $P$ and the isotropic Gaussian density of parameter $σ$. This, in turn, implies optimality of the $n^{-1/2}$ expected value convergence rates recently derived for $δ_{\mathsf{TV}}(P_n\ast\mathcal{N}_σ,P\ast\mathcal{N}_σ)$ and $χ^2(P_n\ast\mathcal{N}_σ\|P\ast\mathcal{N}_σ)$. These strong statistical guarantees promote empirical approximation under Gaussian smoothing as a potent framework for learning and inference based on high-dimensional data.

Submitted 30 April, 2020; v1 submitted 3 February, 2020; originally announced February 2020.

arXiv:2002.01012 [pdf, ps, other]

Asymptotic Guarantees for Generative Modeling Based on the Smooth Wasserstein Distance

Authors: Ziv Goldfeld, Kristjan Greenewald, Kengo Kato

Abstract: Minimum distance estimation (MDE) has recently gained attention as a formulation of (implicit) generative modeling. It considers minimizing, over model parameters, a statistical distance between the empirical data distribution and the model. This formulation lends itself well to theoretical analysis, but typical results are hindered by the curse of dimensionality. To overcome this and devise a scalable finite-sample statistical MDE theory, we adopt the framework of smooth 1-Wasserstein distance (SWD) $\mathsf{W}_1^{(σ)}$. The SWD was recently shown to preserve the metric and topological structure of classic Wasserstein distances, while enjoying dimension-free empirical convergence rates. In this work, we conduct a thorough statistical study of the minimum smooth Wasserstein estimators (MSWEs), first proving the estimator's measurability and asymptotic consistency. We then characterize the limit distribution of the optimal model parameters and their associated minimal SWD. These results imply an $O(n^{-1/2})$ generalization bound for generative modeling based on MSWE, which holds in arbitrary dimension. Our main technical tool is a novel high-dimensional limit distribution result for empirical $\mathsf{W}_1^{(σ)}$. The characterization of a nondegenerate limit stands in sharp contrast with the classic empirical 1-Wasserstein distance, for which a similar result is known only in the one-dimensional case. The validity of our theory is supported by empirical results, posing the SWD as a potent tool for learning and inference in high dimensions.

Submitted 19 October, 2020; v1 submitted 3 February, 2020; originally announced February 2020.

arXiv:2001.09206 [pdf, other]

Gaussian-Smooth Optimal Transport: Metric Structure and Statistical Efficiency

Authors: Ziv Goldfeld, Kristjan Greenewald

Abstract: Optimal transport (OT), and in particular the Wasserstein distance, has seen a surge of interest and applications in machine learning. However, empirical approximation under Wasserstein distances suffers from a severe curse of dimensionality, rendering them impractical in high dimensions. As a result, entropically regularized OT has become a popular workaround. However, while it enjoys fast algorithms and better statistical properties, it loses the metric structure that Wasserstein distances enjoy. This work proposes a novel Gaussian-smoothed OT (GOT) framework that achieves the best of both worlds: preserving the 1-Wasserstein metric structure while alleviating the empirical approximation curse of dimensionality. Furthermore, as the Gaussian-smoothing parameter shrinks to zero, GOT $Γ$-converges towards classic OT (with convergence of optimizers), thus serving as a natural extension. An empirical study that supports the theoretical results is provided, promoting Gaussian-smoothed OT as a powerful alternative to entropic OT.

Submitted 24 January, 2020; originally announced January 2020.

arXiv:1905.13576 [pdf, other]

Convergence of Smoothed Empirical Measures with Applications to Entropy Estimation

Authors: Ziv Goldfeld, Kristjan Greenewald, Yury Polyanskiy, Jonathan Weed

Abstract: This paper studies convergence of empirical measures smoothed by a Gaussian kernel. Specifically, consider approximating $P\ast\mathcal{N}_σ$, for $\mathcal{N}_σ\triangleq\mathcal{N}(0,σ^2 \mathrm{I}_d)$, by $\hat{P}_n\ast\mathcal{N}_σ$, where $\hat{P}_n$ is the empirical measure, under different statistical distances. The convergence is examined in terms of the Wasserstein distance, total variation (TV), Kullback-Leibler (KL) divergence, and $χ^2$-divergence. We show that the approximation error under the TV distance and 1-Wasserstein distance ($\mathsf{W}_1$) converges at rate $e^{O(d)}n^{-\frac{1}{2}}$ in remarkable contrast to a typical $n^{-\frac{1}{d}}$ rate for unsmoothed $\mathsf{W}_1$ (and $d\ge 3$). For the KL divergence, squared 2-Wasserstein distance ($\mathsf{W}_2^2$), and $χ^2$-divergence, the convergence rate is $e^{O(d)}n^{-1}$, but only if $P$ achieves finite input-output $χ^2$ mutual information across the additive white Gaussian noise channel. If the latter condition is not met, the rate changes to $ω(n^{-1})$ for the KL divergence and $\mathsf{W}_2^2$, while the $χ^2$-divergence becomes infinite - a curious dichotomy. As a main application we consider estimating the differential entropy $h(P\ast\mathcal{N}_σ)$ in the high-dimensional regime. The distribution $P$ is unknown but $n$ i.i.d. samples from it are available. We first show that any good estimator of $h(P\ast\mathcal{N}_σ)$ must have sample complexity that is exponential in $d$. Using the empirical approximation results we then show that the absolute-error risk of the plug-in estimator converges at the parametric rate $e^{O(d)}n^{-\frac{1}{2}}$, thus establishing the minimax rate-optimality of the plug-in. Numerical results that demonstrate a significant empirical superiority of the plug-in approach to general-purpose differential entropy estimators are provided.

Submitted 1 May, 2020; v1 submitted 30 May, 2019; originally announced May 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1810.11589

arXiv:1810.11589

Estimating Differential Entropy under Gaussian Convolutions

Authors: Ziv Goldfeld, Kristjan Greenewald, Yury Polyanskiy

Abstract: This paper studies the problem of estimating the differential entropy $h(S+Z)$, where $S$ and $Z$ are independent $d$-dimensional random variables with $Z\sim\mathcal{N}(0,σ^2 \mathrm{I}_d)$. The distribution of $S$ is unknown, but $n$ independently and identically distributed (i.i.d.) samples from it are available. The question is whether having access to samples of $S$ as opposed to samples of $S+Z$ can improve estimation performance. We show that the answer is positive. More concretely, we first show that despite the regularizing effect of noise, the number of required samples still needs to scale exponentially in $d$. This result is proven via a random-coding argument that reduces the question to estimating the Shannon entropy on a $2^{O(d)}$-sized alphabet. Next, for a fixed $d$ and $n$ large enough, it is shown that a simple plugin estimator, given by the differential entropy of the empirical distribution from $S$ convolved with the Gaussian density, achieves the loss of $O\left((\log n)^{d/4}/\sqrt{n}\right)$. Note that the plugin estimator amounts here to the differential entropy of a $d$-dimensional Gaussian mixture, for which we propose an efficient Monte Carlo computation algorithm. At the same time, estimating $h(S+Z)$ via popular differential entropy estimators (based on kernel density estimation (KDE) or k nearest neighbors (kNN) techniques) applied to samples from $S+Z$ would only attain much slower rates of order $O(n^{-1/d})$, despite the smoothness of $P_{S+Z}$. As an application, which was in fact our original motivation for the problem, we estimate information flows in deep neural networks and discuss Tishby's Information Bottleneck and the compression conjecture, among others.
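The plugin estimator described in this abstract is the differential entropy of an equal-weight Gaussian mixture centered at the samples of $S$. A minimal sketch of one naive way to approximate that quantity by Monte Carlo follows; it is not the authors' algorithm, and the function names `log_mixture_density` and `plugin_entropy_mc` are hypothetical. It draws points from the mixture and averages the negative log-density, so the result is in nats.

```python
import math
import random


def log_mixture_density(x, centers, sigma):
    """Log-density at x of the equal-weight mixture of N(c, sigma^2 I) over centers."""
    d = len(x)
    log_norm = -0.5 * d * math.log(2 * math.pi * sigma ** 2)
    exps = [
        -sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / (2 * sigma ** 2)
        for c in centers
    ]
    m = max(exps)  # log-sum-exp shift for numerical stability
    return log_norm + m + math.log(sum(math.exp(e - m) for e in exps)) - math.log(len(centers))


def plugin_entropy_mc(centers, sigma, num_draws, rng):
    """Naive Monte Carlo estimate of h(P_n * N_sigma): average of -log q(X), X ~ q."""
    total = 0.0
    for _ in range(num_draws):
        c = rng.choice(centers)                       # pick a mixture component
        x = [ci + rng.gauss(0.0, sigma) for ci in c]  # add isotropic Gaussian noise
        total -= log_mixture_density(x, centers, sigma)
    return total / num_draws


# Sanity check: a single center reduces the mixture to N(0, sigma^2),
# whose entropy is 0.5 * log(2 * pi * e * sigma^2).
estimate = plugin_entropy_mc([(0.0,)], 1.0, 20000, random.Random(0))
exact = 0.5 * math.log(2 * math.pi * math.e)
```

With a single center the estimate can be compared against the closed-form Gaussian entropy, which is a convenient check before running on real samples.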

Submitted 2 June, 2019; v1 submitted 26 October, 2018; originally announced October 2018.

Comments: A significantly updated version with a different set of authors replaces this manuscript. New version available at arXiv:1905.13576

arXiv:1810.05728 [pdf, other]

Estimating Information Flow in Deep Neural Networks

Authors: Ziv Goldfeld, Ewout van den Berg, Kristjan Greenewald, Igor Melnyk, Nam Nguyen, Brian Kingsbury, Yury Polyanskiy

Abstract: We study the flow of information and the evolution of internal representations during deep neural network (DNN) training, aiming to demystify the compression aspect of the information bottleneck theory. The theory suggests that DNN training comprises a rapid fitting phase followed by a slower compression phase, in which the mutual information $I(X;T)$ between the input $X$ and internal representations $T$ decreases. Several papers observe compression of estimated mutual information on different DNN models, but the true $I(X;T)$ over these networks is provably either constant (discrete $X$) or infinite (continuous $X$). This work explains the discrepancy between theory and experiments, and clarifies what was actually measured by these past works. To this end, we introduce an auxiliary (noisy) DNN framework for which $I(X;T)$ is a meaningful quantity that depends on the network's parameters. This noisy framework is shown to be a good proxy for the original (deterministic) DNN both in terms of performance and the learned representations. We then develop a rigorous estimator for $I(X;T)$ in noisy DNNs and observe compression in various models. By relating $I(X;T)$ in the noisy DNN to an information-theoretic communication problem, we show that compression is driven by the progressive clustering of hidden representations of inputs from the same class. Several methods to directly monitor clustering of hidden representations, both in noisy and deterministic DNNs, are used to show that meaningful clusters form in the $T$ space. Finally, we return to the estimator of $I(X;T)$ employed in past works, and demonstrate that while it fails to capture the true (vacuous) mutual information, it does serve as a measure for clustering. This clarifies the past observations of compression and isolates the geometric clustering of hidden representations as the true phenomenon of interest.

Submitted 30 May, 2019; v1 submitted 12 October, 2018; originally announced October 2018.

Comments: Main text accepted to ICML 2019. This preprint contains the full version of that paper (including omitted appendices)

arXiv:1805.03027 [pdf, ps, other]

Information Storage in the Stochastic Ising Model

Authors: Ziv Goldfeld, Guy Bresler, Yury Polyanskiy

Abstract: Most information storage devices write data by modifying the local state of matter, in the hope that sub-atomic local interactions stabilize the state for a sufficiently long time, thereby allowing later recovery. Motivated to explore how temporal evolution of physical states in magnetic storage media affects their capacity, this work initiates the study of information retention in locally-interacting particle systems. The system dynamics follow the stochastic Ising model (SIM) over a 2-dimensional $\sqrt{n}\times\sqrt{n}$ grid. The initial spin configuration $X_0$ serves as the user-controlled input. The output configuration $X_t$ is produced by running $t$ steps of Glauber dynamics. Our main goal is to evaluate the information capacity $I_n(t):=\max_{p_{X_0}}I(X_0;X_t)$ when time $t$ scales with the system's size $n$. While the positive (but low) temperature regime is our main interest, we start by exploring the simpler zero-temperature dynamics. We first show that at zero temperature, order of $\sqrt{n}$ bits can be stored in the system indefinitely by coding over stable, striped configurations. While $\sqrt{n}$ is order optimal for infinite time, backing off to $t<\infty$, higher orders of $I_n(t)$ are achievable. First, linear coding arguments imply that $I_n(t) = Θ(n)$ for $t=O(n)$. To go beyond the linear scale, we develop a droplet-based achievability scheme that reliably stores $Ω\left(n/\log n\right)$ for $t=O(n\log n)$ time ($\log n$ can be replaced with any $o(n)$ function). Moving to the positive but low temperature regime, two main results are provided. First, we show that an initial configuration drawn from the Gibbs measure cannot retain more than a single bit for $t\geq \exp(Cβn^{1/4+ε})$ time. On the other hand, when scaling time with the inverse temperature $β$, the stripe-based coding scheme is shown to retain its bits for $e^{cβ}$ time.
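The claim that striped configurations are stable at zero temperature can be illustrated with a small check. The sketch below rests on assumptions not stated in the abstract: free boundary conditions, nearest-neighbor interactions, and a zero-temperature Glauber rule in which a selected spin moves to the majority of its neighbors (a tie may flip it). Under that rule a configuration is stable exactly when every spin strictly agrees with the majority of its neighbors. The helper names `is_stable` and `striped` are hypothetical.

```python
def is_stable(grid):
    """True if every spin strictly agrees with the majority of its grid neighbors."""
    n_rows, n_cols = len(grid), len(grid[0])
    for i in range(n_rows):
        for j in range(n_cols):
            agree = disagree = 0
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                a, b = i + di, j + dj
                if 0 <= a < n_rows and 0 <= b < n_cols:  # free boundary (assumed)
                    if grid[a][b] == grid[i][j]:
                        agree += 1
                    else:
                        disagree += 1
            if agree <= disagree:  # a tie or a minority spin can flip
                return False
    return True


def striped(n, width):
    """n x n grid of horizontal stripes of the given width with alternating spins."""
    return [[1 if (i // width) % 2 == 0 else -1] * n for i in range(n)]


checkerboard = [[1 if (i + j) % 2 == 0 else -1 for j in range(6)] for i in range(6)]
wide_ok = is_stable(striped(6, 2))    # stripes of width 2
thin_ok = is_stable(striped(6, 1))    # stripes of width 1
checker_ok = is_stable(checkerboard)
```

In this toy setting, stripes of width two pass the check while width-one stripes and the checkerboard do not, since the latter leave spins tied with or outvoted by their neighbors.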

Submitted 23 December, 2020; v1 submitted 8 May, 2018; originally announced May 2018.

arXiv:1712.10299 [pdf, ps, other]

Wiretap and Gelfand-Pinsker Channels Analogy and its Applications

Authors: Ziv Goldfeld, Haim H. Permuter

Abstract: An analogy framework between wiretap channels (WTCs) and state-dependent point-to-point channels with non-causal encoder channel state information (referred to as Gelfand-Pinsker channels (GPCs)) is proposed. A good sequence of stealth-wiretap codes is shown to induce a good sequence of codes for a corresponding GPC. Consequently, the framework enables exploiting existing results for GPCs to produce converse proofs for their wiretap analogs. The analogy readily extends to multiuser broadcasting scenarios, encompassing broadcast channels (BCs) with deterministic components, degradation ordering between users, and BCs with cooperative receivers. Given a wiretap BC (WTBC) with two receivers and one eavesdropper, an analogous Gelfand-Pinsker BC (GPBC) is constructed by converting the eavesdropper's observation sequence into a state sequence with an appropriate product distribution (induced by the stealth-wiretap code for the WTBC), and non-causally revealing the states to the encoder. The transition matrix of the state-dependent GPBC is extracted from WTBC's transition law, with the eavesdropper's output playing the role of the channel state. Past capacity results for the semi-deterministic (SD) GPBC and the physically-degraded (PD) GPBC with an informed receiver are leveraged to furnish analogy-based converse proofs for the analogous WTBC setups. This characterizes the secrecy-capacity regions of the SD-WTBC and the PD-WTBC, in which the stronger receiver also observes the eavesdropper's channel output. These derivations exemplify how the wiretap-GP analogy enables translating results on one problem into advances in the study of the other.

Submitted 28 May, 2019; v1 submitted 29 December, 2017; originally announced December 2017.

arXiv:1708.04283 [pdf, ps, other]

Key and Message Semantic-Security over State-Dependent Channels

Authors: Alexander Bunin, Ziv Goldfeld, Haim H. Permuter, Shlomo Shamai, Paul Cuff, Pablo Piantanida

Abstract: We study the trade-off between secret message (SM) and secret key (SK) rates, simultaneously achievable over a state-dependent (SD) wiretap channel (WTC) with non-causal channel state information (CSI) at the encoder. This model subsumes other instances of CSI availability as special cases, and calls for efficient utilization of the state sequence for both reliability and security purposes. An inner bound on the semantic-security (SS) SM-SK capacity region is derived based on a superposition coding scheme inspired by a past work of the authors. The region is shown to attain capacity for a certain class of SD-WTCs. SS is established by virtue of two versions of the strong soft-covering lemma. The derived region yields an improvement upon the previously best known SM-SK trade-off result reported by Prabhakaran et al., and, to the best of our knowledge, upon all other existing lower bounds for either SM or SK for this setup, even if the semantic security requirement is relaxed to weak secrecy. It is demonstrated that our region can be strictly larger than those reported in the preceding works.

Submitted 7 June, 2019; v1 submitted 14 August, 2017; originally announced August 2017.

arXiv:1610.03990 [pdf, ps, other]

Fourier-Motzkin Elimination Software for Information Theoretic Inequalities

Authors: Ido B. Gattegno, Ziv Goldfeld, Haim H. Permuter

Abstract: We provide open-source software implemented in MATLAB that performs Fourier-Motzkin elimination (FME) and removes constraints that are redundant due to Shannon-type inequalities (STIs). The FME is often used in information theoretic contexts to simplify rate regions, e.g., by eliminating auxiliary rates. Occasionally, however, the procedure becomes cumbersome, which makes an error-free hand-written derivation an elusive task. Some software packages have circumvented this difficulty by exploiting an automated FME process. However, the outputs of such software often include constraints that are inactive due to information theoretic properties. By incorporating the notion of STIs (a class of information inequalities provable via a computer program), our algorithm removes such redundant constraints based on non-negativity properties, chain-rules and probability mass function factorization. This newsletter first illustrates the program's abilities, and then reviews the contribution of STIs to the identification of redundant constraints.
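To make the elimination step concrete, here is a minimal Python sketch of a single textbook Fourier-Motzkin step on a system of inequalities of the form a·x ≤ b. It is not the MATLAB software described above and performs no STI-based redundancy removal; the function name `eliminate` and the toy rate region are hypothetical.

```python
from fractions import Fraction


def eliminate(rows, k):
    """Eliminate variable k from inequalities given as (coeffs, bound), meaning
    sum(coeffs[i] * x[i]) <= bound. Returns the projected system without x[k]."""
    pos, neg, zero = [], [], []
    for a, b in rows:
        a = [Fraction(v) for v in a]
        b = Fraction(b)
        (pos if a[k] > 0 else neg if a[k] < 0 else zero).append((a, b))

    # Rows that do not involve x[k] carry over unchanged.
    out = [(a[:k] + a[k + 1:], b) for a, b in zero]

    # Pair each upper bound on x[k] with each lower bound and add them after
    # scaling the x[k] coefficients to +1 and -1, so that x[k] cancels.
    for ap, bp in pos:
        for an, bn in neg:
            sp, sn = 1 / ap[k], 1 / -an[k]
            a = [sp * u + sn * v for u, v in zip(ap, an)]
            out.append((a[:k] + a[k + 1:], sp * bp + sn * bn))
    return out


# Toy region in (R, Ra), with Ra an auxiliary rate:
#   R + Ra <= 5,   Ra >= 2 (written as -Ra <= -2),   R >= 0 (written as -R <= 0).
region = eliminate([([1, 1], 5), ([0, -1], -2), ([-1, 0], 0)], 1)
```

For this toy system, eliminating the auxiliary rate leaves the two constraints -R ≤ 0 and R ≤ 3. Exact rational arithmetic via `Fraction` avoids floating-point drift when many pairs are combined.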

Submitted 13 October, 2016; originally announced October 2016.

arXiv:1608.06057 [pdf, ps, other]

MIMO Gaussian Broadcast Channels with Common, Private and Confidential Messages

Authors: Ziv Goldfeld, Haim H. Permuter

Abstract: The two-user multiple-input multiple-output (MIMO) Gaussian broadcast channel (BC) with common, private and confidential messages is considered. The transmitter sends a common message to both users, a confidential message to User 1 and a private (non-confidential) message to User 2. The secrecy-capacity region is characterized by showing that certain inner and outer bounds coincide and that the boundary points are achieved by Gaussian inputs, which enables the development of a tight converse. The proof relies on factorization of upper concave envelopes and a variant of dirty-paper coding (DPC). It is shown that the entire region is exhausted by using DPC to cancel out the signal of the non-confidential message at Receiver 1, thus making DPC against the signal of the confidential message unnecessary. A numerical example illustrates the secrecy-capacity results.

Submitted 28 May, 2019; v1 submitted 22 August, 2016; originally announced August 2016.

arXiv:1608.00743 [pdf, ps, other]

Wiretap Channels with Random States Non-Causally Available at the Encoder

Authors: Ziv Goldfeld, Paul Cuff, Haim H. Permuter

Abstract: We study the state-dependent (SD) wiretap channel (WTC) with non-causal channel state information (CSI) at the encoder. This model subsumes all other instances of CSI availability as special cases, and calls for an efficient utilization of the state sequence for both reliability and security purposes. A lower bound on the secrecy-capacity, that improves upon the previously best known result published by Prabhakaran et al., is derived based on a novel superposition coding scheme. Our achievability gives rise to the exact secrecy-capacity characterization of a class of SD-WTCs that decompose into a product of two WTCs, where one is independent of the state and the other one depends only on the state. The results are derived under the strict semantic-security metric that requires negligible information leakage for all message distributions.

Submitted 28 May, 2019; v1 submitted 2 August, 2016; originally announced August 2016.

arXiv:1601.03660 [pdf, ps, other]

Arbitrarily Varying Wiretap Channels with Type Constrained States

Authors: Ziv Goldfeld, Paul Cuff, Haim H. Permuter

Abstract: An arbitrarily varying wiretap channel (AVWTC) with a type constraint on the allowed state sequences is considered, and a single-letter characterization of its correlated-random (CR) assisted semantic-security (SS) capacity is derived. The allowed state sequences are the ones in a typical set around a single constraining type. SS is established by showing that the mutual information between the message and the eavesdropper's observations is negligible even when maximized over all message distributions, choices of state sequences and realizations of the CR-code. Both the achievability and the converse proofs of the type constrained coding theorem rely on stronger claims than actually required. The direct part establishes a novel single-letter lower bound on the CR-assisted SS-capacity of an AVWTC with state sequences constrained by any convex and closed set of state probability mass functions. This bound achieves the best known single-letter secrecy rates for a corresponding compound wiretap channel over the same constraint set. In contrast to other single-letter results in the AVWTC literature, this work does not assume the existence of a best channel to the eavesdropper. Instead, SS follows by leveraging the heterogeneous version of the stronger soft-covering lemma and a CR-code reduction argument. Optimality is a consequence of a max-inf upper bound on the CR-assisted SS-capacity of an AVWTC with state sequences constrained to any collection of type-classes. When adjusted to the aforementioned compound WTC, the upper bound simplifies to a max-min structure, thus strengthening the previously best known single-letter upper bound by Liang et al. that has a min-max form. The proof of the upper bound uses a novel distribution coupling argument.

Submitted 18 October, 2016; v1 submitted 14 January, 2016; originally announced January 2016.

arXiv:1601.01286 [pdf, ps, other]

Strong Secrecy for Cooperative Broadcast Channels

Authors: Ziv Goldfeld, Gerhard Kramer, Haim H. Permuter, Paul Cuff

Abstract: A broadcast channel (BC) where the decoders cooperate via a one-sided link is considered. One common and two private messages are transmitted and the private message to the cooperative user should be kept secret from the cooperation-aided user. The secrecy level is measured in terms of strong secrecy, i.e., a vanishing information leakage. An inner bound on the capacity region is derived by using a channel-resolvability-based code that double-bins the codebook of the secret message, and by using a likelihood encoder to choose the transmitted codeword. The inner bound is shown to be tight for semi-deterministic and physically degraded BCs and the results are compared to those of the corresponding BCs without a secrecy constraint. Blackwell and Gaussian BC examples illustrate the impact of secrecy on the rate regions. Unlike the case without secrecy, where sharing information about both private messages via the cooperative link is optimal, our protocol conveys parts of the common and non-confidential messages only. This restriction reduces the transmission rates more than the usual rate loss due to secrecy requirements. An example that illustrates this loss is provided.

Submitted 28 May, 2019; v1 submitted 6 January, 2016; originally announced January 2016.

arXiv:1509.03619 [pdf, ps, other]

Semantic-Security Capacity for Wiretap Channels of Type II

Authors: Ziv Goldfeld, Paul Cuff, Haim H. Permuter

Abstract: The secrecy capacity of the type II wiretap channel (WTC II) with a noisy main channel is currently an open problem. Herein its secrecy-capacity is derived and shown to be equal to its semantic-security (SS) capacity. In this setting, the legitimate users communicate via a discrete-memoryless (DM) channel in the presence of an eavesdropper that has perfect access to a subset of its choosing of the transmitted symbols, constrained to a fixed fraction of the blocklength. The secrecy criterion is achieved simultaneously for all possible eavesdropper subset choices. The SS criterion demands negligible mutual information between the message and the eavesdropper's observations even when maximized over all message distributions. A key tool for the achievability proof is a novel and stronger version of Wyner's soft covering lemma. Specifically, a random codebook is shown to achieve the soft-covering phenomenon with high probability. The probability of failure is doubly-exponentially small in the blocklength. Since the combined number of messages and subsets grows only exponentially with the blocklength, SS for the WTC II is established by using the union bound and invoking the stronger soft-covering lemma. The direct proof shows that rates up to the weak-secrecy capacity of the classic WTC with a DM erasure channel (EC) to the eavesdropper are achievable. The converse follows by establishing the capacity of this DM wiretap EC as an upper bound for the WTC II. From a broader perspective, the stronger soft-covering lemma constitutes a tool for showing the existence of codebooks that satisfy exponentially many constraints, a beneficial ability for many other applications in information theoretic security.

Submitted 17 August, 2016; v1 submitted 11 September, 2015; originally announced September 2015.

Journal ref: IEEE Transactions on Information Theory, Vol. 62, No. 7, July 2016

arXiv:1504.06136 [pdf, ps, other]

doi 10.1109/TIT.2017.2708086

Broadcast Channels with Privacy Leakage Constraints

Authors: Ziv Goldfeld, Gerhard Kramer, Haim H. Permuter

Abstract: The broadcast channel (BC) with one common and two private messages with leakage constraints is studied, where leakage rate refers to the normalized mutual information between a message and a channel symbol string. Each private message is destined for a different user and the leakage rate to the other receiver must satisfy a constraint. This model captures several scenarios concerning secrecy, i.e., when both, either or neither of the private messages are secret. Inner and outer bounds on the leakage-capacity region are derived when the eavesdropper knows the codebook. The inner bound relies on a Marton-like code construction and the likelihood encoder. A Uniform Approximation Lemma is established that states that the marginal distribution induced by the encoder on each of the bins in the Marton codebook is approximately uniform. Without leakage constraints the inner bound recovers Marton's region and the outer bound reduces to the UVW-outer bound. The bounds match for semi-deterministic (SD) and physically degraded (PD) BCs, as well as for BCs with a degraded message set. The leakage-capacity regions of the SD-BC and the BC with a degraded message set recover past results for different secrecy scenarios. A Blackwell BC example illustrates the results and shows how its leakage-capacity region changes from the capacity region without secrecy to the secrecy-capacity regions for different secrecy scenarios.

Submitted 28 May, 2017; v1 submitted 23 April, 2015; originally announced April 2015.

arXiv:1405.7812 [pdf, ps, other]

doi 10.1109/TIT.2016.2533479

Duality of a Source Coding Problem and the Semi-Deterministic Broadcast Channel with Rate-Limited Cooperation

Authors: Ziv Goldfeld, Haim H. Permuter, Gerhard Kramer

Abstract: The Wyner-Ahlswede-Körner (WAK) empirical-coordination problem where the encoders cooperate via a finite-capacity one-sided link is considered. The coordination-capacity region is derived by combining several source coding techniques, such as Wyner-Ziv (WZ) coding, binning and superposition coding. Furthermore, a semi-deterministic (SD) broadcast channel (BC) with one-sided decoder cooperation is considered. Duality principles relating the two problems are presented, and the capacity region for the SD-BC setting is derived. The direct part follows from an achievable region for a general BC that is tight for the SD scenario. A converse is established by using telescoping identities. The SD-BC is shown to be operationally equivalent to a class of relay-BCs (RBCs) and the correspondence between their capacity regions is established. The capacity region of the SD-BC is transformed into an equivalent region that is shown to be dual to the admissible region of the WAK problem in the sense that the information measures defining the corner points of both regions coincide. Achievability and converse proofs for the equivalent region are provided. For the converse, we use a probabilistic construction of auxiliary random variables that depends on the distribution induced by the codebook. Several examples illustrate the results.

Submitted 17 August, 2016; v1 submitted 30 May, 2014; originally announced May 2014.

Journal ref: IEEE Transactions on Information Theory, Vol. 62, No. 5, May 2016

arXiv:1303.7083 [pdf, ps, other]

doi 10.1109/TIT.2014.2346494

The Finite State MAC with Cooperative Encoders and Delayed CSI

Authors: Ziv Goldfeld, Haim H. Permuter, Benjamin M. Zaidel

Abstract: In this paper, we consider the finite-state multiple access channel (MAC) with partially cooperative encoders and delayed channel state information (CSI). Here partial cooperation refers to the communication between the encoders via finite-capacity links. The channel states are assumed to be governed by a Markov process. Full CSI is assumed at the receiver, while at the transmitters, only delayed CSI is available. The capacity region of this channel model is derived by first solving the case of the finite-state MAC with a common message. Achievability for the latter case is established using the notion of strategies; however, we show that optimal codes can be constructed directly over the input alphabet. This results in a single codebook construction that is then leveraged to apply simultaneous joint decoding. Simultaneous decoding is crucial here because it circumvents the need to rely on the capacity region's corner points, a task that becomes increasingly cumbersome with the growth in the number of messages to be sent. The common message result is then used to derive the capacity region for the case with partially cooperating encoders. Next, we apply this general result to the special case of the Gaussian vector MAC with diagonal channel transfer matrices, which is suitable for modeling, e.g., orthogonal frequency division multiplexing (OFDM)-based communication systems. The capacity region of the Gaussian channel is presented in terms of a convex optimization problem that can be solved efficiently using numerical tools.
The region is derived by first presenting an outer bound on the general capacity region and then suggesting a specific input distribution that achieves this bound. Finally, numerical results are provided that give valuable insight into the practical implications of optimally using conferencing to maximize the transmission rates.

Submitted 29 January, 2015; v1 submitted 28 March, 2013; originally announced March 2013.

Journal ref: IEEE Transactions on Information Theory, Vol. 60, No. 10, October 2014

Showing 1–50 of 50 results for author: Goldfeld, Z