Machine Learning (stat.ML)

A Review of Barren Plateaus in Variational Quantum Computing
Martin Larocca, Supanut Thanasilp, Samson Wang, Kunal Sharma, Jacob Biamonte, Patrick J. Coles, Lukasz Cincio, Jarrod R. McClean, Zoë Holmes, M. Cerezo
May 03 2024 quant-ph cs.LG stat.ML arXiv:2405.00781v1

@misc{2405.00781, author = {Martin Larocca and Supanut Thanasilp and Samson Wang and Kunal Sharma and Jacob Biamonte and Patrick J.~Coles and Lukasz Cincio and Jarrod R.~McClean and Zoë Holmes and M.~Cerezo}, title = {{A} {R}eview of {B}arren {P}lateaus in {V}ariational {Q}uantum {C}omputing}, year = {2024}, eprint = {2405.00781}, note = {arXiv:2405.00781v1} }
Variational quantum computing offers a flexible computational paradigm with applications in diverse areas. However, a key obstacle to realizing its potential is the Barren Plateau (BP) phenomenon. When a model exhibits a BP, its parameter optimization landscape becomes exponentially flat and featureless as the problem size increases. Importantly, all the moving pieces of an algorithm -- choices of ansatz, initial state, observable, loss function and hardware noise -- can lead to BPs when ill-suited. Due to the significant impact of BPs on trainability, researchers have dedicated considerable effort to developing theoretical and heuristic methods to understand and mitigate their effects. As a result, the study of BPs has become a thriving area of research, influencing and cross-fertilizing other fields such as quantum optimal control, tensor networks, and learning theory. This article provides a comprehensive review of the current understanding of the BP phenomenon.
Quantum Convolutional Neural Networks are (Effectively) Classically Simulable
Pablo Bermejo, Paolo Braccia, Manuel S. Rudolph, Zoë Holmes, Lukasz Cincio, M. Cerezo
Aug 26 2024 quant-ph cs.LG stat.ML arXiv:2408.12739v1

@misc{2408.12739, author = {Pablo Bermejo and Paolo Braccia and Manuel S.~Rudolph and Zoë Holmes and Lukasz Cincio and M.~Cerezo}, title = {{Q}uantum {C}onvolutional {N}eural {N}etworks are ({E}ffectively) {C}lassically {S}imulable}, year = {2024}, eprint = {2408.12739}, note = {arXiv:2408.12739v1} }
Quantum Convolutional Neural Networks (QCNNs) are widely regarded as a promising model for Quantum Machine Learning (QML). In this work we tie their heuristic success to two facts. First, that when randomly initialized, they can only operate on the information encoded in low-bodyness measurements of their input states. And second, that they are commonly benchmarked on "locally-easy" datasets whose states are precisely classifiable by the information encoded in this subspace of low-bodyness observables. We further show that the QCNN's action on this subspace can be efficiently simulated by a classical algorithm equipped with Pauli shadows on the dataset. Indeed, we present a shadow-based simulation of QCNNs on up to $1024$ qubits for phases of matter classification. Our results can then be understood as highlighting a deeper symptom of QML: Models could only be showing heuristic success because they are benchmarked on simple problems, for which their action can be classically simulated. This insight points to the fact that non-trivial datasets are a truly necessary ingredient for moving forward with QML. To finish, we discuss how our results can be extrapolated to classically simulate other architectures.
Learning pure quantum states (almost) without regret
Josep Lumbreras, Mikhail Terekhov, Marco Tomamichel
Jun 27 2024 quant-ph cs.AI cs.LG stat.ML arXiv:2406.18370v1

@misc{2406.18370, author = {Josep Lumbreras and Mikhail Terekhov and Marco Tomamichel}, title = {{L}earning pure quantum states (almost) without regret}, year = {2024}, eprint = {2406.18370}, note = {arXiv:2406.18370v1} }
We initiate the study of quantum state tomography with minimal regret. A learner has sequential oracle access to an unknown pure quantum state, and in each round selects a pure probe state. Regret is incurred if the unknown state is measured orthogonal to this probe, and the learner's goal is to minimise the expected cumulative regret over $T$ rounds. The challenge is to find a balance between the most informative measurements and measurements incurring minimal regret. We show that the cumulative regret scales as $\Theta(\operatorname{polylog} T)$ using a new tomography algorithm based on a median of means least squares estimator. This algorithm employs measurements biased towards the unknown state and produces online estimates that are optimal (up to logarithmic terms) in the number of observed samples.
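The abstract above relies on a median of means least squares estimator. As a rough illustration of the median-of-means idea only, the sketch below estimates a scalar mean; it is not the paper's tomography algorithm, and the function name `median_of_means` and its parameters are assumptions made for this example.

```python
import statistics


def median_of_means(samples, num_blocks):
    """Estimate a mean robustly: average each block, then take the median.

    Illustrative helper (hypothetical name). Samples that do not fit into
    num_blocks equal-size blocks are dropped from the end.
    """
    if num_blocks < 1 or num_blocks > len(samples):
        raise ValueError("num_blocks must be between 1 and len(samples)")
    block_size = len(samples) // num_blocks
    block_means = [
        statistics.fmean(samples[i * block_size:(i + 1) * block_size])
        for i in range(num_blocks)
    ]
    return statistics.median(block_means)


if __name__ == "__main__":
    # One large outlier shifts the plain mean but not the median of block means.
    data = [1.0] * 8 + [1000.0]
    print(statistics.fmean(data), median_of_means(data, num_blocks=3))
```

The design point is that a single corrupted block can move only one block mean, so the median over blocks stays close to the bulk of the data.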
Online learning of quantum processes
Asad Raza, Matthias C. Caro, Jens Eisert, Sumeet Khatri
Jun 07 2024 quant-ph cs.LG stat.ML arXiv:2406.04250v1

@misc{2406.04250, author = {Asad Raza and Matthias C.~Caro and Jens Eisert and Sumeet Khatri}, title = {{O}nline learning of quantum processes}, year = {2024}, eprint = {2406.04250}, note = {arXiv:2406.04250v1} }
Among recent insights into learning quantum states, online learning and shadow tomography procedures are notable for their ability to accurately predict expectation values even of adaptively chosen observables. In contrast to the state case, quantum process learning tasks with a similarly adaptive nature have received little attention. In this work, we investigate online learning tasks for quantum processes. Whereas online learning is infeasible for general quantum channels, we show that channels of bounded gate complexity as well as Pauli channels can be online learned in the regret and mistake-bounded models of online learning. In fact, we can online learn probabilistic mixtures of any exponentially large set of known channels. We also provide a provably sample-efficient shadow tomography procedure for Pauli channels. Our results extend beyond quantum channels to non-Markovian multi-time processes, with favorable regret and mistake bounds, as well as a shadow tomography procedure. We complement our online learning upper bounds with mistake as well as computational lower bounds. On the technical side, we make use of the multiplicative weights update algorithm, classical adaptive data analysis, and Bell sampling, as well as tools from the theory of quantum combs for multi-time quantum processes. Our work initiates a study of online learning for classes of quantum channels and, more generally, non-Markovian quantum processes. Given the importance of online learning for state shadow tomography, this may serve as a step towards quantum channel variants of adaptive shadow tomography.
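The abstract above names the multiplicative weights update algorithm among its technical tools. The sketch below shows the textbook classical version over a finite set of experts with losses in [0, 1]; it is an assumption-laden illustration, not the channel-learning procedure of the paper, and the names `multiplicative_weights`, `loss_rounds` and `eta` are hypothetical.

```python
import math


def multiplicative_weights(loss_rounds, eta):
    """Return the final distribution over experts after exponential updates.

    loss_rounds: list of rounds, each a list with one loss per expert.
    eta: learning rate (assumed positive).
    """
    num_experts = len(loss_rounds[0])
    weights = [1.0] * num_experts
    for losses in loss_rounds:
        # Penalize each expert exponentially in the loss it just incurred.
        weights = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Expert 0 never errs, expert 1 always errs: mass concentrates on expert 0.
    rounds = [[0.0, 1.0]] * 3
    print(multiplicative_weights(rounds, eta=1.0))
```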
On the relation between trainability and dequantization of variational quantum learning models
Elies Gil-Fuster, Casper Gyurik, Adrián Pérez-Salinas, Vedran Dunjko
Jun 12 2024 quant-ph cs.LG stat.ML arXiv:2406.07072v1

@misc{2406.07072, author = {Elies Gil-Fuster and Casper Gyurik and Adrián Pérez-Salinas and Vedran Dunjko}, title = {{O}n the relation between trainability and dequantization of variational quantum learning models}, year = {2024}, eprint = {2406.07072}, note = {arXiv:2406.07072v1} }
The quest for successful variational quantum machine learning (QML) relies on the design of suitable parametrized quantum circuits (PQCs), as analogues to neural networks in classical machine learning. Successful QML models must fulfill the properties of trainability and non-dequantization, among others. Recent works have highlighted an intricate interplay between trainability and dequantization of such models, which is still unresolved. In this work we contribute to this debate from the perspective of machine learning, proving a number of results identifying, among others, when trainability and non-dequantization are not mutually exclusive. We begin by providing a number of new, somewhat broader definitions of the relevant concepts, compared to what is found in other literature, which are operationally motivated and consistent with prior art. With these precise definitions given and motivated, we then study the relation between trainability and dequantization of variational QML. Next, we also discuss the degrees of "variationalness" of QML models, where we distinguish between models like the hardware efficient ansatz and quantum kernel methods. Finally, we introduce recipes for building PQC-based QML models which are both trainable and nondequantizable, and which correspond to different degrees of variationalness. We do not address the practical utility of such models. Our work, however, does point toward a way forward for finding more general constructions, for which finding applications may become feasible.
Concept learning of parameterized quantum models from limited measurements
Beng Yee Gan, Po-Wei Huang, Elies Gil-Fuster, Patrick Rebentrost
Aug 12 2024 quant-ph cs.LG stat.ML arXiv:2408.05116v1

@misc{2408.05116, author = {Beng Yee Gan and Po-Wei Huang and Elies Gil-Fuster and Patrick Rebentrost}, title = {{C}oncept learning of parameterized quantum models from limited measurements}, year = {2024}, eprint = {2408.05116}, note = {arXiv:2408.05116v1} }
Classical learning of the expectation values of observables for quantum states is a natural variant of learning quantum states or channels. While learning-theoretic frameworks establish the sample complexity and the number of measurement shots per sample required for learning such statistical quantities, the interplay between these two variables has not been adequately quantified before. In this work, we take the probabilistic nature of quantum measurements into account in classical modelling and discuss these quantities under a single unified learning framework. We provide provable guarantees for learning parameterized quantum models that also quantify the asymmetrical effects and interplay of the two variables on the performance of learning algorithms. These results show that while increasing the sample size enhances the learning performance of classical machines, even with single-shot estimates, the improvements from increasing measurements become asymptotically trivial beyond a constant factor. We further apply our framework and theoretical guarantees to study the impact of measurement noise on the classical surrogation of parameterized quantum circuit models. Our work provides new tools to analyse the operational influence of finite measurement noise in the classical learning of quantum systems.
Learning topological states from randomized measurements using variational tensor network tomography
Yanting Teng, Rhine Samajdar, Katherine Van Kirk, Frederik Wilde, Subir Sachdev, Jens Eisert, Ryan Sweke, Khadijeh Najafi
Jun 04 2024 quant-ph cond-mat.str-el stat.ML arXiv:2406.00193v3

@misc{2406.00193, author = {Yanting Teng and Rhine Samajdar and Katherine Van Kirk and Frederik Wilde and Subir Sachdev and Jens Eisert and Ryan Sweke and Khadijeh Najafi}, title = {{L}earning topological states from randomized measurements using variational tensor network tomography}, year = {2024}, eprint = {2406.00193}, note = {arXiv:2406.00193v3} }
Learning faithful representations of quantum states is crucial to fully characterizing the variety of many-body states created on quantum processors. While various tomographic methods such as classical shadow and MPS tomography have shown promise in characterizing a wide class of quantum states, they face unique limitations in detecting topologically ordered two-dimensional states. To address this problem, we implement and study a heuristic tomographic method that combines variational optimization on tensor networks with randomized measurement techniques. Using this approach, we demonstrate its ability to learn the ground state of the surface code Hamiltonian as well as an experimentally realizable quantum spin liquid state. In particular, we perform numerical experiments using MPS ansätze and systematically investigate the sample complexity required to achieve high fidelities for systems of sizes up to $48$ qubits. In addition, we provide theoretical insights into the scaling of our learning algorithm by analyzing the statistical properties of maximum likelihood estimation. Notably, our method is sample-efficient and experimentally friendly, only requiring snapshots of the quantum state measured randomly in the $X$ or $Z$ bases. Using this subset of measurements, our approach can effectively learn any real pure states represented by tensor networks, and we rigorously prove that random-$XZ$ measurements are tomographically complete for such states.
Everything that can be learned about a causal structure with latent variables by observational and interventional probing schemes
Marina Maciel Ansanelli, Elie Wolfe, Robert W. Spekkens
Jul 03 2024 stat.ML cs.LG quant-ph arXiv:2407.01686v1

@misc{2407.01686, author = {Marina Maciel Ansanelli and Elie Wolfe and Robert W.~Spekkens}, title = {{E}verything that can be learned about a causal structure with latent variables by observational and interventional probing schemes}, year = {2024}, eprint = {2407.01686}, note = {arXiv:2407.01686v1} }
What types of differences among causal structures with latent variables are impossible to distinguish by statistical data obtained by probing each visible variable? If the probing scheme is simply passive observation, then it is well-known that many different causal structures can realize the same joint probability distributions. Even for the simplest case of two visible variables, for instance, one cannot distinguish between one variable being a causal parent of the other and the two variables sharing a latent common cause. However, it is possible to distinguish between these two causal structures if we have recourse to more powerful probing schemes, such as the possibility of intervening on one of the variables and observing the other. Herein, we address the question of which causal structures remain indistinguishable even given the most informative types of probing schemes on the visible variables. We find that two causal structures remain indistinguishable if and only if they are both associated with the same mDAG structure (as defined by Evans (2016)). We also consider the question of when one causal structure dominates another in the sense that it can realize all of the joint probability distributions that can be realized by the other using a given probing scheme. (Equivalence of causal structures is the special case of mutual dominance.) Finally, we investigate to what extent one can weaken the probing schemes implemented on the visible variables and still have the same discrimination power as a maximally informative probing scheme.
Contraction of Private Quantum Channels and Private Quantum Hypothesis Testing
Theshani Nuradha, Mark M. Wilde
Jun 28 2024 quant-ph cs.CR cs.IT cs.LG math.IT stat.ML arXiv:2406.18651v1

@misc{2406.18651, author = {Theshani Nuradha and Mark M.~Wilde}, title = {{C}ontraction of {P}rivate {Q}uantum {C}hannels and {P}rivate {Q}uantum {H}ypothesis {T}esting}, year = {2024}, eprint = {2406.18651}, note = {arXiv:2406.18651v1} }
A quantum generalized divergence by definition satisfies the data-processing inequality; as such, the relative decrease in such a divergence under the action of a quantum channel is at most one. This relative decrease is formally known as the contraction coefficient of the channel and the divergence. Interestingly, there exist combinations of channels and divergences for which the contraction coefficient is strictly less than one. Furthermore, understanding the contraction coefficient is fundamental for the study of statistical tasks under privacy constraints. To this end, here we establish upper bounds on contraction coefficients for the hockey-stick divergence under privacy constraints, where privacy is quantified with respect to the quantum local differential privacy (QLDP) framework, and we fully characterize the contraction coefficient for the trace distance under privacy constraints. With the machinery developed, we also determine an upper bound on the contraction of both the Bures distance and quantum relative entropy relative to the normalized trace distance, under QLDP constraints. Next, we apply our findings to establish bounds on the sample complexity of quantum hypothesis testing under privacy constraints. Furthermore, we study various scenarios in which the sample complexity bounds are tight, while providing order-optimal quantum channels that achieve those bounds. Lastly, we show how private quantum channels provide fairness and Holevo information stability in quantum learning settings.
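To make the notion of contraction in the abstract above concrete, the following sketch computes the hockey-stick divergence for classical probability vectors, i.e. the commuting special case $E_\gamma(P\|Q) = \sum_i \max(p_i - \gamma q_i, 0)$, before and after a noisy channel. The binary symmetric channel and the names `hockey_stick` and `apply_channel` are assumptions chosen for illustration; the paper itself treats quantum channels under QLDP constraints.

```python
def hockey_stick(p, q, gamma=1.0):
    """Classical hockey-stick divergence: sum of positive parts of p - gamma * q."""
    return sum(max(pi - gamma * qi, 0.0) for pi, qi in zip(p, q))


def apply_channel(channel, p):
    """Push distribution p through a row-stochastic matrix.

    channel[i][j] is the probability of output j given input i.
    """
    num_outputs = len(channel[0])
    return [
        sum(p[i] * channel[i][j] for i in range(len(p)))
        for j in range(num_outputs)
    ]


if __name__ == "__main__":
    # Binary symmetric channel with flip probability 0.25 (illustrative choice).
    bsc = [[0.75, 0.25], [0.25, 0.75]]
    p, q = [1.0, 0.0], [0.0, 1.0]
    before = hockey_stick(p, q)
    after = hockey_stick(apply_channel(bsc, p), apply_channel(bsc, q))
    print(before, after, after / before)
```

For $\gamma = 1$ the quantity reduces to the total variation distance, and the ratio printed at the end is the relative decrease for this particular input pair, which lower-bounds the contraction coefficient of the channel.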
Exact gradients for linear optics with single photons
Giorgio Facelli, David D. Roberts, Hugo Wallner, Alexander Makarovskiy, Zoë Holmes, William R. Clements
Sep 26 2024 quant-ph stat.ML arXiv:2409.16369v1

@misc{2409.16369, author = {Giorgio Facelli and David D.~Roberts and Hugo Wallner and Alexander Makarovskiy and Zoë Holmes and William R.~Clements}, title = {{E}xact gradients for linear optics with single photons}, year = {2024}, eprint = {2409.16369}, note = {arXiv:2409.16369v1} }
Though parameter shift rules have drastically improved gradient estimation methods for several types of quantum circuits, leading to improved performance in downstream tasks, so far they have not been transferable to linear optics with single photons. In this work, we derive an analytical formula for the gradients in these circuits with respect to phaseshifters via a generalized parameter shift rule, where the number of parameter shifts depends linearly on the total number of photons. Experimentally, this enables access to derivatives in photonic systems without the need for finite difference approximations. Building on this, we propose two strategies through which one can reduce the number of shifts in the expression, and hence reduce the overall sample complexity. Numerically, we show that this generalized parameter-shift rule can converge to the minimum of a cost function with fewer parameter update steps than alternative techniques. We anticipate that this method will open up new avenues to solving optimization problems with photonic systems, as well as provide new techniques for the experimental characterization and control of linear optical systems.
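For context on the abstract above, the sketch below illustrates the standard two-term parameter shift rule for a single-qubit rotation, where the expectation value has the form $\cos\theta$. It is not the generalized rule for linear optics derived in the paper; the toy `expectation` function and the name `parameter_shift_gradient` are assumptions for this example.

```python
import math


def expectation(theta):
    """Toy model: <Z> after applying RX(theta) to |0> equals cos(theta)."""
    return math.cos(theta)


def parameter_shift_gradient(f, theta, shift=math.pi / 2):
    """Exact derivative of a single-frequency sinusoid from two shifted evaluations."""
    return (f(theta + shift) - f(theta - shift)) / (2.0 * math.sin(shift))


if __name__ == "__main__":
    theta = 0.3
    # The shift rule and the analytic derivative -sin(theta) should agree.
    print(parameter_shift_gradient(expectation, theta), -math.sin(theta))
```

Unlike a finite difference, the shift here is large and the result is exact for this functional form, which is why such rules are attractive when each evaluation is a noisy hardware estimate.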
Quantum Machine Learning in Drug Discovery: Applications in Academia and Pharmaceutical Industries
Anthony M. Smaldone, Yu Shee, Gregory W. Kyro, Chuzhi Xu, Nam P. Vu, Rishab Dutta, Marwa H. Farag, Alexey Galda, Sandeep Kumar, Elica Kyoseva, Victor S. Batista
Sep 25 2024 quant-ph stat.ML arXiv:2409.15645v1

@misc{2409.15645, author = {Anthony M.~Smaldone and Yu Shee and Gregory W.~Kyro and Chuzhi Xu and Nam P.~Vu and Rishab Dutta and Marwa H.~Farag and Alexey Galda and Sandeep Kumar and Elica Kyoseva and Victor S.~Batista}, title = {{Q}uantum {M}achine {L}earning in {D}rug {D}iscovery: {A}pplications in {A}cademia and {P}harmaceutical {I}ndustries}, year = {2024}, eprint = {2409.15645}, note = {arXiv:2409.15645v1} }
The nexus of quantum computing and machine learning - quantum machine learning - offers the potential for significant advancements in chemistry. This review specifically explores the potential of quantum neural networks on gate-based quantum computers within the context of drug discovery. We discuss the theoretical foundations of quantum machine learning, including data encoding, variational quantum circuits, and hybrid quantum-classical approaches. Applications to drug discovery are highlighted, including molecular property prediction and molecular generation. We provide a balanced perspective, emphasizing both the potential benefits and the challenges that must be addressed.
Inference, interference and invariance: How the Quantum Fourier Transform can help to learn from data
David Wakeham, Maria Schuld
Sep 04 2024 quant-ph stat.ML arXiv:2409.00172v1

@misc{2409.00172, author = {David Wakeham and Maria Schuld}, title = {{I}nference, interference and invariance: {H}ow the {Q}uantum {F}ourier {T}ransform can help to learn from data}, year = {2024}, eprint = {2409.00172}, note = {arXiv:2409.00172v1} }
How can we take inspiration from a typical quantum algorithm to design heuristics for machine learning? A common blueprint, used from Deutsch-Jozsa to Shor's algorithm, is to place labeled information in superposition via an oracle, interfere in Fourier space, and measure. In this paper, we want to understand how this interference strategy can be used for inference, i.e., to generalize from finite data samples to a ground truth. Our investigative framework is built around the Hidden Subgroup Problem (HSP), which we transform into a learning task by replacing the oracle with classical training data. The standard quantum algorithm for solving the HSP uses the Quantum Fourier Transform to expose an invariant subspace, i.e., a subset of Hilbert space in which the hidden symmetry is manifest. Based on this insight, we propose an inference principle that "compares" the data to this invariant subspace, and suggest a concrete implementation via overlaps of quantum states. We hope that this leads to well-motivated quantum heuristics that can leverage symmetries for machine learning applications.
KAN: Kolmogorov-Arnold Networks
Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y. Hou, Max Tegmark
May 01 2024 cs.LG cond-mat.dis-nn cs.AI stat.ML arXiv:2404.19756v4

@misc{2404.19756, author = {Ziming Liu and Yixuan Wang and Sachin Vaidya and Fabian Ruehle and James Halverson and Marin Soljačić and Thomas Y.~Hou and Max Tegmark}, title = {{KAN}: {K}olmogorov-{A}rnold {N}etworks}, year = {2024}, eprint = {2404.19756}, note = {arXiv:2404.19756v4} }
Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs have fixed activation functions on nodes ("neurons"), KANs have learnable activation functions on edges ("weights"). KANs have no linear weights at all -- every weight parameter is replaced by a univariate function parametrized as a spline. We show that this seemingly simple change makes KANs outperform MLPs in terms of accuracy and interpretability. For accuracy, much smaller KANs can achieve comparable or better accuracy than much larger MLPs in data fitting and PDE solving. Theoretically and empirically, KANs possess faster neural scaling laws than MLPs. For interpretability, KANs can be intuitively visualized and can easily interact with human users. Through two examples in mathematics and physics, KANs are shown to be useful collaborators helping scientists (re)discover mathematical and physical laws. In summary, KANs are promising alternatives for MLPs, opening opportunities for further improving today's deep learning models which rely heavily on MLPs.
How Discrete and Continuous Diffusion Meet: Comprehensive Analysis of Discrete Diffusion Models via a Stochastic Integral Framework
Yinuo Ren, Haoxuan Chen, Grant M. Rotskoff, Lexing Ying
Oct 07 2024 cs.LG cs.NA math.NA stat.ML arXiv:2410.03601v1

@misc{2410.03601, author = {Yinuo Ren and Haoxuan Chen and Grant M.~Rotskoff and Lexing Ying}, title = {{H}ow {D}iscrete and {C}ontinuous {D}iffusion {M}eet: {C}omprehensive {A}nalysis of {D}iscrete {D}iffusion {M}odels via a {S}tochastic {I}ntegral {F}ramework}, year = {2024}, eprint = {2410.03601}, note = {arXiv:2410.03601v1} }
Discrete diffusion models have gained increasing attention for their ability to model complex distributions with tractable sampling and inference. However, the error analysis for discrete diffusion models remains less well-understood. In this work, we propose a comprehensive framework for the error analysis of discrete diffusion models based on Lévy-type stochastic integrals. By generalizing the Poisson random measure to that with a time-independent and state-dependent intensity, we rigorously establish a stochastic integral formulation of discrete diffusion models and provide the corresponding change of measure theorems that are intriguingly analogous to Itô integrals and Girsanov's theorem for their continuous counterparts. Our framework unifies and strengthens the current theoretical results on discrete diffusion models and obtains the first error bound for the $\tau$-leaping scheme in KL divergence. With error sources clearly identified, our analysis gives new insight into the mathematical properties of discrete diffusion models and offers guidance for the design of efficient and accurate algorithms for real-world discrete diffusion model applications.
On the Hardness of Learning One Hidden Layer Neural Networks
Shuchen Li, Ilias Zadik, Manolis Zampetakis
Oct 07 2024 cs.LG cs.CC math.ST stat.ML stat.TH arXiv:2410.03477v1

@misc{2410.03477, author = {Shuchen Li and Ilias Zadik and Manolis Zampetakis}, title = {{O}n the {H}ardness of {L}earning {O}ne {H}idden {L}ayer {N}eural {N}etworks}, year = {2024}, eprint = {2410.03477}, note = {arXiv:2410.03477v1} }
In this work, we consider the problem of learning one hidden layer ReLU neural networks with inputs from $\mathbb{R}^d$. We show that this learning problem is hard under standard cryptographic assumptions even when: (1) the size of the neural network is polynomial in $d$, (2) its input distribution is a standard Gaussian, and (3) the noise is Gaussian and polynomially small in $d$. Our hardness result is based on the hardness of the Continuous Learning with Errors (CLWE) problem, and in particular, is based on the largely believed worst-case hardness of approximately solving the shortest vector problem up to a multiplicative polynomial factor.
A Statistical Viewpoint on Differential Privacy: Hypothesis Testing, Representation and Blackwell's Theorem
Weijie J. Su
Sep 17 2024 cs.CR cs.LG math.ST stat.ML stat.TH arXiv:2409.09558v1

@misc{2409.09558, author = {Weijie J.~Su}, title = {{A} {S}tatistical {V}iewpoint on {D}ifferential {P}rivacy: {H}ypothesis {T}esting, {R}epresentation and {B}lackwell's {T}heorem}, year = {2024}, eprint = {2409.09558}, note = {arXiv:2409.09558v1} }
Differential privacy is widely considered the formal privacy for privacy-preserving data analysis due to its robust and rigorous guarantees, with increasingly broad adoption in public services, academia, and industry. Despite originating in the cryptographic context, in this review paper we argue that, fundamentally, differential privacy can be considered a pure statistical concept. By leveraging a theorem due to David Blackwell, our focus is to demonstrate that the definition of differential privacy can be formally motivated from a hypothesis testing perspective, thereby showing that hypothesis testing is not merely convenient but also the right language for reasoning about differential privacy. This insight leads to the definition of $f$-differential privacy, which extends other differential privacy definitions through a representation theorem. We review techniques that render $f$-differential privacy a unified framework for analyzing privacy bounds in data analysis and machine learning. Applications of this differential privacy definition to private deep learning, private convex optimization, shuffled mechanisms, and U.S.~Census data are discussed to highlight the benefits of analyzing privacy bounds under this framework compared to existing alternatives.
Learning to Classify Quantum Phases of Matter with a Few Measurements
Mehran Khosrojerdi, Jason L. Pereira, Alessandro Cuccoli, Leonardo Banchi
Sep 10 2024 quant-ph cond-mat.other cond-mat.stat-mech cs.LG stat.ML arXiv:2409.05188v1

@misc{2409.05188, author = {Mehran Khosrojerdi and Jason L.~Pereira and Alessandro Cuccoli and Leonardo Banchi}, title = {{L}earning to {C}lassify {Q}uantum {P}hases of {M}atter with a {F}ew {M}easurements}, year = {2024}, eprint = {2409.05188}, note = {arXiv:2409.05188v1} }
We study the identification of quantum phases of matter, at zero temperature, when only part of the phase diagram is known in advance. Following a supervised learning approach, we show how to use our previous knowledge to construct an observable capable of classifying the phase even in the unknown region. By using a combination of classical and quantum techniques, such as tensor networks, kernel methods, generalization bounds, quantum algorithms, and shadow estimators, we show that, in some cases, the certification of new ground states can be obtained with a polynomial number of measurements. An important application of our findings is the classification of the phases of matter obtained in quantum simulators, e.g., cold atom experiments, capable of efficiently preparing ground states of complex many-particle systems and applying simple measurements, e.g., single qubit measurements, but unable to perform a universal set of gates.
A Sharp Convergence Theory for The Probability Flow ODEs of Diffusion Models
Gen Li, Yuting Wei, Yuejie Chi, Yuxin Chen
Aug 06 2024 cs.LG cs.NA eess.SP math.NA math.ST stat.ML stat.TH arXiv:2408.02320v1

@misc{2408.02320, author = {Gen Li and Yuting Wei and Yuejie Chi and Yuxin Chen}, title = {{A} {S}harp {C}onvergence {T}heory for {T}he {P}robability {F}low {ODE}s of {D}iffusion {M}odels}, year = {2024}, eprint = {2408.02320}, note = {arXiv:2408.02320v1} }
Diffusion models, which convert noise into new data instances by learning to reverse a diffusion process, have become a cornerstone in contemporary generative modeling. In this work, we develop non-asymptotic convergence theory for a popular diffusion-based sampler (i.e., the probability flow ODE sampler) in discrete time, assuming access to $\ell_2$-accurate estimates of the (Stein) score functions. For distributions in $\mathbb{R}^d$, we prove that $d/\varepsilon$ iterations -- modulo some logarithmic and lower-order terms -- are sufficient to approximate the target distribution to within $\varepsilon$ total-variation distance. This is the first result establishing nearly linear dimension-dependency (in $d$) for the probability flow ODE sampler. Imposing only minimal assumptions on the target data distribution (e.g., no smoothness assumption is imposed), our results also characterize how $\ell_2$ score estimation errors affect the quality of the data generation processes. In contrast to prior works, our theory is developed based on an elementary yet versatile non-asymptotic approach without the need of resorting to SDE and ODE toolboxes.
Quantum Curriculum Learning
Quoc Hoan Tran, Yasuhiro Endo, Hirotaka Oshima
Jul 03 2024 quant-ph cs.LG stat.ML arXiv:2407.02419v2

@misc{2407.02419, author = {Quoc Hoan Tran and Yasuhiro Endo and Hirotaka Oshima}, title = {{Q}uantum {C}urriculum {L}earning}, year = {2024}, eprint = {2407.02419}, note = {arXiv:2407.02419v2} }
Quantum machine learning (QML) requires significant quantum resources to achieve quantum advantage. Research should prioritize both the efficient design of quantum architectures and the development of learning strategies to optimize resource usage. We propose a framework called quantum curriculum learning (Q-CurL) for quantum data, where the curriculum introduces simpler tasks or data to the learning model before progressing to more challenging ones. We define the curriculum criteria based on the data density ratio between tasks to determine the curriculum order. We also implement a dynamic learning schedule to emphasize the significance of quantum data in optimizing the loss function. Empirical evidence shows that Q-CurL significantly enhances the training convergence and the generalization for unitary learning tasks and improves the robustness of quantum phase recognition tasks. Our framework provides a general learning strategy, bringing QML closer to realizing practical advantages.
xLSTM: Extended Long Short-Term Memory
Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter
May 08 2024 cs.LG cs.AI stat.ML arXiv:2405.04517v1

@misc{2405.04517, author = {Maximilian Beck and Korbinian Pöppel and Markus Spanring and Andreas Auer and Oleksandra Prudnikova and Michael Kopp and Günter Klambauer and Johannes Brandstetter and Sepp Hochreiter}, title = {x{LSTM}: {E}xtended {L}ong {S}hort-{T}erm {M}emory}, year = {2024}, eprint = {2405.04517}, note = {arXiv:2405.04517v1} }
In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories, in particular they constituted the first Large Language Models (LLMs). However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule. Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.
Learning Mixtures of Gaussians Using Diffusion Models
Khashayar Gatmiry, Jonathan Kelner, Holden Lee
Apr 30 2024 cs.LG cs.DS math.PR math.ST stat.ML stat.TH arXiv:2404.18869v1

@misc{2404.18869, author = {Khashayar Gatmiry and Jonathan Kelner and Holden Lee}, title = {{L}earning {M}ixtures of {G}aussians {U}sing {D}iffusion {M}odels}, year = {2024}, eprint = {2404.18869}, note = {arXiv:2404.18869v1} }
We give a new algorithm for learning mixtures of $k$ Gaussians (with identity covariance in $\mathbb{R}^n$) to TV error $\varepsilon$, with quasi-polynomial ($O(n^{\text{poly log}\left(\frac{n+k}{\varepsilon}\right)})$) time and sample complexity, under a minimum weight assumption. Unlike previous approaches, most of which are algebraic in nature, our approach is analytic and relies on the framework of diffusion models. Diffusion models are a modern paradigm for generative modeling, which typically rely on learning the score function (gradient log-pdf) along a process transforming a pure noise distribution, in our case a Gaussian, to the data distribution. Despite their dazzling performance in tasks such as image generation, there are few end-to-end theoretical guarantees that they can efficiently learn nontrivial families of distributions; we give some of the first such guarantees. We proceed by deriving higher-order Gaussian noise sensitivity bounds for the score functions for a Gaussian mixture to show that they can be inductively learned using piecewise polynomial regression (up to poly-logarithmic degree), and combine this with known convergence results for diffusion models. Our results extend to continuous mixtures of Gaussians where the mixing distribution is supported on a union of $k$ balls of constant radius. In particular, this applies to the case of Gaussian convolutions of distributions on low-dimensional manifolds, or more generally sets with small covering number.
On the VC dimension of deep group convolutional neural networks
Anna Sepliarskaia, Sophie Langer, Johannes Schmidt-Hieber
Oct 22 2024 cs.LG math.ST stat.ML stat.TH arXiv:2410.15800v1

@misc{2410.15800, author = {Anna Sepliarskaia and Sophie Langer and Johannes Schmidt-Hieber}, title = {{O}n the {VC} dimension of deep group convolutional neural networks}, year = {2024}, eprint = {2410.15800}, note = {arXiv:2410.15800v1} }
We study the generalization capabilities of Group Convolutional Neural Networks (GCNNs) with ReLU activation function by deriving upper and lower bounds for their Vapnik-Chervonenkis (VC) dimension. Specifically, we analyze how factors such as the number of layers, weights, and input dimension affect the VC dimension. We further compare the derived bounds to those known for other types of neural networks. Our findings extend previous results on the VC dimension of continuous GCNNs with two layers, thereby providing new insights into the generalization properties of GCNNs, particularly regarding the dependence on the input resolution of the data.
Adaptive Batch Size for Privately Finding Second-Order Stationary Points
Daogao Liu, Kunal Talwar
Oct 11 2024 cs.LG cs.CR cs.DS stat.ML arXiv:2410.07502v1

@misc{2410.07502, author = {Daogao Liu and Kunal Talwar}, title = {{A}daptive {B}atch {S}ize for {P}rivately {F}inding {S}econd-{O}rder {S}tationary {P}oints}, year = {2024}, eprint = {2410.07502}, note = {arXiv:2410.07502v1} }
There is a gap between finding a first-order stationary point (FOSP) and a second-order stationary point (SOSP) under differential privacy constraints, and it remains unclear whether privately finding an SOSP is more challenging than finding an FOSP. Specifically, Ganesh et al. (2023) demonstrated that an $\alpha$-SOSP can be found with $\alpha=O(\frac{1}{n^{1/3}}+(\frac{\sqrt{d}}{n\epsilon})^{3/7})$, where $n$ is the dataset size, $d$ is the dimension, and $\epsilon$ is the differential privacy parameter. Building on the SpiderBoost algorithm framework, we propose a new approach that uses adaptive batch sizes and incorporates the binary tree mechanism. Our method improves the results for privately finding an SOSP, achieving $\alpha=O(\frac{1}{n^{1/3}}+(\frac{\sqrt{d}}{n\epsilon})^{1/2})$. This improved bound matches the state-of-the-art for finding an FOSP, suggesting that privately finding an SOSP may be achievable at no additional cost.
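To make the two rates in this abstract easier to compare, here is a minimal Python sketch that evaluates both expressions with the constants hidden by $O(\cdot)$ dropped. The function name `sosp_rate` and the sample values of $n$, $d$, and $\epsilon$ are assumptions chosen for illustration, not figures from the paper.

```python
import math

def sosp_rate(n, d, eps, exponent):
    """Evaluate n^(-1/3) + (sqrt(d) / (n * eps))^exponent, ignoring O(.) constants."""
    return n ** (-1 / 3) + (math.sqrt(d) / (n * eps)) ** exponent

# Hypothetical problem size, for illustration only.
n, d, eps = 10**6, 100, 1.0
prior_rate = sosp_rate(n, d, eps, 3 / 7)  # exponent from the earlier bound
new_rate = sosp_rate(n, d, eps, 1 / 2)    # exponent from the improved bound
print(prior_rate, new_rate)
```

Whenever $\sqrt{d}/(n\epsilon) < 1$, raising it to the power $1/2$ instead of $3/7$ gives a smaller second term, which is the sense in which the new bound is tighter.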
Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective
Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, Tengyu Ma
Oct 08 2024 cs.LG cs.CL stat.ML arXiv:2410.05192v1

@misc{2410.05192, author = {Kaiyue Wen and Zhiyuan Li and Jason Wang and David Hall and Percy Liang and Tengyu Ma}, title = {{U}nderstanding {W}armup-{S}table-{D}ecay {L}earning {R}ates: {A} {R}iver {V}alley {L}oss {L}andscape {P}erspective}, year = {2024}, eprint = {2410.05192}, note = {arXiv:2410.05192v1} }
Training language models currently requires pre-determining a fixed compute budget because the typical cosine learning rate schedule depends on the total number of steps. In contrast, the Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can in principle continue indefinitely without a pre-specified compute budget. Then, given any compute budget, one can branch out from the main branch at a proper time with a rapidly decaying learning rate to produce a strong model. Empirically, WSD generates a non-traditional loss curve: the loss remains elevated during the stable phase but sharply declines during the decay phase. Towards explaining this phenomenon, we conjecture that pretraining loss exhibits a river valley landscape, which resembles a deep valley with a river at its bottom. Under this assumption, we show that during the stable phase, the iterate undergoes large oscillations due to the high learning rate, yet it progresses swiftly along the river. During the decay phase, the rapidly dropping learning rate minimizes the iterate's oscillations, moving it closer to the river and revealing true optimization progress. Therefore, the sustained high learning rate phase and fast decaying phase are responsible for progress in the river and the mountain directions respectively, and are both critical. Our analysis predicts phenomena consistent with empirical observations and shows that this landscape can emerge from pretraining on a simple bi-gram dataset. Inspired by the theory, we introduce WSD-S, a variant of WSD that reuses previous checkpoints' decay phases and keeps only one main branch, where we resume from a decayed checkpoint. WSD-S empirically outperforms WSD and Cyclic-Cosine in obtaining multiple language model checkpoints across various compute budgets in a single run for parameters scaling from 0.1B to 1.2B.
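As a rough illustration of the schedule shape described in this abstract, the Python sketch below computes a Warmup-Stable-Decay learning rate for a given step. The function name `wsd_lr`, the linear warmup, the linear decay, and every parameter name are assumptions made for illustration; the abstract does not specify the exact functional forms.

```python
def wsd_lr(step, peak_lr, warmup_steps, decay_start, decay_steps, min_lr=0.0):
    """Hypothetical Warmup-Stable-Decay learning rate at a given step."""
    if step < warmup_steps:
        # Warmup phase: linear ramp up to peak_lr (assumed form).
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: constant rate, so the main branch can run indefinitely.
        return peak_lr
    # Decay phase: rapid linear drop from peak_lr to min_lr (assumed form).
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress

# Branching from the main branch at step 100 with a 20-step decay.
for s in (0, 50, 110, 120):
    print(s, wsd_lr(s, peak_lr=1.0, warmup_steps=10, decay_start=100, decay_steps=20))
```

Because `decay_start` is an argument rather than a function of a fixed total step count, the same stable phase can be branched at different points to produce checkpoints for different compute budgets.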
The Central Role of the Loss Function in Reinforcement Learning
Kaiwen Wang, Nathan Kallus, Wen Sun
Sep 20 2024 stat.ML cs.LG math.ST stat.TH arXiv:2409.12799v1

@misc{2409.12799, author = {Kaiwen Wang and Nathan Kallus and Wen Sun}, title = {{T}he {C}entral {R}ole of the {L}oss {F}unction in {R}einforcement {L}earning}, year = {2024}, eprint = {2409.12799}, note = {arXiv:2409.12799v1} }
This paper illustrates the central role of loss functions in data-driven decision making, providing a comprehensive survey on their influence in cost-sensitive classification (CSC) and reinforcement learning (RL). We demonstrate how different regression loss functions affect the sample efficiency and adaptivity of value-based decision making algorithms. Across multiple settings, we prove that algorithms using the binary cross-entropy loss achieve first-order bounds scaling with the optimal policy's cost and are much more efficient than the commonly used squared loss. Moreover, we prove that distributional algorithms using the maximum likelihood loss achieve second-order bounds scaling with the policy variance and are even sharper than first-order bounds. This in particular proves the benefits of distributional RL. We hope that this paper serves as a guide to analyzing decision making algorithms with varying loss functions, and can inspire the reader to seek out better loss functions to improve any decision making algorithm.
Rényi-infinity constrained sampling with $d^3$ membership queries
Yunbum Kook, Matthew S. Zhang
Jul 19 2024 cs.DS cs.LG math.ST stat.ML stat.TH arXiv:2407.12967v1

@misc{2407.12967, author = {Yunbum Kook and Matthew S.~Zhang}, title = {{R}ényi-infinity constrained sampling with $d^3$ membership queries}, year = {2024}, eprint = {2407.12967}, note = {arXiv:2407.12967v1} }
Uniform sampling over a convex body is a fundamental algorithmic problem, yet the convergence in KL or Rényi divergence of most samplers remains poorly understood. In this work, we propose a constrained proximal sampler, a principled and simple algorithm that possesses elegant convergence guarantees. Leveraging the uniform ergodicity of this sampler, we show that it converges in the Rényi-infinity divergence ($\mathcal R_\infty$) with no query complexity overhead when starting from a warm start. This is the strongest of commonly considered performance metrics, implying rates in $\{\mathcal R_q, \mathsf{KL}\}$ convergence as special cases. By applying this sampler within an annealing scheme, we propose an algorithm which can approximately sample $\varepsilon$-close to the uniform distribution on convex bodies in $\mathcal R_\infty$-divergence with $\widetilde{\mathcal{O}}(d^3\, \text{polylog} \frac{1}{\varepsilon})$ query complexity. This improves on all prior results in $\{\mathcal R_q, \mathsf{KL}\}$-divergences, without resorting to any algorithmic modifications or post-processing of the sample. It also matches the prior best known complexity in total variation distance.
Ramsey Theorems for Trees and a General 'Private Learning Implies Online Learning' Theorem
Simone Fioravanti, Steve Hanneke, Shay Moran, Hilla Schefler, Iska Tsubari
Jul 11 2024 cs.LG cs.CR cs.DS math.CO stat.ML arXiv:2407.07765v2

@misc{2407.07765, author = {Simone Fioravanti and Steve Hanneke and Shay Moran and Hilla Schefler and Iska Tsubari}, title = {{R}amsey {T}heorems for {T}rees and a {G}eneral '{P}rivate {L}earning {I}mplies {O}nline {L}earning' {T}heorem}, year = {2024}, eprint = {2407.07765}, note = {arXiv:2407.07765v2} }
This work continues to investigate the link between differentially private (DP) and online learning. Alon, Livni, Malliaris, and Moran (2019) showed that for binary concept classes, DP learnability of a given class implies that it has a finite Littlestone dimension (equivalently, that it is online learnable). Their proof relies on a model-theoretic result by Hodges (1997), which demonstrates that any binary concept class with a large Littlestone dimension contains a large subclass of thresholds. In a follow-up work, Jung, Kim, and Tewari (2020) extended this proof to multiclass PAC learning with a bounded number of labels. Unfortunately, Hodges's result does not apply in other natural settings such as multiclass PAC learning with an unbounded label space, and PAC learning of partial concept classes. This naturally raises the question of whether DP learnability continues to imply online learnability in more general scenarios: indeed, Alon, Hanneke, Holzman, and Moran (2021) explicitly leave it as an open question in the context of partial concept classes, and the same question is open in the general multiclass setting. In this work, we give a positive answer to these questions showing that for general classification tasks, DP learnability implies online learnability. Our proof reasons directly about Littlestone trees, without relying on thresholds. We achieve this by establishing several Ramsey-type theorems for trees, which might be of independent interest.
Probing the effects of broken symmetries in machine learning
Marcel F. Langer, Sergey N. Pozdnyakov, Michele Ceriotti
Jun 26 2024 physics.chem-ph cs.LG stat.ML arXiv:2406.17747v1

@misc{2406.17747, author = {Marcel F.~Langer and Sergey N.~Pozdnyakov and Michele Ceriotti}, title = {{P}robing the effects of broken symmetries in machine learning}, year = {2024}, eprint = {2406.17747}, note = {arXiv:2406.17747v1} }
Symmetry is one of the most central concepts in physics, and it is no surprise that it has also been widely adopted as an inductive bias for machine-learning models applied to the physical sciences. This is especially true for models targeting the properties of matter at the atomic scale. Both established and state-of-the-art approaches, with almost no exceptions, are built to be exactly equivariant to translations, permutations, and rotations of the atoms. Incorporating symmetries -- rotations in particular -- constrains the model design space and implies more complicated architectures that are often also computationally demanding. There are indications that non-symmetric models can easily learn symmetries from data, and that doing so can even be beneficial for the accuracy of the model. We put a model that obeys rotational invariance only approximately to the test, in realistic scenarios involving simulations of gas-phase, liquid, and solid water. We focus specifically on physical observables that are likely to be affected -- directly or indirectly -- by symmetry breaking, finding negligible consequences when the model is used in an interpolative, bulk, regime. Even for extrapolative gas-phase predictions, the model remains very stable, even though symmetry artifacts are noticeable. We also discuss strategies that can be used to systematically reduce the magnitude of symmetry breaking when it occurs, and assess their impact on the convergence of observables.
Improved Regret Bounds for Bandits with Expert Advice
Nicolò Cesa-Bianchi, Khaled Eldowa, Emmanuel Esposito, Julia Olkhovskaya
Jun 25 2024 cs.LG stat.ML arXiv:2406.16802v1

@misc{2406.16802, author = {Nicolò Cesa-Bianchi and Khaled Eldowa and Emmanuel Esposito and Julia Olkhovskaya}, title = {{I}mproved {R}egret {B}ounds for {B}andits with {E}xpert {A}dvice}, year = {2024}, eprint = {2406.16802}, note = {arXiv:2406.16802v1} }
In this research note, we revisit the bandits with expert advice problem. Under a restricted feedback model, we prove a lower bound of order $\sqrt{K T \ln(N/K)}$ for the worst-case regret, where $K$ is the number of actions, $N>K$ the number of experts, and $T$ the time horizon. This matches a previously known upper bound of the same order and improves upon the best available lower bound of $\sqrt{K T (\ln N) / (\ln K)}$. For the standard feedback model, we prove a new instance-based upper bound that depends on the agreement between the experts and provides a logarithmic improvement compared to prior results.
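For a sense of scale, the Python sketch below evaluates the two lower-bound expressions quoted in this abstract, with all constant factors dropped. The function names and the sample values of $K$, $N$, and $T$ are assumptions for illustration; the comparison is only meaningful for $K \ge 2$ and $N > K$, and the printed numbers describe this one example, not a general ordering.

```python
import math

def new_lower_bound(K, N, T):
    """sqrt(K * T * ln(N / K)), the order of the lower bound proved in the note."""
    return math.sqrt(K * T * math.log(N / K))

def old_lower_bound(K, N, T):
    """sqrt(K * T * ln(N) / ln(K)), the order of the earlier lower bound."""
    return math.sqrt(K * T * math.log(N) / math.log(K))

# Hypothetical instance: 4 actions, 1024 experts, horizon 10000.
K, N, T = 4, 1024, 10_000
print(new_lower_bound(K, N, T), old_lower_bound(K, N, T))
```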
Can Go AIs be adversarially robust?
Tom Tseng, Euan McLean, Kellin Pelrine, Tony T. Wang, Adam Gleave
Jun 19 2024 cs.LG cs.AI stat.ML arXiv:2406.12843v2

@misc{2406.12843, author = {Tom Tseng and Euan McLean and Kellin Pelrine and Tony T.~Wang and Adam Gleave}, title = {{C}an {G}o {AI}s be adversarially robust?}, year = {2024}, eprint = {2406.12843}, note = {arXiv:2406.12843v2} }
Prior work found that superhuman Go AIs can be defeated by simple adversarial strategies, especially "cyclic" attacks. In this paper, we study whether adding natural countermeasures can achieve robustness in Go, a favorable domain for robustness since it benefits from incredible average-case capability and a narrow, innately adversarial setting. We test three defenses: adversarial training on hand-constructed positions, iterated adversarial training, and changing the network architecture. We find that though some of these defenses protect against previously discovered attacks, none withstand freshly trained adversaries. Furthermore, most of the reliably effective attacks these adversaries discover are different realizations of the same overall class of cyclic attacks. Our results suggest that building robust AI systems is challenging even with extremely superhuman systems in some of the most tractable settings, and highlight two key gaps: efficient generalization in defenses, and diversity in training. For interactive examples of attacks and a link to our codebase, see https://goattack.far.ai.
Near-Optimal Learning and Planning in Separated Latent MDPs
Fan Chen, Constantinos Daskalakis, Noah Golowich, Alexander Rakhlin
Jun 13 2024 cs.LG cs.AI cs.CC math.ST stat.ML stat.TH arXiv:2406.07920v1

@misc{2406.07920, author = {Fan Chen and Constantinos Daskalakis and Noah Golowich and Alexander Rakhlin}, title = {{N}ear-{O}ptimal {L}earning and {P}lanning in {S}eparated {L}atent {MDP}s}, year = {2024}, eprint = {2406.07920}, note = {arXiv:2406.07920v1} }
We study computational and statistical aspects of learning Latent Markov Decision Processes (LMDPs). In this model, the learner interacts with an MDP drawn at the beginning of each epoch from an unknown mixture of MDPs. To sidestep known impossibility results, we consider several notions of separation of the constituent MDPs. The main thrust of this paper is in establishing a nearly-sharp *statistical threshold* for the horizon length necessary for efficient learning. On the computational side, we show that under a weaker assumption of separability under the optimal policy, there is a quasi-polynomial algorithm with time complexity scaling in terms of the statistical threshold. We further show a near-matching time complexity lower bound under the exponential time hypothesis.
Replicability in High Dimensional Statistics
Max Hopkins, Russell Impagliazzo, Daniel Kane, Sihan Liu, Christopher Ye
Jun 06 2024 stat.ML cs.CC cs.DS cs.LG arXiv:2406.02628v1

@misc{2406.02628, author = {Max Hopkins and Russell Impagliazzo and Daniel Kane and Sihan Liu and Christopher Ye}, title = {{R}eplicability in {H}igh {D}imensional {S}tatistics}, year = {2024}, eprint = {2406.02628}, note = {arXiv:2406.02628v1} }
The replicability crisis is a major issue across nearly all areas of empirical science, calling for the formal study of replicability in statistics. Motivated in this context, [Impagliazzo, Lei, Pitassi, and Sorrell STOC 2022] introduced the notion of replicable learning algorithms, and gave basic procedures for $1$-dimensional tasks including statistical queries. In this work, we study the computational and statistical cost of replicability for several fundamental high dimensional statistical tasks, including multi-hypothesis testing and mean estimation. Our main contribution establishes a computational and statistical equivalence between optimal replicable algorithms and high dimensional isoperimetric tilings. As a consequence, we obtain matching sample complexity upper and lower bounds for replicable mean estimation of distributions with bounded covariance, resolving an open problem of [Bun, Gaboardi, Hopkins, Impagliazzo, Lei, Pitassi, Sivakumar, and Sorrell, STOC2023] and for the $N$-Coin Problem, resolving a problem of [Karbasi, Velegkas, Yang, and Zhou, NeurIPS2023] up to log factors. While our equivalence is computational, allowing us to shave log factors in sample complexity from the best known efficient algorithms, efficient isoperimetric tilings are not known. To circumvent this, we introduce several relaxed paradigms that do allow for sample and computationally efficient algorithms, including allowing pre-processing, adaptivity, and approximate replicability. In these cases we give efficient algorithms matching or beating the best known sample complexity for mean estimation and the coin problem, including a generic procedure that reduces the standard quadratic overhead of replicability to linear in expectation.
Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit
Jason D. Lee, Kazusato Oko, Taiji Suzuki, Denny Wu
Jun 04 2024 cs.LG stat.ML arXiv:2406.01581v1

@misc{2406.01581, author = {Jason D.~Lee and Kazusato Oko and Taiji Suzuki and Denny Wu}, title = {{N}eural network learns low-dimensional polynomials with {SGD} near the information-theoretic limit}, year = {2024}, eprint = {2406.01581}, note = {arXiv:2406.01581v1} }
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \textstyle\sigma_*\left(\langle\boldsymbol{x},\boldsymbol{\theta}\rangle\right)$ under isotropic Gaussian data in $\mathbb{R}^d$, where the link function $\sigma_*:\mathbb{R}\to\mathbb{R}$ is an unknown degree $q$ polynomial with information exponent $p$ (defined as the lowest degree in the Hermite expansion). Prior works showed that gradient-based training of neural networks can learn this target with $n\gtrsim d^{\Theta(p)}$ samples, and such statistical complexity is predicted to be necessary by the correlational statistical query lower bound. Surprisingly, we prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ of arbitrary polynomial link function with a sample and runtime complexity of $n \asymp T \asymp C(q) \cdot d\,\mathrm{polylog}\, d$, where constant $C(q)$ only depends on the degree of $\sigma_*$, regardless of information exponent; this dimension dependence matches the information theoretic limit up to polylogarithmic factors. Core to our analysis is the reuse of minibatch in the gradient computation, which gives rise to higher-order information beyond correlational queries.
Understanding Memory-Regret Trade-Off for Streaming Stochastic Multi-Armed Bandits
Yuchen He, Zichun Ye, Chihao Zhang
May 31 2024 cs.LG cs.DS stat.ML arXiv:2405.19752v2

@misc{2405.19752, author = {Yuchen He and Zichun Ye and Chihao Zhang}, title = {{U}nderstanding {M}emory-{R}egret {T}rade-{O}ff for {S}treaming {S}tochastic {M}ulti-{A}rmed {B}andits}, year = {2024}, eprint = {2405.19752}, note = {arXiv:2405.19752v2} }
We study the stochastic multi-armed bandit problem in the $P$-pass streaming model. In this problem, the $n$ arms are present in a stream and at most $m<n$ arms and their statistics can be stored in the memory. We give a complete characterization of the optimal regret in terms of $m, n$ and $P$. Specifically, we design an algorithm with $\tilde O\left((n-m)^{1+\frac{2^{P}-2}{2^{P+1}-1}} n^{\frac{2-2^{P+1}}{2^{P+1}-1}} T^{\frac{2^P}{2^{P+1}-1}}\right)$ regret and complement it with an $\tilde \Omega\left((n-m)^{1+\frac{2^{P}-2}{2^{P+1}-1}} n^{\frac{2-2^{P+1}}{2^{P+1}-1}} T^{\frac{2^P}{2^{P+1}-1}}\right)$ lower bound when the number of rounds $T$ is sufficiently large. Our results are tight up to a logarithmic factor in $n$ and $P$.
Efficient Certificates of Anti-Concentration Beyond Gaussians
Ainesh Bakshi, Pravesh Kothari, Goutham Rajendran, Madhur Tulsiani, Aravindan Vijayaraghavan
May 27 2024 cs.DS cs.LG stat.ML arXiv:2405.15084v1

@misc{2405.15084, author = {Ainesh Bakshi and Pravesh Kothari and Goutham Rajendran and Madhur Tulsiani and Aravindan Vijayaraghavan}, title = {{E}fficient {C}ertificates of {A}nti-{C}oncentration {B}eyond {G}aussians}, year = {2024}, eprint = {2405.15084}, note = {arXiv:2405.15084v1} }
A set of high dimensional points $X=\{x_1, x_2,\ldots, x_n\} \subset \mathbb{R}^d$ in isotropic position is said to be $\delta$-anti-concentrated if for every direction $v$, the fraction of points in $X$ satisfying $|\langle x_i,v \rangle |\leq \delta$ is at most $O(\delta)$. Motivated by applications to list-decodable learning and clustering, recent works have considered the problem of constructing efficient certificates of anti-concentration in the average case, when the set of points $X$ corresponds to samples from a Gaussian distribution. Their certificates played a crucial role in several subsequent works in algorithmic robust statistics on list-decodable learning and settling the robust learnability of arbitrary Gaussian mixtures, yet remain limited to rotationally invariant distributions. This work presents a new (and arguably the most natural) formulation for anti-concentration. Using this formulation, we give quasi-polynomial time verifiable sum-of-squares certificates of anti-concentration that hold for a wide class of non-Gaussian distributions including anti-concentrated bounded product distributions and uniform distributions over $L_p$ balls (and their affine transformations). Consequently, our method upgrades and extends results in algorithmic robust statistics, e.g., list-decodable learning and clustering, to such distributions. Our approach constructs a canonical integer program for anti-concentration and analyzes a sum-of-squares relaxation of it, independent of the intended application. We rely on duality and analyze a pseudo-expectation on large subsets of the input points that take a small value in some direction. Our analysis uses the method of polynomial reweightings to reduce the problem to analyzing only analytically dense or sparse directions.
Constrained Exploration via Reflected Replica Exchange Stochastic Gradient Langevin Dynamics
Haoyang Zheng, Hengrong Du, Qi Feng, Wei Deng, Guang Lin
May 14 2024 cs.LG cs.AI stat.ML arXiv:2405.07839v2

@misc{2405.07839, author = {Haoyang Zheng and Hengrong Du and Qi Feng and Wei Deng and Guang Lin}, title = {{C}onstrained {E}xploration via {R}eflected {R}eplica {E}xchange {S}tochastic {G}radient {L}angevin {D}ynamics}, year = {2024}, eprint = {2405.07839}, note = {arXiv:2405.07839v2} }
Replica exchange stochastic gradient Langevin dynamics (reSGLD) is an effective sampler for non-convex learning in large-scale datasets. However, the simulation may encounter stagnation issues when the high-temperature chain delves too deeply into the distribution tails. To tackle this issue, we propose reflected reSGLD (r2SGLD): an algorithm tailored for constrained non-convex exploration by utilizing reflection steps within a bounded domain. Theoretically, we observe that reducing the diameter of the domain enhances mixing rates, exhibiting a $\textit{quadratic}$ behavior. Empirically, we test its performance through extensive experiments, including identifying dynamical systems with physical constraints, simulations of constrained multi-modal distributions, and image classification tasks. The theoretical and empirical findings highlight the crucial role of constrained exploration in improving the simulation efficiency.
Stochastic Bandits with ReLU Neural Networks
Kan Xu, Hamsa Bastani, Surbhi Goel, Osbert Bastani
May 14 2024 cs.LG cs.DS stat.ML arXiv:2405.07331v1

@misc{2405.07331, author = {Kan Xu and Hamsa Bastani and Surbhi Goel and Osbert Bastani}, title = {{S}tochastic {B}andits with {R}e{LU} {N}eural {N}etworks}, year = {2024}, eprint = {2405.07331}, note = {arXiv:2405.07331v1} }
We study the stochastic bandit problem with ReLU neural network structure. We show that a $\tilde{O}(\sqrt{T})$ regret guarantee is achievable by considering bandits with one-layer ReLU neural networks; to the best of our knowledge, our work is the first to achieve such a guarantee. In this specific setting, we propose an OFU-ReLU algorithm that can achieve this upper bound. The algorithm first explores randomly until it reaches a linear regime, and then implements a UCB-type linear bandit algorithm to balance exploration and exploitation. Our key insight is that we can exploit the piecewise linear structure of ReLU activations and convert the problem into a linear bandit in a transformed feature space, once we learn the parameters of ReLU relatively accurately during the exploration stage. To remove dependence on model parameters, we design an OFU-ReLU+ algorithm based on a batching strategy, which can provide the same theoretical guarantee.
Wilsonian Renormalization of Neural Network Gaussian Processes
Jessica N. Howard, Ro Jefferson, Anindita Maiti, Zohar Ringel
May 13 2024 cs.LG cond-mat.dis-nn hep-th stat.ML arXiv:2405.06008v2

@misc{2405.06008, author = {Jessica N.~Howard and Ro Jefferson and Anindita Maiti and Zohar Ringel}, title = {{W}ilsonian {R}enormalization of {N}eural {N}etwork {G}aussian {P}rocesses}, year = {2024}, eprint = {2405.06008}, note = {arXiv:2405.06008v2} }
Separating relevant and irrelevant information is key to any modeling process or scientific inquiry. Theoretical physics offers a powerful tool for achieving this in the form of the renormalization group (RG). Here we demonstrate a practical approach to performing Wilsonian RG in the context of Gaussian Process (GP) Regression. We systematically integrate out the unlearnable modes of the GP kernel, thereby obtaining an RG flow of the GP in which the data sets the IR scale. In simple cases, this results in a universal flow of the ridge parameter, which becomes input-dependent in the richer scenario in which non-Gaussianities are included. In addition to being analytically tractable, this approach goes beyond structural analogies between RG and neural networks by providing a natural connection between RG flow and learnable vs. unlearnable modes. Studying such flows may improve our understanding of feature learning in deep neural networks, and enable us to identify potential universality classes in these models.
In-and-Out: Algorithmic Diffusion for Sampling Convex Bodies
Yunbum Kook, Santosh S. Vempala, Matthew S. Zhang
May 03 2024 cs.DS cs.LG math.ST stat.ML stat.TH arXiv:2405.01425v1

@misc{2405.01425, author = {Yunbum Kook and Santosh S.~Vempala and Matthew S.~Zhang}, title = {{I}n-and-{O}ut: {A}lgorithmic {D}iffusion for {S}ampling {C}onvex {B}odies}, year = {2024}, eprint = {2405.01425}, note = {arXiv:2405.01425v1} }
We present a new random walk for uniformly sampling high-dimensional convex bodies. It achieves state-of-the-art runtime complexity with stronger guarantees on the output than previously known, namely in Rényi divergence (which implies TV, $\mathcal{W}_2$, KL, $\chi^2$). The proof departs from known approaches for polytime algorithms for the problem -- we utilize a stochastic diffusion perspective to show contraction to the target distribution with the rate of convergence determined by functional isoperimetric constants of the stationary density.
Learning general Gaussian mixtures with efficient score matching
Sitan Chen, Vasilis Kontonis, Kulin Shah
Apr 30 2024 cs.DS cs.LG stat.ML arXiv:2404.18893v1

@misc{2404.18893, author = {Sitan Chen and Vasilis Kontonis and Kulin Shah}, title = {{L}earning general {G}aussian mixtures with efficient score matching}, year = {2024}, eprint = {2404.18893}, note = {arXiv:2404.18893v1} }
We study the problem of learning mixtures of $k$ Gaussians in $d$ dimensions. We make no separation assumptions on the underlying mixture components: we only require that the covariance matrices have bounded condition number and that the means and covariances lie in a ball of bounded radius. We give an algorithm that draws $d^{\mathrm{poly}(k/\varepsilon)}$ samples from the target mixture, runs in sample-polynomial time, and constructs a sampler whose output distribution is $\varepsilon$-close to the unknown mixture in total variation. Prior works for this problem either (i) required exponential runtime in the dimension $d$, (ii) placed strong assumptions on the instance (e.g., spherical covariances or clusterability), or (iii) had doubly exponential dependence on the number of components $k$. Our approach departs from commonly used techniques for this problem like the method of moments. Instead, we leverage a recently developed reduction, based on diffusion models, from distribution learning to a supervised learning task called score matching. We give an algorithm for the latter by proving a structural result showing that the score function of a Gaussian mixture can be approximated by a piecewise-polynomial function, and there is an efficient algorithm for finding it. To our knowledge, this is the first example of diffusion models achieving a state-of-the-art theoretical guarantee for an unsupervised learning task.
Optimal Robust Estimation under Local and Global Corruptions: Stronger Adversary and Smaller Error
Thanasis Pittas, Ankit Pensia
Oct 23 2024 cs.DS cs.LG math.ST stat.ML stat.TH arXiv:2410.17230v1

@misc{2410.17230, author = {Thanasis Pittas and Ankit Pensia}, title = {{O}ptimal {R}obust {E}stimation under {L}ocal and {G}lobal {C}orruptions: {S}tronger {A}dversary and {S}maller {E}rror}, year = {2024}, eprint = {2410.17230}, note = {arXiv:2410.17230v1} }
Algorithmic robust statistics has traditionally focused on the contamination model where a small fraction of the samples are arbitrarily corrupted. We consider a recent contamination model that combines two kinds of corruptions: (i) small fraction of arbitrary outliers, as in classical robust statistics, and (ii) local perturbations, where samples may undergo bounded shifts on average. While each noise model is well understood individually, the combined contamination model poses new algorithmic challenges, with only partial results known. Existing efficient algorithms are limited in two ways: (i) they work only for a weak notion of local perturbations, and (ii) they obtain suboptimal error for isotropic subgaussian distributions (among others). The latter limitation led [NGS24, COLT'24] to hypothesize that improving the error might, in fact, be computationally hard. Perhaps surprisingly, we show that information theoretically optimal error can indeed be achieved in polynomial time, under an even \emph{stronger} local perturbation model (the sliced-Wasserstein metric as opposed to the Wasserstein metric). Notably, our analysis reveals that the entire family of stability-based robust mean estimators continues to work optimally in a black-box manner for the combined contamination model. This generalization is particularly useful in real-world scenarios where the specific form of data corruption is not known in advance. We also present efficient algorithms for distribution learning and principal component analysis in the combined contamination model.
Covariance estimation using Markov chain Monte Carlo
Yunbum Kook, Matthew S. Zhang
Oct 23 2024 math.ST cs.DS cs.LG stat.ML stat.TH arXiv:2410.17147v1

@misc{2410.17147, author = {Yunbum Kook and Matthew S.~Zhang}, title = {{C}ovariance estimation using {M}arkov chain {M}onte {C}arlo}, year = {2024}, eprint = {2410.17147}, note = {arXiv:2410.17147v1} }
We investigate the complexity of covariance matrix estimation for Gibbs distributions based on dependent samples from a Markov chain. We show that when $\pi$ satisfies a Poincaré inequality and the chain possesses a spectral gap, we can achieve similar sample complexity using MCMC as compared to an estimator constructed using i.i.d. samples, with potentially much better query complexity. As an application of our methods, we show improvements for the query complexity in both constrained and unconstrained settings for concrete instances of MCMC. In particular, we provide guarantees regarding isotropic rounding procedures for sampling uniformly on convex bodies.
Optimal Design for Reward Modeling in RLHF
Antoine Scheid, Etienne Boursier, Alain Durmus, Michael I. Jordan, Pierre Ménard, Eric Moulines, Michal Valko
Oct 23 2024 cs.LG stat.ML arXiv:2410.17055v1

@misc{2410.17055, author = {Antoine Scheid and Etienne Boursier and Alain Durmus and Michael I.~Jordan and Pierre Ménard and Eric Moulines and Michal Valko}, title = {{O}ptimal {D}esign for {R}eward {M}odeling in {RLHF}}, year = {2024}, eprint = {2410.17055}, note = {arXiv:2410.17055v1} }
Reinforcement Learning from Human Feedback (RLHF) has become a popular approach to align language models (LMs) with human preferences. This method involves collecting a large dataset of human pairwise preferences across various text generations and using it to infer (implicitly or explicitly) a reward model. Numerous methods have been proposed to learn the reward model and align an LM with it. However, the costly process of collecting human preferences has received little attention and could benefit from theoretical insights. This paper addresses this issue and aims to formalize the reward training model in RLHF. We frame the selection of an effective dataset as a simple regret minimization task, using a linear contextual dueling bandit method. Given the potentially large number of arms, this approach is more coherent than the best-arm identification setting. We then propose an offline framework for solving this problem. Under appropriate assumptions - linearity of the reward model in the embedding space, and boundedness of the reward parameter - we derive bounds on the simple regret. Finally, we provide a lower bound that matches our upper bound up to constant and logarithmic terms. To our knowledge, this is the first theoretical contribution in this area to provide an offline approach as well as worst-case guarantees.
Building Conformal Prediction Intervals with Approximate Message Passing
Lucas Clarté, Lenka Zdeborová
Oct 23 2024 stat.ML cond-mat.dis-nn cs.LG arXiv:2410.16493v1

@misc{2410.16493, author = {Lucas Clarté and Lenka Zdeborová}, title = {{B}uilding {C}onformal {P}rediction {I}ntervals with {A}pproximate {M}essage {P}assing}, year = {2024}, eprint = {2410.16493}, note = {arXiv:2410.16493v1} }
Conformal prediction has emerged as a powerful tool for building prediction intervals that are valid in a distribution-free way. However, its evaluation may be computationally costly, especially in the high-dimensional setting where the dimensionality and sample sizes are both large and of comparable magnitudes. To address this challenge in the context of generalized linear regression, we propose a novel algorithm based on Approximate Message Passing (AMP) to accelerate the computation of prediction intervals using full conformal prediction, by approximating the computation of conformity scores. Our work bridges a gap between modern uncertainty quantification techniques and tools for high-dimensional problems involving the AMP algorithm. We evaluate our method on both synthetic and real data, and show that it produces prediction intervals that are close to the baseline methods, while being orders of magnitude faster. Additionally, in the high-dimensional limit and under assumptions on the data distribution, the conformity scores computed by AMP converge to the one computed exactly, which allows theoretical study and benchmarking of conformal methods in high dimensions.
Asymptotically Optimal Change Detection for Unnormalized Pre- and Post-Change Distributions
Arman Adibi, Sanjeev Kulkarni, H. Vincent Poor, Taposh Banerjee, Vahid Tarokh
Oct 21 2024 stat.ML cs.AI cs.IT cs.LG eess.SP math.IT arXiv:2410.14615v1

@misc{2410.14615, author = {Arman Adibi and Sanjeev Kulkarni and H.~Vincent Poor and Taposh Banerjee and Vahid Tarokh}, title = {{A}symptotically {O}ptimal {C}hange {D}etection for {U}nnormalized {P}re- and {P}ost-{C}hange {D}istributions}, year = {2024}, eprint = {2410.14615}, note = {arXiv:2410.14615v1} }
This paper addresses the problem of detecting changes when only unnormalized pre- and post-change distributions are accessible. This situation happens in many scenarios in physics such as in ferromagnetism, crystallography, magneto-hydrodynamics, and thermodynamics, where the energy models are difficult to normalize. Our approach is based on the estimation of the Cumulative Sum (CUSUM) statistics, which is known to produce optimal performance. We first present an intuitively appealing approximation method. Unfortunately, this produces a biased estimator of the CUSUM statistics and may cause performance degradation. We then propose the Log-Partition Approximation Cumulative Sum (LPA-CUSUM) algorithm based on thermodynamic integration (TI) in order to estimate the log-ratio of normalizing constants of pre- and post-change distributions. It is proved that this approach gives an unbiased estimate of the log-partition function and the CUSUM statistics, and leads to an asymptotically optimal performance. Moreover, we derive a relationship between the required sample size for thermodynamic integration and the desired detection delay performance, offering guidelines for practical parameter selection. Numerical studies are provided demonstrating the efficacy of our approach.
How Numerical Precision Affects Mathematical Reasoning Capabilities of LLMs
Guhao Feng, Kai Yang, Yuntian Gu, Xinyue Ai, Shengjie Luo, Jiacheng Sun, Di He, Zhenguo Li, Liwei Wang
Oct 18 2024 cs.LG cs.AI cs.CL stat.ML arXiv:2410.13857v1

@misc{2410.13857, author = {Guhao Feng and Kai Yang and Yuntian Gu and Xinyue Ai and Shengjie Luo and Jiacheng Sun and Di He and Zhenguo Li and Liwei Wang}, title = {{H}ow {N}umerical {P}recision {A}ffects {M}athematical {R}easoning {C}apabilities of {LLM}s}, year = {2024}, eprint = {2410.13857}, note = {arXiv:2410.13857v1} }
Despite the remarkable success of Transformer-based Large Language Models (LLMs) across various domains, understanding and enhancing their mathematical capabilities remains a significant challenge. In this paper, we conduct a rigorous theoretical analysis of LLMs' mathematical abilities, with a specific focus on their arithmetic performances. We identify numerical precision as a key factor that influences their effectiveness in mathematical tasks. Our results show that Transformers operating with low numerical precision fail to address arithmetic tasks, such as iterated addition and integer multiplication, unless the model size grows super-polynomially with respect to the input length. In contrast, Transformers with standard numerical precision can efficiently handle these tasks with significantly smaller model sizes. We further support our theoretical findings through empirical experiments that explore the impact of varying numerical precision on arithmetic tasks, providing valuable insights for improving the mathematical reasoning capabilities of LLMs.
Geometry-Aware Generative Autoencoders for Warped Riemannian Metric Learning and Generative Modeling on Data Manifolds
Xingzhi Sun, Danqi Liao, Kincaid MacDonald, Yanlei Zhang, Chen Liu, Guillaume Huguet, Guy Wolf, Ian Adelstein, Tim G. J. Rudner, Smita Krishnaswamy
Oct 17 2024 cs.LG math.DG stat.ML arXiv:2410.12779v2

@misc{2410.12779, author = {Xingzhi Sun and Danqi Liao and Kincaid MacDonald and Yanlei Zhang and Chen Liu and Guillaume Huguet and Guy Wolf and Ian Adelstein and Tim G.~J.~Rudner and Smita Krishnaswamy}, title = {{G}eometry-{A}ware {G}enerative {A}utoencoders for {W}arped {R}iemannian {M}etric {L}earning and {G}enerative {M}odeling on {D}ata {M}anifolds}, year = {2024}, eprint = {2410.12779}, note = {arXiv:2410.12779v2} }
Rapid growth of high-dimensional datasets in fields such as single-cell RNA sequencing and spatial genomics has led to unprecedented opportunities for scientific discovery, but it also presents unique computational and statistical challenges. Traditional methods struggle with geometry-aware data generation, interpolation along meaningful trajectories, and transporting populations via feasible paths. To address these issues, we introduce Geometry-Aware Generative Autoencoder (GAGA), a novel framework that combines extensible manifold learning with generative modeling. GAGA constructs a neural network embedding space that respects the intrinsic geometries discovered by manifold learning and learns a novel warped Riemannian metric on the data space. This warped metric is derived from both the points on the data manifold and negative samples off the manifold, allowing it to characterize a meaningful geometry across the entire latent space. Using this metric, GAGA can uniformly sample points on the manifold, generate points along geodesics, and interpolate between populations across the learned manifold using geodesic-guided flows. GAGA shows competitive performance in simulated and real-world datasets, including a 30% improvement over the state-of-the-art methods in single-cell population-level trajectory inference.
Replicable Uniformity Testing
Sihan Liu, Christopher Ye
Oct 16 2024 stat.ML cs.DS cs.LG arXiv:2410.10892v1

@misc{2410.10892, author = {Sihan Liu and Christopher Ye}, title = {{R}eplicable {U}niformity {T}esting}, year = {2024}, eprint = {2410.10892}, note = {arXiv:2410.10892v1} }
Uniformity testing is arguably one of the most fundamental distribution testing problems. Given sample access to an unknown distribution $\mathbf{p}$ on $[n]$, one must decide if $\mathbf{p}$ is uniform or $\varepsilon$-far from uniform (in total variation distance). A long line of work established that uniformity testing has sample complexity $\Theta(\sqrt{n}\varepsilon^{-2})$. However, when the input distribution is neither uniform nor far from uniform, known algorithms may have highly non-replicable behavior. Consequently, if these algorithms are applied in scientific studies, they may lead to contradictory results that erode public trust in science. In this work, we revisit uniformity testing under the framework of algorithmic replicability [STOC '22], requiring the algorithm to be replicable under arbitrary distributions. While replicability typically incurs a $\rho^{-2}$ factor overhead in sample complexity, we obtain a replicable uniformity tester using only $\tilde{O}(\sqrt{n} \varepsilon^{-2} \rho^{-1})$ samples. To our knowledge, this is the first replicable learning algorithm with (nearly) linear dependence on $\rho$. Lastly, we consider a class of ``symmetric'' algorithms [FOCS '00] whose outputs are invariant under relabeling of the domain $[n]$, which includes all existing uniformity testers (including ours). For this natural class of algorithms, we prove a nearly matching sample complexity lower bound for replicable uniformity testing.
Provable Convergence and Limitations of Geometric Tempering for Langevin Dynamics
Omar Chehab, Anna Korba, Austin Stromme, Adrien Vacher
Oct 15 2024 stat.ML cs.LG stat.CO arXiv:2410.09697v1

@misc{2410.09697, author = {Omar Chehab and Anna Korba and Austin Stromme and Adrien Vacher}, title = {{P}rovable {C}onvergence and {L}imitations of {G}eometric {T}empering for {L}angevin {D}ynamics}, year = {2024}, eprint = {2410.09697}, note = {arXiv:2410.09697v1} }
Geometric tempering is a popular approach to sampling from challenging multi-modal probability distributions by instead sampling from a sequence of distributions which interpolate, using the geometric mean, between an easier proposal distribution and the target distribution. In this paper, we theoretically investigate the soundness of this approach when the sampling algorithm is Langevin dynamics, proving both upper and lower bounds. Our upper bounds are the first analysis in the literature under functional inequalities. They assert the convergence of tempered Langevin in continuous and discrete-time, and their minimization leads to closed-form optimal tempering schedules for some pairs of proposal and target distributions. Our lower bounds demonstrate a simple case where the geometric tempering takes exponential time, and further reveal that the geometric tempering can suffer from poor functional inequalities and slow convergence, even when the target distribution is well-conditioned. Overall, our results indicate that geometric tempering may not help, and can even be harmful for convergence.
The 2020 United States Decennial Census Is More Private Than You (Might) Think
Buxin Su, Weijie J. Su, Chendi Wang
Oct 15 2024 cs.CR cs.DS stat.AP stat.ML arXiv:2410.09296v1

@misc{2410.09296, author = {Buxin Su and Weijie J.~Su and Chendi Wang}, title = {{T}he 2020 {U}nited {S}tates {D}ecennial {C}ensus {I}s {M}ore {P}rivate {T}han {Y}ou ({M}ight) {T}hink}, year = {2024}, eprint = {2410.09296}, note = {arXiv:2410.09296v1} }
The U.S. Decennial Census serves as the foundation for many high-profile policy decision-making processes, including federal funding allocation and redistricting. In 2020, the Census Bureau adopted differential privacy to protect the confidentiality of individual responses through a disclosure avoidance system that injects noise into census data tabulations. The Bureau subsequently posed an open question: Could sharper privacy guarantees be obtained for the 2020 U.S. Census compared to their published guarantees, or equivalently, had the nominal privacy budgets been fully utilized? In this paper, we affirmatively address this open problem by demonstrating that between 8.50% and 13.76% of the privacy budget for the 2020 U.S. Census remains unused for each of the eight geographical levels, from the national level down to the block level. This finding is made possible through our precise tracking of privacy losses using $f$-differential privacy, applied to the composition of private queries across various geographical levels. Our analysis indicates that the Census Bureau introduced unnecessarily high levels of injected noise to achieve the claimed privacy guarantee for the 2020 U.S. Census. Consequently, our results enable the Bureau to reduce noise variances by 15.08% to 24.82% while maintaining the same privacy budget for each geographical level, thereby enhancing the accuracy of privatized census statistics. We empirically demonstrate that reducing noise injection into census statistics mitigates distortion caused by privacy constraints in downstream applications of private census data, illustrated through a study examining the relationship between earnings and education.
