Statistics Theory (math.ST)

Optimal Robust Estimation under Local and Global Corruptions: Stronger Adversary and Smaller Error
Thanasis Pittas, Ankit Pensia
Oct 23 2024 cs.DS cs.LG math.ST stat.ML stat.TH arXiv:2410.17230v1

@misc{2410.17230, author = {Thanasis Pittas and Ankit Pensia}, title = {{O}ptimal {R}obust {E}stimation under {L}ocal and {G}lobal {C}orruptions: {S}tronger {A}dversary and {S}maller {E}rror}, year = {2024}, eprint = {2410.17230}, note = {arXiv:2410.17230v1} }
Algorithmic robust statistics has traditionally focused on the contamination model where a small fraction of the samples are arbitrarily corrupted. We consider a recent contamination model that combines two kinds of corruptions: (i) a small fraction of arbitrary outliers, as in classical robust statistics, and (ii) local perturbations, where samples may undergo bounded shifts on average. While each noise model is well understood individually, the combined contamination model poses new algorithmic challenges, with only partial results known. Existing efficient algorithms are limited in two ways: (i) they work only for a weak notion of local perturbations, and (ii) they obtain suboptimal error for isotropic subgaussian distributions (among others). The latter limitation led [NGS24, COLT'24] to hypothesize that improving the error might, in fact, be computationally hard. Perhaps surprisingly, we show that information-theoretically optimal error can indeed be achieved in polynomial time, under an even \emph{stronger} local perturbation model (the sliced-Wasserstein metric as opposed to the Wasserstein metric). Notably, our analysis reveals that the entire family of stability-based robust mean estimators continues to work optimally in a black-box manner for the combined contamination model. This generalization is particularly useful in real-world scenarios where the specific form of data corruption is not known in advance. We also present efficient algorithms for distribution learning and principal component analysis in the combined contamination model.
Covariance estimation using Markov chain Monte Carlo
Yunbum Kook, Matthew S. Zhang
Oct 23 2024 math.ST cs.DS cs.LG stat.ML stat.TH arXiv:2410.17147v1

@misc{2410.17147, author = {Yunbum Kook and Matthew S.~Zhang}, title = {{C}ovariance estimation using {M}arkov chain {M}onte {C}arlo}, year = {2024}, eprint = {2410.17147}, note = {arXiv:2410.17147v1} }
We investigate the complexity of covariance matrix estimation for Gibbs distributions based on dependent samples from a Markov chain. We show that when $\pi$ satisfies a Poincaré inequality and the chain possesses a spectral gap, we can achieve similar sample complexity using MCMC as compared to an estimator constructed using i.i.d. samples, with potentially much better query complexity. As an application of our methods, we show improvements for the query complexity in both constrained and unconstrained settings for concrete instances of MCMC. In particular, we provide guarantees regarding isotropic rounding procedures for sampling uniformly on convex bodies.
A Short Note on the Geometric Ergodicity of Markov Chains for Bayesian Linear Regression Models with Heavy-Tailed Errors
Yasuyuki Hamura
Oct 23 2024 math.ST stat.CO stat.TH arXiv:2410.17070v1

@misc{2410.17070, author = {Yasuyuki Hamura}, title = {{A} {S}hort {N}ote on the {G}eometric {E}rgodicity of {M}arkov {C}hains for {B}ayesian {L}inear {R}egression {M}odels with {H}eavy-{T}ailed {E}rrors}, year = {2024}, eprint = {2410.17070}, note = {arXiv:2410.17070v1} }
In this short note, we consider posterior simulation for a linear regression model when the error distribution is given by a scale mixture of multivariate normals. We show that the sampler of Backlund and Hobert (2020) for the case of the conditionally conjugate normal-inverse Wishart prior is geometrically ergodic even when the error density is heavier-tailed.
Federated Causal Inference: Multi-Centric ATE Estimation beyond Meta-Analysis
Rémi Khellaf, Aurélien Bellet, Julie Josse
Oct 23 2024 stat.ML cs.LG math.ST stat.TH arXiv:2410.16870v1

@misc{2410.16870, author = {Rémi Khellaf and Aurélien Bellet and Julie Josse}, title = {{F}ederated {C}ausal {I}nference: {M}ulti-{C}entric {ATE} {E}stimation beyond {M}eta-{A}nalysis}, year = {2024}, eprint = {2410.16870}, note = {arXiv:2410.16870v1} }
We study Federated Causal Inference, an approach to estimate treatment effects from decentralized data across centers. We compare three classes of Average Treatment Effect (ATE) estimators derived from the Plug-in G-Formula, ranging from simple meta-analysis to one-shot and multi-shot federated learning, the latter leveraging the full data to learn the outcome model (albeit requiring more communication). Focusing on Randomized Controlled Trials (RCTs), we derive the asymptotic variance of these estimators for linear models. Our results provide practical guidance on selecting the appropriate estimator for various scenarios, including heterogeneity in sample sizes, covariate distributions, treatment assignment schemes, and center effects. We validate these findings with a simulation study.
General Frameworks for Conditional Two-Sample Testing
Seongchan Lee, Suman Cha, Ilmun Kim
Oct 23 2024 stat.ML cs.LG math.ST stat.TH arXiv:2410.16636v1

@misc{2410.16636, author = {Seongchan Lee and Suman Cha and Ilmun Kim}, title = {{G}eneral {F}rameworks for {C}onditional {T}wo-{S}ample {T}esting}, year = {2024}, eprint = {2410.16636}, note = {arXiv:2410.16636v1} }
We study the problem of conditional two-sample testing, which aims to determine whether two populations have the same distribution after accounting for confounding factors. This problem commonly arises in various applications, such as domain adaptation and algorithmic fairness, where comparing two groups is essential while controlling for confounding variables. We begin by establishing a hardness result for conditional two-sample testing, demonstrating that no valid test can have significant power against any single alternative without proper assumptions. We then introduce two general frameworks that implicitly or explicitly target specific classes of distributions for their validity and power. Our first framework allows us to convert any conditional independence test into a conditional two-sample test in a black-box manner, while preserving the asymptotic properties of the original conditional independence test. The second framework transforms the problem into comparing marginal distributions with estimated density ratios, which allows us to leverage existing methods for marginal two-sample testing. We demonstrate this idea in a concrete manner with classification and kernel-based methods. Finally, simulation studies are conducted to illustrate the proposed frameworks in finite-sample scenarios.
High-dimensional Grouped-regression using Bayesian Sparse Projection-posterior
Samhita Pal, Subhashis Ghoshal
Oct 23 2024 stat.ME math.ST stat.TH arXiv:2410.16577v1

@misc{2410.16577, author = {Samhita Pal and Subhashis Ghoshal}, title = {{H}igh-dimensional {G}rouped-regression using {B}ayesian {S}parse {P}rojection-posterior}, year = {2024}, eprint = {2410.16577}, note = {arXiv:2410.16577v1} }
We consider a novel Bayesian approach to estimation, uncertainty quantification, and variable selection for a high-dimensional linear regression model under sparsity. The number of predictors can be nearly exponentially large relative to the sample size. We put a conjugate normal prior initially disregarding sparsity, but for making an inference, instead of the original multivariate normal posterior, we use the posterior distribution induced by a map transforming the vector of regression coefficients to a sparse vector obtained by minimizing the sum of squares of deviations plus a suitably scaled $\ell_1$-penalty on the vector. We show that the resulting sparse projection-posterior distribution contracts around the true value of the parameter at the optimal rate adapted to the sparsity of the vector. We show that the true sparsity structure gets a large sparse projection-posterior probability. We further show that an appropriately recentred credible ball has the correct asymptotic frequentist coverage. Finally, we describe how the computational burden can be distributed to many machines, each dealing with only a small fraction of the whole dataset. We conduct a comprehensive simulation study under a variety of settings and find that the proposed method performs well for finite sample sizes. We also apply the method to several real datasets, including the ADNI data, and compare its performance with the state-of-the-art methods. We implemented the method in the \texttt{R} package called \texttt{sparseProj}, and all computations have been carried out using this package.
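One plausible way to write the map described in this abstract is given below. This is a sketch only: the design matrix $X$, the penalty level $\lambda$, the dimension $p$, and the choice to measure deviations between the fitted values $X\beta$ and $Xu$ are assumptions made for illustration, since the abstract does not fix the notation.

```latex
% Assumed notation: X is the n x p design matrix, \beta is a draw from the
% (dense) multivariate normal posterior, and \lambda > 0 is the penalty level.
\[
  \beta \;\mapsto\; \beta^{*}
  = \operatorname*{arg\,min}_{u \in \mathbb{R}^{p}}
    \Bigl\{ \lVert X\beta - Xu \rVert_2^{2} + \lambda \lVert u \rVert_1 \Bigr\}
\]
```

Under this reading, each posterior draw $\beta$ is pushed through a lasso-type problem, and the distribution of the resulting sparse $\beta^{*}$ is the sparse projection-posterior.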
On the breakdown point of transport-based quantiles
Marco Avella-Medina, Alberto González-Sanz
Oct 23 2024 math.ST stat.ME stat.TH arXiv:2410.16554v1

@misc{2410.16554, author = {Marco Avella-Medina and Alberto González-Sanz}, title = {{O}n the breakdown point of transport-based quantiles}, year = {2024}, eprint = {2410.16554}, note = {arXiv:2410.16554v1} }
Recent work has used optimal transport ideas to generalize the notion of (center-outward) quantiles to dimension $d\geq 2$. We study the robustness properties of these transport-based quantiles by deriving their breakdown point, roughly, the smallest amount of contamination required to make these quantiles take arbitrarily aberrant values. We prove that the transport median defined in Chernozhukov et al.~(2017) and Hallin et al.~(2021) has breakdown point of $1/2$. Moreover, a point in the transport depth contour of order $\tau\in [0,1/2]$ has breakdown point of $\tau$. This shows that the multivariate transport depth shares the same breakdown properties as its univariate counterpart. Our proof relies on a general argument connecting the breakdown point of transport maps evaluated at a point to the Tukey depth of that point in the reference measure.
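To make the notion of breakdown point concrete, the following minimal sketch looks at the univariate median, the classical counterpart the abstract refers to, not the transport-based quantiles themselves. The helper `contaminated_median` and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def contaminated_median(clean, num_outliers, outlier_value):
    """Median after replacing the first num_outliers entries with outlier_value."""
    corrupted = np.array(clean, dtype=float)  # copy so the clean data is untouched
    corrupted[:num_outliers] = outlier_value
    return float(np.median(corrupted))

# Toy sample: the integers 0..99 (an assumption for illustration).
clean = np.arange(100)

# With fewer than half of the points corrupted, the median stays within
# the range of the clean data, no matter how large the outliers are.
below_half = contaminated_median(clean, 49, 1e9)

# With more than half corrupted, the median follows the outliers.
above_half = contaminated_median(clean, 51, 1e9)

print(below_half, above_half)
```

Raising `outlier_value` further leaves `below_half` unchanged while `above_half` grows with it, which is the behaviour a breakdown point of $1/2$ describes.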
On The Variance of Schatten $p$-Norm Estimation with Gaussian Sketching Matrices
Lior Horesh, Vasileios Kalantzis, Yingdong Lu, Tomasz Nowicki
Oct 23 2024 math.ST cs.NA math.NA math.PR stat.TH arXiv:2410.16455v1

@misc{2410.16455, author = {Lior Horesh and Vasileios Kalantzis and Yingdong Lu and Tomasz Nowicki}, title = {{O}n {T}he {V}ariance of {S}chatten $p$-{N}orm {E}stimation with {G}aussian {S}ketching {M}atrices}, year = {2024}, eprint = {2410.16455}, note = {arXiv:2410.16455v1} }
Monte Carlo matrix trace estimation is a popular randomized technique to estimate the trace of implicitly-defined matrices via averaging quadratic forms across several observations of a random vector. The most common approach to analyze the quality of such estimators is to consider the variance over the total number of observations. In this paper we present a procedure to compute the variance of the estimator proposed by Kong and Valiant [Ann. Statist. 45 (5), pp. 2218 - 2247] for the case of Gaussian random vectors and provide a sharper bound than previously available.
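The basic technique named in the first sentence, averaging quadratic forms $z^\top A z$ over Gaussian random vectors $z$, can be sketched as follows. This is the plain Monte Carlo trace estimator, not the Kong and Valiant Schatten $p$-norm estimator analyzed in the paper; the function name `gaussian_trace_estimate` and the test matrix are assumptions for illustration.

```python
import numpy as np

def gaussian_trace_estimate(matvec, dim, num_samples, seed=0):
    """Estimate tr(A) by averaging z^T (A z) over standard Gaussian vectors z.

    matvec: function returning A @ v, so A only needs to be defined implicitly.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.standard_normal(dim)
        total += z @ matvec(z)  # one quadratic-form observation
    return total / num_samples

# Small explicit matrix used only to check the sketch; its trace is 6.
A = np.diag([1.0, 2.0, 3.0])
estimate = gaussian_trace_estimate(lambda v: A @ v, dim=3, num_samples=20000)
print(estimate)
```

For standard Gaussian vectors and symmetric $A$, each observation has mean $\operatorname{tr}(A)$ and variance $2\lVert A \rVert_F^2$, so the variance of the average decays like $1/m$ in the number of observations $m$.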
Data Augmentation of Multivariate Sensor Time Series using Autoregressive Models and Application to Failure Prognostics
Douglas Baptista de Souza, Bruno Paes Leao
Oct 23 2024 stat.ML cs.LG math.ST stat.ME stat.TH arXiv:2410.16419v1

@misc{2410.16419, author = {Douglas Baptista de Souza and Bruno Paes Leao}, title = {{D}ata {A}ugmentation of {M}ultivariate {S}ensor {T}ime {S}eries using {A}utoregressive {M}odels and {A}pplication to {F}ailure {P}rognostics}, year = {2024}, eprint = {2410.16419}, note = {arXiv:2410.16419v1} }
This work presents a novel data augmentation solution for non-stationary multivariate time series and its application to failure prognostics. The method extends previous work from the authors which is based on time-varying autoregressive processes. It can be employed to extract key information from a limited number of samples and generate new synthetic samples in a way that potentially improves the performance of PHM solutions. This is especially valuable in situations of data scarcity, which are very common in PHM, especially for failure prognostics. The proposed approach is tested on the CMAPSS dataset, commonly employed for prognostics experiments and benchmarks. An AutoML approach from PHM literature is employed for automating the design of the prognostics solution. The empirical evaluation provides evidence that the proposed method can substantially improve the performance of PHM solutions.
Simplicity Bias via Global Convergence of Sharpness Minimization
Khashayar Gatmiry, Zhiyuan Li, Sashank J. Reddi, Stefanie Jegelka
Oct 23 2024 cs.LG math.ST stat.ML stat.TH arXiv:2410.16401v1

@misc{2410.16401, author = {Khashayar Gatmiry and Zhiyuan Li and Sashank J.~Reddi and Stefanie Jegelka}, title = {{S}implicity {B}ias via {G}lobal {C}onvergence of {S}harpness {M}inimization}, year = {2024}, eprint = {2410.16401}, note = {arXiv:2410.16401v1} }
The remarkable generalization ability of neural networks is usually attributed to the implicit bias of SGD, which often yields models with lower complexity using simpler (e.g. linear) and low-rank features. Recent works have provided empirical and theoretical evidence for the bias of particular variants of SGD (such as label noise SGD) toward flatter regions of the loss landscape. Despite the folklore intuition that flat solutions are 'simple', the connection with the simplicity of the final trained model (e.g. low-rank) is not well understood. In this work, we take a step toward bridging this gap by studying the simplicity structure that arises from minimizers of the sharpness for a class of two-layer neural networks. We show that, for any high-dimensional training data and certain activations, with small enough step size, label noise SGD always converges to a network that replicates a single linear feature across all neurons, thereby implying a simple rank-one feature matrix. To obtain this result, our main technical contribution is to show that label noise SGD always minimizes the sharpness on the manifold of models with zero loss for two-layer networks. Along the way, we discover a novel property -- a local geodesic convexity -- of the trace of Hessian of the loss at approximate stationary points on the manifold of zero loss, which links sharpness to the geometry of the manifold. This tool may be of independent interest.
Implicit Regularization for Tubal Tensor Factorizations via Gradient Descent
Santhosh Karnik, Anna Veselovska, Mark Iwen, Felix Krahmer
Oct 23 2024 cs.LG math.OC math.ST stat.ML stat.TH arXiv:2410.16247v1

@misc{2410.16247, author = {Santhosh Karnik and Anna Veselovska and Mark Iwen and Felix Krahmer}, title = {{I}mplicit {R}egularization for {T}ubal {T}ensor {F}actorizations via {G}radient {D}escent}, year = {2024}, eprint = {2410.16247}, note = {arXiv:2410.16247v1} }
We provide a rigorous analysis of implicit regularization in an overparametrized tensor factorization problem beyond the lazy training regime. For matrix factorization problems, this phenomenon has been studied in a number of works. A particular challenge has been to design universal initialization strategies which provably lead to implicit regularization in gradient-descent methods. At the same time, it has been argued by Cohen et al. (2016) that more general classes of neural networks can be captured by considering tensor factorizations. However, in the tensor case, implicit regularization has only been rigorously established for gradient flow or in the lazy training regime. In this paper, we prove the first tensor result of its kind for gradient descent rather than gradient flow. We focus on the tubal tensor product and the associated notion of low tubal rank, encouraged by the relevance of this model for image data. We establish that gradient descent in an overparametrized tensor factorization model with a small random initialization exhibits an implicit bias towards solutions of low tubal rank. Our theoretical findings are illustrated in an extensive set of numerical simulations showcasing the dynamics predicted by our theory as well as the crucial role of using a small random initialization.