Statistics
Showing new listings for Friday, 25 October 2024
- [1] arXiv:2410.18144 [pdf, other]
Title: Using Platt's scaling for calibration after undersampling -- limitations and how to address them
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)
When modelling data where the response is dichotomous and highly imbalanced, response-based sampling where a subset of the majority class is retained (i.e., undersampling) is often used to create more balanced training datasets prior to modelling. However, the models fit to this undersampled data, which we refer to as base models, generate predictions that are severely biased. There are several calibration methods that can be used to combat this bias, one of which is Platt's scaling. Here, a logistic regression model is used to model the relationship between the base model's original predictions and the response. Despite its popularity for calibrating models after undersampling, Platt's scaling was not designed for this purpose. Our work presents what we believe is the first detailed study focused on the validity of using Platt's scaling to calibrate models after undersampling. We show analytically, as well as via a simulation study and a case study, that Platt's scaling should not be used for calibration after undersampling without critical thought. If Platt's scaling would have been able to successfully calibrate the base model had it been trained on the entire dataset (i.e., without undersampling), then Platt's scaling might be appropriate for calibration after undersampling. If this is not the case, we recommend a modified version of Platt's scaling that fits a logistic generalized additive model to the logit of the base model's predictions, as it is both theoretically motivated and performed well across the settings considered in our study.
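The bias mentioned above can be illustrated with the textbook Bayes argument for response-based sampling. The sketch below is not taken from the paper: it assumes every minority-class (positive) row is kept, each majority-class row is kept independently with a hypothetical probability `beta`, and the base model is perfectly calibrated on the undersampled data. Under those assumptions the odds are inflated by a factor of 1/beta, so the logit is shifted by -log(beta). All function names are illustrative.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def undersampled_probability(p, beta):
    """Probability of the positive class in the undersampled data,
    given the full-data probability p and majority retention rate beta
    (assumption: negatives kept independently with probability beta)."""
    return p / (p + beta * (1.0 - p))


def corrected_probability(p_s, beta):
    """Invert the shift: map an undersampled-data probability p_s
    back to the full-data scale."""
    return beta * p_s / (beta * p_s + 1.0 - p_s)


if __name__ == "__main__":
    beta = 0.05  # hypothetical: keep 5% of the majority class
    for p in (0.001, 0.01, 0.1):
        p_s = undersampled_probability(p, beta)
        print(p, round(p_s, 4), round(corrected_probability(p_s, beta), 4))
```

In this idealized setting a logistic recalibration on the logit of the base prediction with slope 1 and intercept log(beta) would undo the shift exactly; the abstract's point is that real base models need not satisfy these assumptions.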
- [2] arXiv:2410.18162 [pdf, html, other]
Title: Stochastic gradient descent in high dimensions for multi-spiked tensor PCA
Comments: 58 pages, 10 figures. This is part of our manuscript arXiv:2408.06401
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
We study the dynamics in high dimensions of online stochastic gradient descent for the multi-spiked tensor model. This multi-index model arises from the tensor principal component analysis (PCA) problem with multiple spikes, where the goal is to estimate $r$ unknown signal vectors within the $N$-dimensional unit sphere through maximum likelihood estimation from noisy observations of a $p$-tensor. We determine the number of samples and the conditions on the signal-to-noise ratios (SNRs) required to efficiently recover the unknown spikes from natural random initializations. We show that full recovery of all spikes is possible provided the number of samples scales as $N^{p-2}$, matching the algorithmic threshold identified in the rank-one case [Ben Arous, Gheissari, Jagannath 2020, 2021]. Our results are obtained through a detailed analysis of a low-dimensional system that describes the evolution of the correlations between the estimators and the spikes, while controlling the noise in the dynamics. We find that the spikes are recovered sequentially in a process we term "sequential elimination": once a correlation exceeds a critical threshold, all correlations sharing a row or column index become sufficiently small, allowing the next correlation to grow and become macroscopic. The order in which correlations become macroscopic depends on their initial values and the corresponding SNRs, leading to either exact recovery or recovery of a permutation of the spikes. In the matrix case, when $p=2$, if the SNRs are sufficiently separated, we achieve exact recovery of the spikes, whereas equal SNRs lead to recovery of the subspace spanned by the spikes.
- [3] arXiv:2410.18243 [pdf, html, other]
Title: Saddlepoint Monte Carlo and its Application to Exact Ecological Inference
Comments: 27 pages, 9 figures, 3 tables
Subjects: Computation (stat.CO); Applications (stat.AP); Methodology (stat.ME)
Assuming X is a random vector and A a non-invertible matrix, one sometimes needs to perform inference while only having access to samples of Y = AX. The corresponding likelihood is typically intractable. One may still be able to perform exact Bayesian inference using a pseudo-marginal sampler, but this requires an unbiased estimator of the intractable likelihood.
We propose saddlepoint Monte Carlo, a method for obtaining an unbiased estimate of the density of Y with very low variance, for any model belonging to an exponential family. Our method relies on importance sampling of the characteristic function, with insights brought by the standard saddlepoint approximation scheme with exponential tilting. We show that saddlepoint Monte Carlo makes it possible to perform exact inference on particularly challenging problems and datasets. We focus on the ecological inference problem, where one observes only aggregates at a fine level. We present in particular a study of the carryover of votes between the two rounds of various French elections, using the finest available data (number of votes for each candidate in about 60,000 polling stations over most of the French territory).
We show that existing, popular approximate methods for ecological inference can lead to substantial bias, to which saddlepoint Monte Carlo is immune. We also present original results for the 2024 legislative elections on political centre-to-left and left-to-centre conversion rates when the far-right is present in the second round. Finally, we discuss other exciting applications for saddlepoint Monte Carlo, such as dealing with aggregate data in privacy or inverse problems.
- [4] arXiv:2410.18261 [pdf, html, other]
Title: Detecting Spatial Outliers: the Role of the Local Influence Function
Subjects: Methodology (stat.ME); Econometrics (econ.EM); Applications (stat.AP)
In the analysis of large spatial datasets, identifying and treating spatial outliers is essential for accurately interpreting geographical phenomena. While spatial correlation measures, particularly Local Indicators of Spatial Association (LISA), are widely used to detect spatial patterns, the presence of abnormal observations frequently distorts the landscape and conceals critical spatial relationships. These outliers can significantly impact analysis due to the inherent spatial dependencies present in the data. Traditional influence function (IF) methodologies, commonly used in statistical analysis to measure the impact of individual observations, are not directly applicable in the spatial context because the influence of an observation is determined not only by its own value but also by its spatial location, its connections with neighboring regions, and the values of those neighboring observations. In this paper, we introduce a local version of the influence function (LIF) that accounts for these spatial dependencies. Through the analysis of both simulated and real-world datasets, we demonstrate how the LIF provides a more nuanced and accurate detection of spatial outliers compared to traditional LISA measures and local impact assessments, improving our understanding of spatial patterns.
- [5] arXiv:2410.18268 [pdf, html, other]
Title: Stabilizing black-box model selection with the inflated argmax
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Model selection is the process of choosing from a class of candidate models given data. For instance, methods such as the LASSO and sparse identification of nonlinear dynamics (SINDy) formulate model selection as finding a sparse solution to a linear system of equations determined by training data. However, absent strong assumptions, such methods are highly unstable: if a single data point is removed from the training set, a different model may be selected. This paper presents a new approach to stabilizing model selection that leverages a combination of bagging and an "inflated" argmax operation. Our method selects a small collection of models that all fit the data, and it is stable in that, with high probability, the removal of any training point will result in a collection of selected models that overlaps with the original collection. In addition to developing theoretical guarantees, we illustrate this method in (a) a simulation in which strongly correlated covariates make standard LASSO model selection highly unstable and (b) a Lotka-Volterra model selection problem focused on identifying how competition in an ecosystem influences species' abundances. In both settings, the proposed method yields stable and compact collections of selected models, outperforming a variety of benchmarks.
- [6] arXiv:2410.18282 [pdf, html, other]
Title: Improving measurement error and representativeness in nonprobability surveys
Subjects: Applications (stat.AP)
Nonprobability surveys suffer from representativeness issues due to their unknown selection mechanism. Recent research on nonprobability surveys has primarily focused on reducing such selection bias. But bias due to measurement error is also present, as pointed out by Kennedy, Mercer and Lau (2024) using a benchmarking study in the case of commercial online nonprobability surveys in the United States. Before this study, measurement error bias in nonprobability surveys had mostly been overlooked, and statistical methods had been devised for reducing only the selection bias, under the assumption of accuracy of survey responses. Motivated by this case study, our research focuses on combining two key areas in nonprobability sampling research: representativeness and measurement error, specifically aiming to mitigate bias from both sampling and measurement errors. In the context of finite population mean estimation, we propose a new composite estimator that integrates both probability and nonprobability surveys, promising improved results compared to benchmark values from large government surveys. Its performance in comparison to an existing composite estimator is analyzed in terms of mean squared error, analytically and empirically. In the context of the aforementioned case study, we further investigate when the proposed composite estimator outperforms the estimator from probability surveys alone.
- [7] arXiv:2410.18310 [pdf, html, other]
Title: About the matrix variate problem involved in the distribution of $\mathbf{E}^{-1}\mathbf{H}$
Comments: 10 pages
Subjects: Statistics Theory (math.ST)
This work studies the distribution of the nonsymmetric matrix $\mathbf{E}^{-1}\mathbf{H}$. This random product is of fundamental interest under the general multivariate linear hypothesis setting, specifically when $\mathbf{H}$ and $\mathbf{E}$ are seen as the sums of squares and the sums of products due to the hypothesis and due to the error, respectively.
- [8] arXiv:2410.18338 [pdf, html, other]
Title: Robust function-on-function interaction regression
Comments: 35 pages, 3 tables
Subjects: Methodology (stat.ME)
A function-on-function regression model with quadratic and interaction effects of the covariates provides a more flexible model. Despite several attempts to estimate the model's parameters, almost all existing estimation strategies are non-robust against outliers. Outliers in the quadratic and interaction effects may deteriorate the model structure more severely than their effects in the main effect. We propose a robust estimation strategy based on the robust functional principal component decomposition of the function-valued variables and $\tau$-estimator. The performance of the proposed method relies on the truncation parameters in the robust functional principal component decomposition of the function-valued variables. A robust Bayesian information criterion is used to determine the optimum truncation constants. A forward stepwise variable selection procedure is employed to determine relevant main, quadratic, and interaction effects to address a possible model misspecification. The finite-sample performance of the proposed method is investigated via a series of Monte-Carlo experiments. The proposed method's asymptotic consistency and influence function are also studied in the supplement, and its empirical performance is further investigated using a U.S. COVID-19 dataset.
- [9] arXiv:2410.18409 [pdf, html, other]
Title: Doubly protected estimation for survival outcomes utilizing external controls for randomized clinical trials
Subjects: Methodology (stat.ME); Applications (stat.AP)
Censored survival data are common in clinical trials, but small control groups can pose challenges, particularly in rare diseases or where balanced randomization is impractical. Recent approaches leverage external controls from historical studies or real-world data to strengthen treatment evaluation for survival outcomes. However, using external controls directly may introduce biases due to data heterogeneity. We propose a doubly protected estimator for the treatment-specific restricted mean survival time difference that is more efficient than trial-only estimators and mitigates biases from external data. Our method adjusts for covariate shifts via doubly robust estimation and addresses outcome drift using the DR-Learner for selective borrowing. The approach incorporates machine learning to approximate survival curves and detect outcome drifts without strict parametric assumptions, borrowing only comparable external controls. Extensive simulation studies and a real-data application evaluating the efficacy of Galcanezumab in mitigating migraine headaches have been conducted to illustrate the effectiveness of our proposed framework.
- [10] arXiv:2410.18435 [pdf, html, other]
Title: Forecasting Australian fertility by age, region, and birthplace
Comments: 34 pages, 6 figures, 3 tables
Subjects: Applications (stat.AP); Methodology (stat.ME)
Fertility differentials by urban-rural residence and nativity of women in Australia significantly impact population composition at sub-national levels. We aim to provide consistent fertility forecasts for Australian women characterized by age, region, and birthplace. Age-specific fertility rates at the national and sub-national levels obtained from census data between 1981 and 2011 are jointly modeled and forecast by the grouped functional time series method. Forecasts for women of each region and birthplace are reconciled following the chosen hierarchies to ensure that results at various disaggregation levels consistently sum up to the respective national total. Coupling the region of residence disaggregation structure with the trace minimization reconciliation method produces the most accurate point and interval forecasts. In addition, age-specific fertility rates disaggregated by the birthplace of women show significant heterogeneity that supports the application of the grouped forecasting method.
- [11] arXiv:2410.18437 [pdf, other]
Title: Studentized Tests of Independence: Random-Lifter approach
Subjects: Methodology (stat.ME)
The exploration of associations between random objects with complex geometric structures has catalyzed the development of various novel statistical tests encompassing distance-based and kernel-based statistics. These methods have various strengths and limitations. One problem is that their test statistics tend to converge to asymptotic null distributions involving second-order Wiener chaos, which are hard to compute and need approximation or permutation techniques that use much computing power to build rejection regions. In this work, we take an entirely different and novel strategy by using the so-called "Random-Lifter". This method is engineered to yield test statistics with the standard normal limit under null distributions without the need for sample splitting. In other words, we set our sights on having simple limiting distributions and finding the proper statistics through reverse engineering. We use the Central Limit Theorems (CLTs) for degenerate U-statistics derived from our novel association measures to do this. As a result, the asymptotic distributions of our proposed tests are straightforward to compute. Our test statistics also have the minimax property. We further substantiate that our method maintains competitive power against existing methods with minimal adjustments to constant factors. Both numerical simulations and real-data analysis corroborate the efficacy of the Random-Lifter method.
- [12] arXiv:2410.18445 [pdf, html, other]
Title: Inferring Latent Graphs from Stationary Signals Using a Graphical Autoregressive Model
Subjects: Methodology (stat.ME)
Graphs are an intuitive way to represent relationships between variables in fields such as finance and neuroscience. However, these graphs often need to be inferred from data. In this paper, we propose a novel framework to infer a latent graph by treating the observed multidimensional data as graph-referenced stationary signals. Specifically, we introduce the graphical autoregressive model (GAR), where the inverse covariance matrix of the observed signals is expressed as a second-order polynomial of the normalized graph Laplacian of the latent graph. The GAR model extends the autoregressive model from time series analysis to general undirected graphs, offering a new approach to graph inference. To estimate the latent graph, we develop a three-step procedure based on penalized maximum likelihood, supported by theoretical analysis and numerical experiments. Simulation studies and an application to S&P 500 stock price data show that the GAR model can outperform Gaussian graphical models when it fits the observed data well. Our results suggest that the GAR model offers a promising new direction for inferring latent graphs across diverse applications. Codes and example scripts are available at this https URL.
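To make the phrase "second-order polynomial of the normalized graph Laplacian" concrete, here is a minimal sketch. It builds the normalized Laplacian L = I - D^{-1/2} A D^{-1/2} of a small undirected graph and forms a generic quadratic polynomial in L. The coefficients `theta0`, `theta1`, `theta2` and all function names are hypothetical; the abstract does not state the paper's exact parameterization, and the graph is assumed to have no isolated nodes.

```python
import math


def normalized_laplacian(adj):
    """L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix
    (assumption: every node has positive degree)."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    return [
        [(1.0 if i == j else 0.0) - adj[i][j] / math.sqrt(deg[i] * deg[j])
         for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Plain dense matrix product for small square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def polynomial_precision(adj, theta0, theta1, theta2):
    """theta0 * I + theta1 * L + theta2 * L^2 (illustrative coefficients).
    With theta0 > 0 and theta1, theta2 >= 0 the result is positive definite,
    because the normalized Laplacian is positive semidefinite."""
    lap = normalized_laplacian(adj)
    lap2 = matmul(lap, lap)
    n = len(adj)
    return [
        [theta0 * (1.0 if i == j else 0.0) + theta1 * lap[i][j] + theta2 * lap2[i][j]
         for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    path_graph = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # 3-node path
    for row in polynomial_precision(path_graph, 1.0, 0.5, 0.25):
        print([round(x, 3) for x in row])
```

Zero entries of such a matrix do not line up one-to-one with missing edges once the squared term is present, which is one reason a model of this kind differs from a plain Gaussian graphical model.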
- [13] arXiv:2410.18486 [pdf, html, other]
Title: Evolving Voices Based on Temporal Poisson Factorisation
Authors: Jan Vávra (1 and 2), Bettina Grün (1), Paul Hofmarcher (2) ((1) Vienna University of Economics and Business, (2) Paris-Lodron University of Salzburg)
Comments: main paper: 19 pages (2 single figures, 3 double figures, 3 tables); appendix: 9 pages (3 quadruple figures, 1 table); references: 3 pages
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)
The world is evolving and so is the vocabulary used to discuss topics in speech. Analysing political speech data from more than 30 years requires the use of flexible topic models to uncover the latent topics and their change in prevalence over time as well as the change in the vocabulary of the topics. We propose the temporal Poisson factorisation (TPF) model as an extension to the Poisson factorisation model to model sparse count data matrices obtained based on the bag-of-words assumption from text documents with time stamps. We discuss and empirically compare different model specifications for the time-varying latent variables consisting either of a flexible auto-regressive structure of order one or a random walk. Estimation is based on variational inference where we consider a combination of coordinate ascent updates with automatic differentiation using batching of documents. Suitable variational families are proposed to ease inference. We compare results obtained using independent univariate variational distributions for the time-varying latent variables to those obtained with a multivariate variant. We discuss in detail the results of the TPF model when analysing speeches from 18 sessions in the U.S. Senate (1981-2016).
- [14] arXiv:2410.18688 [pdf, html, other]
Title: Multiple imputation and full law identifiability
Subjects: Statistics Theory (math.ST)
The key problems in missing data models involve the identifiability of two distributions: the target law and the full law. The target law refers to the joint distribution of the data variables, while the full law refers to the joint distribution of both the data variables and the response indicators. It has not been clearly stated how identifiability of the target law and the full law relate to multiple imputation. We show that imputations can be drawn from the correct conditional distributions if and only if the full law is identifiable. This result means that direct application of multiple imputation may not be the method of choice in cases where the target law is identifiable but the full law is not.
- [15] arXiv:2410.18692 [pdf, html, other]
Title: Equity in the Distribution of Regulatory PM2.5 Monitors
Comments: 53 pages, 7 figures
Subjects: Applications (stat.AP)
Unequal exposure to air pollution by race and socioeconomic status is well-documented in the U.S. However, there has been relatively little research on inequities in the collection of PM2.5 data, creating a critical gap in understanding which neighborhood exposures are represented in these datasets. In this study we use multilevel models with random intercepts by county and state, stratified by urbanicity, to investigate the association between six key environmental justice (EJ) attributes (%AIAN, %Asian, %Black, %Hispanic, %White, %Poverty) and proximity to the nearest regulatory monitor at the census tract-level across the contiguous 48 states. We also separately stratify our models by EPA region. Our results show that most EJ attributes exhibit weak or statistically insignificant associations with monitor proximity, except in rural areas where higher poverty levels are significantly linked to greater monitor distances ($\beta$ = 0.6, 95%CI = [0.49, 0.71]). While the US EPA's siting criteria may be effective in ensuring equitable monitor distribution in some contexts, the low density of monitors in rural areas may impact the accuracy of national-level air pollution monitoring.
- [16] arXiv:2410.18696 [pdf, html, other]
Title: Latent Functional PARAFAC for modeling multidimensional longitudinal data
Subjects: Methodology (stat.ME)
In numerous settings, it is increasingly common to deal with longitudinal data organized as high-dimensional multi-dimensional arrays, also known as tensors. Within this framework, the time-continuous property of longitudinal data often implies a smooth functional structure on one of the tensor modes. To help researchers investigate such data, we introduce a new tensor decomposition approach based on the CANDECOMP/PARAFAC decomposition. Our approach allows for representing a high-dimensional functional tensor as a low-dimensional set of functions and feature matrices. Furthermore, to capture the underlying randomness of the statistical setting more efficiently, we introduce a probabilistic latent model in the decomposition. A covariance-based block-relaxation algorithm is derived to obtain estimates of model parameters. Thanks to the covariance formulation of the solving procedure and thanks to the probabilistic modeling, the method can be used in sparse and irregular sampling schemes, making it applicable in numerous settings. We apply our approach to help characterize multiple neurocognitive scores observed over time in the Alzheimer's Disease Neuroimaging Initiative (ADNI) study. Finally, intensive simulations show a notable advantage of our method in reconstructing tensors.
- [17] arXiv:2410.18726 [pdf, html, other]
Title: Limit Theorems for the Symbolic Correlation Integral and the Renyi-2 Entropy under Short-range Dependence
Comments: 36 pages, 1 figure, 3 tables
Subjects: Statistics Theory (math.ST); Probability (math.PR)
The symbolic correlation integral provides a way to measure the complexity of time series and dynamical systems. In the present article we prove limit results for an estimator of this quantity which is based on U-statistics under the assumption of short-range dependence. To this end, we slightly generalize classical limit results in the framework of 1-approximating functionals. Furthermore, we carefully analyze the limit variance. A simulation study with ARMA and ARCH time series as well as a real world data example are also provided. In the latter we show how our method could be used to analyze EEG data in the context of epileptic seizures.
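As a rough illustration of a U-statistic estimator of this kind, the sketch below assumes the common definition of the symbolic correlation integral as the probability that two ordinal patterns of length `m` coincide, with the Rényi-2 entropy taken as its negative logarithm. The abstract does not spell out these definitions, so treat them, and the function names, as assumptions rather than the authors' estimator.

```python
import math
from collections import Counter


def ordinal_pattern(window):
    """Ordinal pattern of a window: the index order that sorts its values
    (ties are broken by position, since Python's sort is stable)."""
    return tuple(sorted(range(len(window)), key=lambda i: window[i]))


def symbolic_correlation_integral(series, m):
    """Fraction of pairs of length-m windows that share an ordinal pattern,
    i.e. a U-statistic with kernel 1{pattern_i == pattern_j}."""
    patterns = [ordinal_pattern(series[i:i + m])
                for i in range(len(series) - m + 1)]
    n = len(patterns)
    counts = Counter(patterns)
    matching_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return matching_pairs / (n * (n - 1) / 2)


def renyi2_entropy(series, m):
    """Negative log of the estimate above; math.log raises ValueError
    if no two windows share a pattern."""
    return -math.log(symbolic_correlation_integral(series, m))


if __name__ == "__main__":
    zigzag = [1, 3, 2, 4, 3, 5]
    print(symbolic_correlation_integral(zigzag, 2))
    print(renyi2_entropy(zigzag, 2))
```

Counting matching pairs through pattern frequencies avoids the quadratic loop over all window pairs while computing the same quantity.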
- [18] arXiv:2410.18730 [pdf, other]
Title: A Nonparametric Clustering Stopping Rule Based on Spatial Median
Subjects: Computation (stat.CO)
In this work, we introduce a nonparametric clustering stopping rule algorithm based on the spatial median. Our proposed method aims to achieve the balance between the homogeneity within the clusters and the heterogeneity between clusters. The proposed algorithm maximises the ratio of the variation between clusters and the variation within clusters while adjusting for the number of clusters and number of observations. The proposed algorithm is robust against distributional assumptions and the presence of outliers. Simulations have been used to validate the algorithm. We further evaluated the stability and the efficacy of the proposed algorithm using three real-world datasets. Moreover, we compared the performance of our model with 13 other traditional algorithms for determining the number of clusters. We found that the proposed algorithm outperformed 11 of the algorithms considered for comparison in terms of clustering number determination. The finding demonstrates that the proposed method provides a reliable alternative to determine the number of clusters for multivariate data.
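The abstract does not give the stopping rule itself, so the sketch below only illustrates its building block: the spatial (geometric) median, computed here with the standard Weiszfeld iteration. The function name, tolerance, and iteration cap are illustrative choices, not the authors' implementation.

```python
import math


def spatial_median(points, tol=1e-10, max_iter=1000):
    """Weiszfeld iteration for the point minimizing the sum of Euclidean
    distances to `points` (a list of equal-length coordinate tuples)."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    current = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(max_iter):
        # Inverse-distance weights; the floor guards against division by
        # zero if the iterate lands exactly on a data point.
        weights = [1.0 / max(math.dist(current, p), 1e-12) for p in points]
        total = sum(weights)
        updated = [sum(w * p[k] for w, p in zip(weights, points)) / total
                   for k in range(dim)]
        if math.dist(updated, current) < tol:
            return updated
        current = updated
    return current


if __name__ == "__main__":
    # Four corners of the unit square plus one distant outlier.
    pts = [(0, 0), (1, 0), (0, 1), (1, 1), (100, 100)]
    mean = [sum(p[k] for p in pts) / len(pts) for k in range(2)]
    print("mean:", mean)
    print("spatial median:", spatial_median(pts))
```

In this toy example the mean is dragged far outside the unit square by the single outlier while the spatial median stays inside it, which is the kind of robustness the abstract refers to.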
- [19] arXiv:2410.18734 [pdf, html, other]
Title: Response Surface Designs for Crossed and Nested Multi-Stratum Structures
Comments: Submitted to Technometrics, 43 pages, 4 figures
Subjects: Methodology (stat.ME)
Response surface designs are usually described as being run under complete randomization of the treatment combinations to the experimental units. In practice, however, it is often necessary or beneficial to run them under some kind of restriction to the randomization, leading to multi-stratum designs. In particular, some factors are often hard to set, so they cannot have their levels reset for each experimental unit. This paper presents a general solution to designing response surface experiments in any multi-stratum structure made up of crossing and/or nesting of unit factors. A stratum-by-stratum approach to constructing designs using compound optimal design criteria is used and illustrated. It is shown that good designs can be found even for large experiments in complex structures.
- [20] arXiv:2410.18833 [pdf, html, other]
Title: Adaptive reduced tempering For Bayesian inverse problems and rare event simulation
Subjects: Computation (stat.CO)
This work proposes an adaptive sequential Monte Carlo sampling algorithm for solving Bayesian inverse problems in a context where a (costly) likelihood evaluation can be approximated by a surrogate, constructed from previous evaluations of the true likelihood. A rough error estimation of the obtained surrogates is required. The method is based on an adaptive sequential Monte Carlo (SMC) simulation that jointly adapts the likelihood approximations and a standard tempering scheme of the target posterior distribution. This algorithm is well-suited to cases where the posterior is concentrated in a rare and unknown region of the prior. It is also suitable for solving low-temperature and rare-event simulation problems. The main contribution is to propose an entropy criterion that associates a maximum inverse temperature for the likelihood approximation with the accuracy of the current surrogate. The latter is used to sample a so-called snapshot, perform an exact likelihood evaluation, and update the surrogate and its error quantification. Some consistency results are presented in an idealized framework of the proposed algorithm. Our numerical experiments use in particular a reduced basis approach to construct approximate parametric solutions of a partially observed solution of an elliptic Partial Differential Equation. They demonstrate the convergence of the algorithm and show a significant cost reduction (close to a factor $10$) for comparable accuracy.
- [21] arXiv:2410.18837 [pdf, html, other]
Title: High-dimensional Analysis of Knowledge Distillation: Weak-to-Strong Generalization and Scaling Laws
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
A growing number of machine learning scenarios rely on knowledge distillation where one uses the output of a surrogate model as labels to supervise the training of a target model. In this work, we provide a sharp characterization of this process for ridgeless, high-dimensional regression, under two settings: (i) model shift, where the surrogate model is arbitrary, and (ii) distribution shift, where the surrogate model is the solution of empirical risk minimization with out-of-distribution data. In both cases, we characterize the precise risk of the target model through non-asymptotic bounds in terms of sample size and data distribution under mild conditions. As a consequence, we identify the form of the optimal surrogate model, which reveals the benefits and limitations of discarding weak features in a data-dependent fashion. In the context of weak-to-strong (W2S) generalization, this has the interpretation that (i) W2S training, with the surrogate as the weak model, can provably outperform training with strong labels under the same data budget, but (ii) it is unable to improve the data scaling law. We validate our results on numerical experiments both on ridgeless regression and on neural network architectures.
- [22] arXiv:2410.18880 [pdf, html, other]
Title: Can we spot a fake?
Comments: 16 pages
Subjects: Statistics Theory (math.ST); Probability (math.PR)
The problem of detecting fake data inspires the following seemingly simple mathematical question. Sample a data point $X$ from the standard normal distribution in $\mathbb{R}^n$. An adversary observes $X$ and corrupts it by adding a vector $rt$, where they can choose any vector $t$ from a fixed set $T$ of the adversary's "tricks", and where $r>0$ is a fixed radius. The adversary's choice of $t=t(X)$ may depend on the true data $X$. The adversary wants to hide the corruption by making the fake data $X+rt$ statistically indistinguishable from the real data $X$. What is the largest radius $r=r(T)$ for which the adversary can create an undetectable fake? We show that for highly symmetric sets $T$, the detectability radius $r(T)$ is approximately twice the scaled Gaussian width of $T$. The upper bound actually holds for arbitrary sets $T$ and generalizes to arbitrary, non-Gaussian distributions of real data $X$. The lower bound may fail for not highly symmetric $T$, but we conjecture that this problem can be solved by considering the focused version of the Gaussian width of $T$, which focuses on the most important directions of $T$.
- [23] arXiv:2410.18918 [pdf, html, other]
Title: MissNODAG: Differentiable Cyclic Causal Graph Learning from Incomplete Data
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Causal discovery in real-world systems, such as biological networks, is often complicated by feedback loops and incomplete data. Standard algorithms, which assume acyclic structures or fully observed data, struggle with these challenges. To address this gap, we propose MissNODAG, a differentiable framework for learning both the underlying cyclic causal graph and the missingness mechanism from partially observed data, including data missing not at random. Our framework integrates an additive noise model with an expectation-maximization procedure, alternating between imputing missing values and optimizing the observed data likelihood, to uncover both the cyclic structures and the missingness mechanism. We demonstrate the effectiveness of MissNODAG through synthetic experiments and an application to real-world gene perturbation data.
- [24] arXiv:2410.18929 [pdf, html, other]
Title: AutoStep: Locally adaptive involutive MCMC
Subjects: Computation (stat.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)
Many common Markov chain Monte Carlo (MCMC) kernels can be formulated using a deterministic involutive proposal with a step size parameter. Selecting an appropriate step size is often a challenging task in practice; and for complex multiscale targets, there may not be one choice of step size that works well globally. In this work, we address this problem with a novel class of involutive MCMC methods -- AutoStep MCMC -- that selects an appropriate step size at each iteration adapted to the local geometry of the target distribution. We prove that AutoStep MCMC is $\pi$-invariant and has other desirable properties under mild assumptions on the target distribution $\pi$ and involutive proposal. Empirical results examine the effect of various step size selection design choices, and show that AutoStep MCMC is competitive with state-of-the-art methods in terms of effective sample size per unit cost on a range of challenging target distributions.
- [25] arXiv:2410.18938 [pdf, html, other]
-
Title: A Random Matrix Theory Perspective on the Spectrum of Learned Features and Asymptotic Generalization Capabilities
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
A key property of neural networks is their capacity to adapt to data during training. Yet, our current mathematical understanding of feature learning and its relationship to generalization remains limited. In this work, we provide a random matrix analysis of how fully-connected two-layer neural networks adapt to the target function after a single, but aggressive, gradient descent step. We rigorously establish the equivalence between the updated features and an isotropic spiked random feature model, in the limit of large batch size. For the latter model, we derive a deterministic equivalent description of the feature empirical covariance matrix in terms of certain low-dimensional operators. This allows us to sharply characterize the impact of training in the asymptotic feature spectrum, and in particular, provides a theoretical grounding for how the tails of the feature spectrum modify with training. The deterministic equivalent further yields the exact asymptotic generalization error, shedding light on the mechanisms behind its improvement in the presence of feature learning. Our result goes beyond standard random matrix ensembles, and therefore we believe it is of independent technical interest. Different from previous work, our result holds in the challenging maximal learning rate regime, is fully rigorous and allows for finitely supported second layer initialization, which turns out to be crucial for studying the functional expressivity of the learned features. This provides a sharp description of the impact of feature learning in the generalization of two-layer neural networks, beyond the random features and lazy training regimes.
- [26] arXiv:2410.18939 [pdf, html, other]
-
Title: Adaptive partition Factor Analysis
Subjects: Methodology (stat.ME); Applications (stat.AP); Other Statistics (stat.OT)
Factor Analysis has traditionally been utilized across diverse disciplines to extrapolate latent traits that influence the behavior of multivariate observed variables. Historically, the focus has been on analyzing data from a single study, neglecting the potential study-specific variations present in data from multiple studies. Multi-study factor analysis has emerged as a recent methodological advancement that addresses this gap by distinguishing between latent traits shared across studies and study-specific components arising from artifactual or population-specific sources of variation. In this paper, we extend the current methodologies by introducing novel shrinkage priors for the latent factors, thereby accommodating a broader spectrum of scenarios -- from the absence of study-specific latent factors to models in which factors pertain only to small subgroups nested within or shared between the studies. For the proposed construction we provide conditions for identifiability of factor loadings and guidelines to perform straightforward posterior computation via Gibbs sampling. Through comprehensive simulation studies, we demonstrate that our proposed method exhibits competitive performance across a variety of scenarios compared to existing methods, while providing richer insights. The practical benefits of our approach are further illustrated through applications to bird species co-occurrence data and ovarian cancer gene expression data.
- [27] arXiv:2410.18945 [pdf, html, other]
-
Title: Mosqlimate: a platform to providing automatable access to data and forecasting models for arbovirus disease
Fabiana Ganem, Luã Bida Vacaro, Eduardo Correa Araujo, Leon Diniz Alves, Leonardo Bastos, Luiz Max Carvalho, Iasmim Almeida, Asla Medeiros de Sá, Flávio Codeço Coelho
Comments: 10 pages, 2 figures, 6 tables
Subjects: Applications (stat.AP)
Dengue is a climate-sensitive mosquito-borne disease with a complex transmission dynamic. Data related to climate, environmental and sociodemographic characteristics of the target population are important for projecting scenarios. Different datasets and methodologies have been applied to build complex models for dengue forecast, stressing the need to evaluate these models and their relative accuracy grounded on a reproducible methodology. The goal of this work is to describe and present Mosqlimate, a web-based platform composed of a dashboard, a data store, model and prediction registries, and support for a community of practice in arbovirus forecasting. Multiple API endpoints give access to data for development, open registration of predictive models from different approaches and sharing of predictive models for arbovirus incidence, facilitating interaction between modellers and allowing for proper comparison of the performance of different registered models, by means of probabilistic scores. Epidemiological, entomological, climatic and sociodemographic datasets related to arboviruses in Brazil are freely available for download, alongside full documentation.
- [28] arXiv:2410.18973 [pdf, html, other]
-
Title: Tuning-free coreset Markov chain Monte Carlo
Subjects: Computation (stat.CO); Machine Learning (cs.LG)
A Bayesian coreset is a small, weighted subset of a data set that replaces the full data during inference to reduce computational cost. The state-of-the-art coreset construction algorithm, Coreset Markov chain Monte Carlo (Coreset MCMC), uses draws from an adaptive Markov chain targeting the coreset posterior to train the coreset weights via stochastic gradient optimization. However, the quality of the constructed coreset, and thus the quality of its posterior approximation, is sensitive to the stochastic optimization learning rate. In this work, we propose a learning-rate-free stochastic gradient optimization procedure, Hot-start Distance over Gradient (Hot DoG), for training coreset weights in Coreset MCMC without user tuning effort. Empirical results demonstrate that Hot DoG provides higher quality posterior approximations than other learning-rate-free stochastic gradient methods, and performs competitively with optimally-tuned ADAM.
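To make the first sentence of this abstract concrete, a coreset replaces the full-data log-likelihood $\sum_{n=1}^{N} \log p(x_n \mid \theta)$ with a weighted sum over a small subset, $\sum_{m=1}^{M} w_m \log p(x_m \mid \theta)$. The following is a minimal sketch of that substitution only. The unit-variance Gaussian model, the uniform starting weights $N/M$, and all function names are illustrative assumptions. It does not implement Coreset MCMC or Hot DoG.

```python
import numpy as np

def gaussian_loglik(x, theta):
    """Per-point log-density of N(theta, 1) evaluated at each entry of x."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2

def full_loglik(x, theta):
    """Log-likelihood of the full data set."""
    return float(gaussian_loglik(x, theta).sum())

def coreset_loglik(x, idx, weights, theta):
    """Weighted log-likelihood over the coreset points x[idx]."""
    return float(np.dot(weights, gaussian_loglik(x[idx], theta)))

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, size=1000)                 # synthetic data (assumption)
idx = rng.choice(x.size, size=20, replace=False)   # coreset of M = 20 points
weights = np.full(idx.size, x.size / idx.size)     # untrained uniform weights N/M

approx = coreset_loglik(x, idx, weights, theta=0.5)
exact = full_loglik(x, theta=0.5)
```

With uniform weights the approximation is only a crude starting point; the abstract's subject is how to train the weights so that the coreset posterior matches the full posterior more closely.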
New submissions (showing 28 of 28 entries)
- [29] arXiv:2410.18147 (cross-list from cs.LG) [pdf, html, other]
-
Title: MEC-IP: Efficient Discovery of Markov Equivalent Classes via Integer Programming
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
This paper presents a novel Integer Programming (IP) approach for discovering the Markov Equivalent Class (MEC) of Bayesian Networks (BNs) through observational data. The MEC-IP algorithm utilizes a unique clique-focusing strategy and Extended Maximal Spanning Graphs (EMSG) to streamline the search for MEC, thus overcoming the computational limitations inherent in other existing algorithms. Our numerical results show that our algorithm achieves not only a remarkable reduction in computational time but also an improvement in causal discovery accuracy across diverse datasets. These findings underscore this new algorithm's potential as a powerful tool for researchers and practitioners in causal discovery and BNSL, offering a significant leap forward toward the efficient and accurate analysis of complex data structures.
- [30] arXiv:2410.18148 (cross-list from cs.LG) [pdf, html, other]
-
Title: Deep Autoencoder with SVD-Like Convergence and Flat Minima
Comments: 14 pages
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
Representation learning for high-dimensional, complex physical systems aims to identify a low-dimensional intrinsic latent space, which is crucial for reduced-order modeling and modal analysis. To overcome the well-known Kolmogorov barrier, deep autoencoders (AEs) have been introduced in recent years, but they often suffer from poor convergence behavior as the rank of the latent space increases. To address this issue, we propose the learnable weighted hybrid autoencoder, a hybrid approach that combines the strengths of singular value decomposition (SVD) with deep autoencoders through a learnable weighted framework. We find that the introduction of learnable weighting parameters is essential - without them, the resulting model would either collapse into a standard POD or fail to exhibit the desired convergence behavior. Additionally, we empirically find that our trained model has a sharpness thousands of times smaller compared to other models. Our experiments on classical chaotic PDE systems, including the 1D Kuramoto-Sivashinsky and forced isotropic turbulence datasets, demonstrate that our approach significantly improves generalization performance compared to several competing methods, paving the way for robust representation learning of high-dimensional, complex physical systems.
- [31] arXiv:2410.18153 (cross-list from math.NA) [pdf, html, other]
-
Title: Physics-informed Neural Networks for Functional Differential Equations: Cylindrical Approximation and Its Convergence Guarantees
Comments: Accepted at NeurIPS 2024. Both authors contributed equally. Some contents are omitted due to arXiv's storage limit. Please refer to the full paper at OpenReview (NeurIPS 2024) or this https URL
Subjects: Numerical Analysis (math.NA); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); High Energy Physics - Theory (hep-th); Machine Learning (stat.ML)
We propose the first learning scheme for functional differential equations (FDEs). FDEs play a fundamental role in physics, mathematics, and optimal control. However, the numerical analysis of FDEs has faced challenges due to its unrealistic computational costs and has been a long-standing problem over decades. Thus, numerical approximations of FDEs have been developed, but they often oversimplify the solutions. To tackle these two issues, we propose a hybrid approach combining physics-informed neural networks (PINNs) with the cylindrical approximation. The cylindrical approximation expands functions and functional derivatives with an orthonormal basis and transforms FDEs into high-dimensional PDEs. To validate the reliability of the cylindrical approximation for FDE applications, we prove the convergence theorems of approximated functional derivatives and solutions. Then, the derived high-dimensional PDEs are numerically solved with PINNs. Through the capabilities of PINNs, our approach can handle a broader class of functional derivatives more efficiently than conventional discretization-based methods, improving the scalability of the cylindrical approximation. As a proof of concept, we conduct experiments on two FDEs and demonstrate that our model can successfully achieve typical $L^1$ relative error orders of PINNs $\sim 10^{-3}$. Overall, our work provides a strong backbone for physicists, mathematicians, and machine learning experts to analyze previously challenging FDEs, thereby democratizing their numerical analysis, which has received limited attention. Code is available at this https URL.
- [32] arXiv:2410.18159 (cross-list from econ.EM) [pdf, html, other]
-
Title: On the Existence of One-Sided Representations in the Generalised Dynamic Factor Model
Subjects: Econometrics (econ.EM); Statistics Theory (math.ST)
We consider the generalised dynamic factor model (GDFM) and assume that the dynamic common component is purely non-deterministic. We show that then the common shocks (and therefore the dynamic common component) can always be represented in terms of current and past observed variables. Hence, we further generalise existing results on the so called One-Sidedness problem of the GDFM. We may conclude that the existence of a one-sided representation that is causally subordinated to the observed variables is in the very nature of the GDFM and the lack of one-sidedness is an artefact of the chosen representation.
- [33] arXiv:2410.18164 (cross-list from cs.LG) [pdf, html, other]
-
Title: TabDPT: Scaling Tabular Foundation Models
Junwei Ma, Valentin Thomas, Rasa Hosseinzadeh, Hamidreza Kamkari, Alex Labach, Jesse C. Cresswell, Keyvan Golestan, Guangwei Yu, Maksims Volkovs, Anthony L. Caterini
Comments: Minimal TabDPT interface to provide predictions on new datasets available at the following link: this https URL
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
The challenges faced by neural networks on tabular data are well-documented and have hampered the progress of tabular foundation models. Techniques leveraging in-context learning (ICL) have shown promise here, allowing for dynamic adaptation to unseen data. ICL can provide predictions for entirely new datasets without further training or hyperparameter tuning, therefore providing very fast inference when encountering a novel task. However, scaling ICL for tabular data remains an issue: approaches based on large language models cannot efficiently process numeric tables, and tabular-specific techniques have not been able to effectively harness the power of real data to improve performance and generalization. We are able to overcome these challenges by training tabular-specific ICL-based architectures on real data with self-supervised learning and retrieval, combining the best of both worlds. Our resulting model -- the Tabular Discriminative Pre-trained Transformer (TabDPT) -- achieves state-of-the-art performance on the CC18 (classification) and CTR23 (regression) benchmarks with no task-specific fine-tuning, demonstrating the adaptability and speed of ICL once the model is pre-trained. TabDPT also demonstrates strong scaling as both model size and amount of available data increase, pointing towards future improvements simply through the curation of larger tabular pre-training datasets and training larger models.
- [34] arXiv:2410.18235 (cross-list from physics.soc-ph) [pdf, html, other]
-
Title: Massive Genealogies Distinguish Frontier from Steady-State Internal Migration
Comments: 23 pages of manuscript, 18 pages of supplementary material, 7 figures, 1 table
Subjects: Physics and Society (physics.soc-ph); Applications (stat.AP)
Recent studies of human migration have focused on modern issues of international economics, politics, urbanization, or commuting. Here we make use of very large anonymized genealogies which offer quantitative metrics and models before census data became available. In European and North American data from 1400 to 1950 we find two distinct patterns of lifetime migration. The steady-state pattern shows a universal power-law distribution of migration distance; by its early appearance it cannot be dependent on post-industrial technology. The frontier pattern, in contrast, is not scale-free with its much longer average distances. All migration distances are well fit by a three parameter model; the temporal and geographic patterns of the fitted parameters give new insight into American internal expansion 1620-1950. Frontier migration is also highly directional and asymmetric; gravity models do not apply. The American frontier pattern arose from the colonial-era steady-state within a generation, plateaued for three generations, then returned to a more mobile steady-state, a sequence paralleled by the Steppe migrations that brought the Bronze Age to Neolithic Europe. The transient frontier pattern is enabled by large-scale technological or numeric imbalance and geographic opportunity; when these forces abate, a new steady-state begins.
- [35] arXiv:2410.18321 (cross-list from cs.LG) [pdf, html, other]
-
Title: Calibrating Deep Neural Network using Euclidean Distance
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Uncertainty is a fundamental aspect of real-world scenarios, where perfect information is rarely available. Humans naturally develop complex internal models to navigate incomplete data and effectively respond to unforeseen or partially observed events. In machine learning, Focal Loss is commonly used to reduce misclassification rates by emphasizing hard-to-classify samples. However, it does not guarantee well-calibrated predicted probabilities and may result in models that are overconfident or underconfident. High calibration error indicates a misalignment between predicted probabilities and actual outcomes, affecting model reliability. This research introduces a novel loss function called Focal Calibration Loss (FCL), designed to improve probability calibration while retaining the advantages of Focal Loss in handling difficult samples. By minimizing the Euclidean norm through a strictly proper loss, FCL penalizes the instance-wise calibration error and constrains bounds. We provide theoretical validation for the proposed method and apply it to calibrate CheXNet for potential deployment in web-based health-care systems. Extensive evaluations on various models and datasets demonstrate that our method achieves SOTA performance in both calibration and accuracy metrics.
- [36] arXiv:2410.18396 (cross-list from cs.LG) [pdf, html, other]
-
Title: Revisiting Differentiable Structure Learning: Inconsistency of $\ell_1$ Penalty and Beyond
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Recent advances in differentiable structure learning have framed the combinatorial problem of learning directed acyclic graphs as a continuous optimization problem. Various aspects, including data standardization, have been studied to identify factors that influence the empirical performance of these methods. In this work, we investigate critical limitations in differentiable structure learning methods, focusing on settings where the true structure can be identified up to Markov equivalence classes, particularly in the linear Gaussian case. While Ng et al. (2024) highlighted potential non-convexity issues in this setting, we demonstrate and explain why the use of $\ell_1$-penalized likelihood in such cases is fundamentally inconsistent, even if the global optimum of the optimization problem can be found. To resolve this limitation, we develop a hybrid differentiable structure learning method based on $\ell_0$-penalized likelihood with hard acyclicity constraint, where the $\ell_0$ penalty can be approximated by different techniques including Gumbel-Softmax. Specifically, we first estimate the underlying moral graph, and use it to restrict the search space of the optimization problem, which helps alleviate the non-convexity issue. Experimental results show that the proposed method enhances empirical performance both before and after data standardization, providing a more reliable path for future advancements in differentiable structure learning, especially for learning Markov equivalence classes.
- [37] arXiv:2410.18404 (cross-list from cs.LG) [pdf, html, other]
-
Title: Enhancing Feature-Specific Data Protection via Bayesian Coordinate Differential Privacy
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Local Differential Privacy (LDP) offers strong privacy guarantees without requiring users to trust external parties. However, LDP applies uniform protection to all data features, including less sensitive ones, which degrades performance of downstream tasks. To overcome this limitation, we propose a Bayesian framework, Bayesian Coordinate Differential Privacy (BCDP), that enables feature-specific privacy quantification. This more nuanced approach complements LDP by adjusting privacy protection according to the sensitivity of each feature, enabling improved performance of downstream tasks without compromising privacy. We characterize the properties of BCDP and articulate its connections with standard non-Bayesian privacy frameworks. We further apply our BCDP framework to the problems of private mean estimation and ordinary least-squares regression. The BCDP-based approach obtains improved accuracy compared to a purely LDP-based approach, without compromising on privacy.
- [38] arXiv:2410.18613 (cross-list from cs.LG) [pdf, html, other]
-
Title: Rethinking Softmax: Self-Attention with Polynomial Activations
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
This paper challenges the conventional belief that softmax attention in transformers is effective primarily because it generates a probability distribution for attention allocation. Instead, we theoretically show that its success lies in its ability to implicitly regularize the Frobenius norm of the attention matrix during training. We then explore alternative activations that regularize the Frobenius norm of the attention matrix, demonstrating that certain polynomial activations can achieve this effect, making them suitable for attention-based architectures. Empirical results indicate these activations perform comparably or better than softmax across various computer vision and language tasks, suggesting new possibilities for attention mechanisms beyond softmax.
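As a small illustration of the quantity this abstract discusses, the sketch below builds an attention matrix with softmax and with an elementwise polynomial, then computes the Frobenius norm of each. The cubic activation and its $1/N$ scaling are hypothetical choices made only for illustration. The abstract does not state which polynomials the authors use, and all names here are assumptions. One fact that holds regardless: every softmax row is a probability vector, so the softmax attention matrix for $N$ tokens has Frobenius norm at most $\sqrt{N}$.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cubic(scores):
    """Hypothetical polynomial activation: elementwise cube scaled by 1/N."""
    return scores ** 3 / scores.shape[-1]

def attention_matrix(Q, K, activation):
    """Apply `activation` to the scaled dot-product scores Q K^T / sqrt(d)."""
    d = Q.shape[-1]
    return activation(Q @ K.T / np.sqrt(d))

rng = np.random.default_rng(0)
n_tokens, d = 8, 16
Q, K, V = (rng.standard_normal((n_tokens, d)) for _ in range(3))

A_soft = attention_matrix(Q, K, softmax)
A_poly = attention_matrix(Q, K, cubic)
out_soft, out_poly = A_soft @ V, A_poly @ V

fro_soft = np.linalg.norm(A_soft, "fro")
fro_poly = np.linalg.norm(A_poly, "fro")
```

Comparing `fro_soft` and `fro_poly` on random inputs says nothing about training dynamics; the sketch only shows where the activation enters and how the norm is measured.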
- [39] arXiv:2410.18784 (cross-list from cs.LG) [pdf, html, other]
-
Title: Denoising diffusion probabilistic models are optimally adaptive to unknown low dimensionality
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Numerical Analysis (math.NA); Statistics Theory (math.ST); Machine Learning (stat.ML)
The denoising diffusion probabilistic model (DDPM) has emerged as a mainstream generative model in generative AI. While sharp convergence guarantees have been established for the DDPM, the iteration complexity is, in general, proportional to the ambient data dimension, resulting in overly conservative theory that fails to explain its practical efficiency. This has motivated the recent work Li and Yan (2024a) to investigate how the DDPM can achieve sampling speed-ups through automatic exploitation of intrinsic low dimensionality of data. We strengthen this prior work by demonstrating, in some sense, optimal adaptivity to unknown low dimensionality. For a broad class of data distributions with intrinsic dimension $k$, we prove that the iteration complexity of the DDPM scales nearly linearly with $k$, which is optimal when using KL divergence to measure distributional discrepancy. Our theory is established based on a key observation: the DDPM update rule is equivalent to running a suitably parameterized SDE upon discretization, where the nonlinear component of the drift term is intrinsically low-dimensional.
- [40] arXiv:2410.18844 (cross-list from cs.LG) [pdf, html, other]
-
Title: Learning to Explore with Lagrangians for Bandits under Unknown Linear Constraints
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
Pure exploration in bandits models multiple real-world problems, such as tuning hyper-parameters or conducting user studies, where different safety, resource, and fairness constraints on the decision space naturally appear. We study these problems as pure exploration in multi-armed bandits with unknown linear constraints, where the aim is to identify an $r$-good feasible policy. First, we propose a Lagrangian relaxation of the sample complexity lower bound for pure exploration under constraints. We show how this lower bound evolves with the sequential estimation of constraints. Second, we leverage the Lagrangian lower bound and the properties of convex optimisation to propose two computationally efficient extensions of Track-and-Stop and Gamified Explorer, namely LATS and LAGEX. To this end, we propose a constraint-adaptive stopping rule, and while tracking the lower bound, use a pessimistic estimate of the feasible set at each step. We show that these algorithms achieve asymptotically optimal sample complexity upper bounds up to constraint-dependent constants. Finally, we conduct numerical experiments with different reward distributions and constraints that validate the efficient performance of LAGEX and LATS with respect to baselines.
- [41] arXiv:2410.18869 (cross-list from math.PR) [pdf, html, other]
-
Title: On the mean-field limit of diffusive games through the master equation: extreme value analysis
Comments: 34 pages including references
Subjects: Probability (math.PR); Analysis of PDEs (math.AP); Optimization and Control (math.OC); Statistics Theory (math.ST); Mathematical Finance (q-fin.MF)
We consider an $N$-player game where the players control the drifts of their diffusive states which have no interaction in the noise terms. The aim of each player is to minimize the expected value of her cost, which is a function of the player's state and the empirical measure of the states of all the players. Our aim is to determine the $N \to \infty$ asymptotic behavior of the upper order statistics of the player's states under Nash equilibrium (the Nash states). For this purpose, we consider also a system of interacting diffusions which is constructed by using the Master PDE of the game and approximates the system of the Nash states, and we improve an $L^2$ estimate for the distance between the drifts of the two systems which has been used for establishing Central Limit Theorems and Large Deviations Principles for the Nash states in the past. By differentiating the Master PDE, we obtain that estimate also in $L^{\infty}$, which allows us to control the Radon-Nikodym derivative of a Girsanov transformation that connects the two systems. The latter allows us to reduce the problem to the case of $N$ uncontrolled diffusions with standard mean-field interaction in the drifts, which has been treated in a previous work.
- [42] arXiv:2410.18959 (cross-list from cs.LG) [pdf, html, other]
-
Title: Context is Key: A Benchmark for Forecasting with Essential Textual Information
Andrew Robert Williams, Arjun Ashok, Étienne Marcotte, Valentina Zantedeschi, Jithendaraa Subramanian, Roland Riachi, James Requeima, Alexandre Lacoste, Irina Rish, Nicolas Chapados, Alexandre Drouin
Comments: Preprint; under review. First two authors contributed equally
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Forecasting is a critical task in decision making across various domains. While numerical data provides a foundation, it often lacks crucial context necessary for accurate predictions. Human forecasters frequently rely on additional information, such as background knowledge or constraints, which can be efficiently communicated through natural language. However, the ability of existing forecasting models to effectively integrate this textual information remains an open question. To address this, we introduce "Context is Key" (CiK), a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context, requiring models to integrate both modalities. We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters, and propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark. Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings. By presenting this benchmark, we aim to advance multimodal forecasting, promoting models that are both accurate and accessible to decision-makers with varied technical expertise. The benchmark can be visualized at this https URL.
Cross submissions (showing 14 of 14 entries)
- [43] arXiv:2007.02192 (replaced) [pdf, html, other]
-
Title: Tail-adaptive Bayesian shrinkage
Comments: Accepted in Electronic Journal of Statistics
Subjects: Statistics Theory (math.ST); Applications (stat.AP); Computation (stat.CO); Methodology (stat.ME); Machine Learning (stat.ML)
Robust Bayesian methods for high-dimensional regression problems under diverse sparse regimes are studied. Traditional shrinkage priors are primarily designed to detect a handful of signals from tens of thousands of predictors in the so-called ultra-sparsity domain. However, they may not perform desirably when the degree of sparsity is moderate. In this paper, we propose a robust sparse estimation method under diverse sparsity regimes, which has a tail-adaptive shrinkage property. In this property, the tail-heaviness of the prior adjusts adaptively, becoming larger or smaller as the sparsity level increases or decreases, respectively, to accommodate more or fewer signals, a posteriori. We propose a global-local-tail (GLT) Gaussian mixture distribution that ensures this property. We examine the role of the tail-index of the prior in relation to the underlying sparsity level and demonstrate that the GLT posterior contracts at the minimax optimal rate for sparse normal mean models. We apply both the GLT prior and the Horseshoe prior to a real data problem and simulation examples. Our findings indicate that the varying tail rule based on the GLT prior offers advantages over a fixed tail rule based on the Horseshoe prior in diverse sparsity regimes.
- [44] arXiv:2210.08149 (replaced) [pdf, html, other]
-
Title: Distance and Kernel-Based Measures for Global and Local Two-Sample Conditional Distribution Testing
Comments: Extensively revised version
Subjects: Methodology (stat.ME)
Testing the equality of two conditional distributions is crucial in various modern applications, including transfer learning and causal inference. Despite its importance, this fundamental problem has received surprisingly little attention in the literature. This work aims to present a unified framework based on distance and kernel methods for both global and local two-sample conditional distribution testing. To this end, we introduce distance and kernel-based measures that characterize the homogeneity of two conditional distributions. Drawing from the concept of conditional U-statistics, we propose consistent estimators for these measures. Theoretically, we derive the convergence rates and the asymptotic distributions of the estimators under both the null and alternative hypotheses. Utilizing these measures, along with a local bootstrap approach, we develop global and local tests that can detect discrepancies between two conditional distributions at global and local levels, respectively. Our tests demonstrate reliable performance through simulations and real data analyses.
- [45] arXiv:2312.15256 (replaced) [pdf, html, other]
-
Title: Adaptive Reduced Multilevel Splitting
Subjects: Computation (stat.CO); Probability (math.PR)
This paper considers the classical problem of using Monte Carlo methods to sample a target rare event distribution defined by a score function that is very expensive to compute. We assume we can build, using evaluations of the true score, an approximate surrogate score certified with error bounds. This work proposes a fully adaptive algorithm to sequentially sample surrogate rare event distributions with increasing target levels. An essential contribution consists in sampling at each iteration the surrogate rare event at a critical level corresponding to a specific cost. This cost is related to importance sampling for a target for a given budget. The critical level is calculated solely from the reduced score and its error bound. From a practical point of view, sampling the proposal sequence is performed by extending the framework of the popular adaptive multilevel splitting algorithm to the use of score approximations. Numerical experiments evaluate the proposed importance sampling algorithm in terms of computational complexity versus squared error. In particular, we investigate the performance of the algorithm when simulating rare events related to the solution of a parametric PDE, which is approximated by a reduced basis.
- [46] arXiv:2401.12031 (replaced) [pdf, html, other]
-
Title: Multi-objective optimisation using expected quantile improvement for decision making in disease outbreaks
Subjects: Methodology (stat.ME)
Optimization under uncertainty is important in many applications, particularly to inform policy and decision making in areas such as public health. A key source of uncertainty arises from the incorporation of environmental variables as inputs into computational models or simulators. Such variables represent uncontrollable features of the optimization problem and reliable decision making must account for the uncertainty they propagate to the simulator outputs. Often, multiple, competing objectives are defined from these outputs such that the final optimal decision is a compromise between different goals.
Here, we present emulation-based optimization methodology for such problems that extends expected quantile improvement (EQI) to address multi-objective optimization. Focusing on the practically important case of two objectives, we use a sequential design strategy to identify the Pareto front of optimal solutions. Uncertainty from the environmental variables is integrated out using Monte Carlo samples from the simulator. Interrogation of the expected output from the simulator is facilitated by use of (Gaussian process) emulators. The methodology is demonstrated on an optimization problem from public health involving the dispersion of anthrax spores across a spatial terrain. Environmental variables include meteorological features that impact the dispersion, and the methodology identifies the Pareto front even when there is considerable input uncertainty.
- [47] arXiv:2402.00501 (replaced) [pdf, other]
-
Title: Equivalence of the Empirical Risk Minimization to Regularization on the Family of f-Divergences
Comments: Submitted to the IEEE Symposium in Information Theory 2024. arXiv admin note: text overlap with arXiv:2306.07123
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
The solution to empirical risk minimization with $f$-divergence regularization (ERM-$f$DR) is presented under mild conditions on $f$. Under such conditions, the optimal measure is shown to be unique. Examples of the solution for particular choices of the function $f$ are presented. Previously known solutions to common regularization choices are obtained by leveraging the flexibility of the family of $f$-divergences. These include the unique solutions to empirical risk minimization with relative entropy regularization (Type-I and Type-II). The analysis of the solution unveils the following properties of $f$-divergences when used in the ERM-$f$DR problem: (i) $f$-divergence regularization forces the support of the solution to coincide with the support of the reference measure, which introduces a strong inductive bias that dominates the evidence provided by the training data; and (ii) any $f$-divergence regularization is equivalent to a different $f$-divergence regularization with an appropriate transformation of the empirical risk function.
- [48] arXiv:2402.08283 (replaced) [pdf, html, other]
-
Title: Classification Using Global and Local Mahalanobis Distances
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)
We propose a novel semiparametric classifier based on Mahalanobis distances of an observation from the competing classes. Our tool is a generalized additive model with the logistic link function that uses these distances as features to estimate the posterior probabilities of different classes. While popular parametric classifiers like linear and quadratic discriminant analyses are mainly motivated by the normality of the underlying distributions, the proposed classifier is more flexible and free from such parametric modeling assumptions. Since the densities of elliptic distributions are functions of Mahalanobis distances, this classifier works well when the competing classes are (nearly) elliptic. In such cases, it often outperforms popular nonparametric classifiers, especially when the sample size is small compared to the dimension of the data. To cope with non-elliptic and possibly multimodal distributions, we propose a local version of the Mahalanobis distance. Subsequently, we propose another classifier based on a generalized additive model that uses the local Mahalanobis distances as features. This nonparametric classifier usually performs like the Mahalanobis distance based semiparametric classifier when the underlying distributions are elliptic, but outperforms it for several non-elliptic and multimodal distributions. We also investigate the behaviour of these two classifiers in high dimension, low sample size situations. A thorough numerical study involving several simulated and real datasets demonstrates the usefulness of the proposed classifiers in comparison to many state-of-the-art methods.
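As a rough illustration of the features this abstract refers to, the sketch below computes the (global) Mahalanobis distance of an observation from each class using sample means and covariances. It is not the authors' implementation: the function names `mahalanobis_distance` and `distance_features` are hypothetical, the plain sample covariance is an assumption (it must be invertible), and neither the local version of the distance nor the generalized additive model is shown.

```python
import numpy as np

def mahalanobis_distance(x, mean, cov):
    """Mahalanobis distance of x from a distribution with the given mean and covariance."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    # Solve cov @ z = diff rather than forming the inverse explicitly.
    z = np.linalg.solve(np.asarray(cov, dtype=float), diff)
    return float(np.sqrt(diff @ z))

def distance_features(x, class_samples):
    """One distance per class, from each class's sample mean and sample covariance.

    Assumption: every class has enough observations for an invertible covariance.
    """
    features = []
    for samples in class_samples:
        samples = np.asarray(samples, dtype=float)
        mean = samples.mean(axis=0)
        cov = np.cov(samples, rowvar=False)
        features.append(mahalanobis_distance(x, mean, cov))
    return np.array(features)
```

The resulting vector of distances is the kind of input a downstream classifier could use as features; how the paper's model consumes them is described in the paper itself.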
- [49] arXiv:2402.13599 (replaced) [pdf, html, other]
-
Title: Approximation and estimation of scale functions for spectrally negative Levy processes
Subjects: Statistics Theory (math.ST)
The scale function holds significant importance within the fluctuation theory of Levy processes, particularly in addressing exit problems. However, its definition is established through the Laplace transform, thereby lacking explicit representations in general. This paper introduces a novel series representation for this scale function, employing Laguerre polynomials to construct a uniformly convergent approximate sequence. Additionally, we derive statistical inference based on specific discrete observations, presenting estimators of scale functions that are asymptotically normal.
- [50] arXiv:2403.11954 (replaced) [pdf, html, other]
-
Title: Robust Estimation and Inference for Categorical Data
Comments: 45 pages, 3 figures, 1 table
Subjects: Methodology (stat.ME); Econometrics (econ.EM); Statistics Theory (math.ST)
While there is a rich literature on robust methodologies for contamination in continuously distributed data, contamination in categorical data is largely overlooked. This is regrettable because many datasets are categorical and oftentimes suffer from contamination. Examples include inattentive responding and bot responses in questionnaires or zero-inflated count data. We propose a novel class of contamination-robust estimators of models for categorical data, coined $C$-estimators ("$C$" for categorical). We show that the countable and possibly finite sample space of categorical data results in non-standard theoretical properties. Notably, in contrast to classic robustness theory, $C$-estimators can be simultaneously robust and fully efficient at the postulated model. In addition, a certain particularly robust specification fails to be asymptotically Gaussian at the postulated model, but is asymptotically Gaussian in the presence of contamination. We furthermore propose a diagnostic test to identify categorical outliers and demonstrate the enhanced robustness of $C$-estimators in a simulation study.
- [51] arXiv:2404.16745 (replaced) [pdf, html, other]
-
Title: Statistical Inference for Covariate-Adjusted and Interpretable Generalized Factor Model with Application to Testing Fairness
Subjects: Methodology (stat.ME)
Latent variable models are popularly used to measure latent factors (e.g., abilities and personalities) from large-scale assessment data. Beyond understanding these latent factors, the covariate effect on responses controlling for latent factors is also of great scientific interest and has wide applications, such as evaluating the fairness of educational testing, where the covariate effect reflects whether a test question is biased toward certain individual characteristics (e.g., gender and race), taking into account their latent abilities. However, the large sample sizes and test lengths pose challenges to developing efficient methods and drawing valid inferences. Moreover, to accommodate the commonly encountered discrete responses, nonlinear latent factor models are often assumed, adding further complexity. To address these challenges, we consider a covariate-adjusted generalized factor model and develop novel and interpretable conditions to address the identifiability issue. Based on the identifiability conditions, we propose a joint maximum likelihood estimation method and establish estimation consistency and asymptotic normality results for the covariate effects. Furthermore, we derive estimation and inference results for latent factors and the factor loadings. We illustrate the finite sample performance of the proposed method through extensive numerical studies and an educational assessment dataset from the Programme for International Student Assessment (PISA).
- [52] arXiv:2406.17637 (replaced) [pdf, html, other]
-
Title: Nowcasting in triple-system estimation
Subjects: Methodology (stat.ME)
Multiple systems estimation uses samples that each cover part of a population to obtain a total population size estimate. Ideally, all the available samples are used, but if some samples are available (much) later, one may use only the samples that are available early. Under some regularity conditions, including sample independence, two samples are enough to obtain an asymptotically unbiased population size estimate. However, the assumption of sample independence may be unrealistic, especially when samples are derived from administrative sources. The sample independence assumption can be relaxed when three or more samples are used, which is therefore generally recommended. This may be a problem if the third sample is available much later than the first two samples. Therefore, in this paper we propose a new approach that deals with this issue by utilising older samples, using the so-called expectation maximisation algorithm. This leads to a population size nowcast estimate that is asymptotically unbiased under more relaxed assumptions than the estimate based on two samples. The resulting nowcasting model is applied to the problem of estimating the number of homeless people in the Netherlands, which leads to reasonably accurate nowcast estimates.
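For readers unfamiliar with the two-sample case mentioned in this abstract, the sketch below shows the classical Lincoln-Petersen estimator, one standard two-sample population size estimator that relies on sample independence. The abstract does not name the estimator it has in mind, so this choice is an assumption, and the function name `lincoln_petersen` is hypothetical. The paper's triple-system nowcasting model is not shown.

```python
def lincoln_petersen(n1, n2, m):
    """Classical two-sample population size estimate.

    n1: number of units observed in the first sample
    n2: number of units observed in the second sample
    m:  number of units observed in both samples
    Relies on the two samples being independent.
    """
    if m <= 0:
        raise ValueError("the estimate is undefined without any overlap between samples")
    return n1 * n2 / m

# Example: 200 units in list A, 150 in list B, 30 found in both.
estimate = lincoln_petersen(200, 150, 30)
```

If the samples are positively dependent (units on one list are more likely to be on the other), the overlap `m` is inflated and this estimate is biased downward, which is the kind of problem a third sample helps address.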
- [53] arXiv:2407.05145 (replaced) [pdf, html, other]
-
Title: On high-dimensional modifications of the nearest neighbor classifier
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The nearest neighbor classifier is arguably the simplest and most popular nonparametric classifier available in the literature. However, due to the concentration of pairwise distances and the violation of the neighborhood structure, this classifier often suffers in high-dimension, low-sample size (HDLSS) situations, especially when the scale difference between the competing classes dominates their location difference. Several attempts have been made in the literature to address this problem. In this article, we discuss some of these existing methods and propose some new ones. We carry out some theoretical investigations in this regard and analyze several simulated and benchmark datasets to compare the empirical performance of the proposed methods with that of some of the existing ones.
- [54] arXiv:2407.18835 (replaced) [pdf, html, other]
-
Title: Robust Estimation of Polychoric Correlation
Comments: 50 pages (30 main text), 13 figures (8 main text), 9 tables (4 main text). Changes to v1: updated notation
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Applications (stat.AP); Other Statistics (stat.OT)
Polychoric correlation is often an important building block in the analysis of rating data, particularly for structural equation models. However, the commonly employed maximum likelihood (ML) estimator is highly susceptible to misspecification of the polychoric correlation model, for instance through violations of latent normality assumptions. We propose a novel estimator that is designed to be robust to partial misspecification of the polychoric model, that is, the model is only misspecified for an unknown fraction of observations, for instance (but not limited to) careless respondents. In contrast to existing literature, our estimator makes no assumption on the type or degree of model misspecification. It furthermore generalizes ML estimation and is consistent as well as asymptotically normally distributed. We demonstrate the robustness and practical usefulness of our estimator in simulation studies and an empirical application on a Big Five administration. In the latter, the polychoric correlation estimates of our estimator and ML differ substantially, which, after further inspection, is likely due to the presence of careless respondents that the estimator helps identify.
- [55] arXiv:2407.20051 (replaced) [pdf, html, other]
-
Title: Estimating risk factors for pathogenic dose accrual from longitudinal data
Subjects: Methodology (stat.ME)
Estimating risk factors for incidence of a disease is crucial for understanding its etiology. For diseases caused by enteric pathogens, off-the-shelf statistical model-based approaches do not consider the biological mechanisms through which infection occurs and thus can only be used to make comparatively weak statements about association between risk factors and incidence. Building on established work in quantitative microbiological risk assessment, we propose a new approach to determining the association between risk factors and dose accrual rates. Our more mechanistic approach achieves a higher degree of biological plausibility, incorporates currently-ignored sources of variability, and provides regression parameters that are easily interpretable as the dose accrual rate ratio due to changes in the risk factors under study. We also describe a method for leveraging information across multiple pathogens. The proposed methods are available as an R package at this https URL. Our simulation study shows unacceptable coverage rates from generalized linear models, while the proposed approach empirically maintains the nominal rate even when the model is misspecified. Finally, we demonstrate our proposed approach by applying it to infant data obtained through the PATHOME study (this https URL), discovering the impact of various environmental factors on infant enteric infections.
- [56] arXiv:2409.12592 (replaced) [pdf, html, other]
-
Title: Choice of the hypothesis matrix for using the Anova-type-statistic
Subjects: Methodology (stat.ME)
Initially developed in Brunner et al. (1997), the Anova-type-statistic (ATS) is one of the most widely used quadratic forms for testing multivariate hypotheses for a variety of different parameter vectors $\boldsymbol{\theta}\in\mathbb{R}^d$. Such tests can be based on several versions of ATS and in most settings, they are preferable to those based on other quadratic forms, such as the Wald-type-statistic (WTS). However, the same null hypothesis $\boldsymbol{H}\boldsymbol{\theta}=\boldsymbol{y}$ can be expressed by a multitude of hypothesis matrices $\boldsymbol{H}\in\mathbb{R}^{m\times d}$ and corresponding vectors $\boldsymbol{y}\in\mathbb{R}^m$, which leads to different values of the test statistic, as can be seen in simple examples. Since this can entail distinct test decisions, it remains to investigate under which conditions tests using different hypothesis matrices coincide. Here, the dimensions of the different hypothesis matrices can be substantially different, which has exceptional potential to save computational effort.
In this manuscript, we show that for the Anova-type-statistic and some versions thereof, it is possible for each hypothesis $\boldsymbol{H}\boldsymbol{\theta}=\boldsymbol{y}$ to construct a companion matrix $\boldsymbol{L}$ with a minimal number of rows, which not only tests the same hypothesis but also always yields the same test decisions. This allows a substantial reduction of computation time, which is investigated in several simulations.
- [57] arXiv:2410.08488 (replaced) [pdf, html, other]
-
Title: Fractional binomial regression model for count data with excess zeros
Subjects: Methodology (stat.ME)
This paper proposes a new generalized linear model with fractional binomial distribution.
Zero-inflated Poisson/negative binomial distributions are used for count data with many zeros. To analyze the association of such a count variable with covariates, zero-inflated Poisson/negative binomial regression models are widely used. In this work, we develop a regression model with the fractional binomial distribution that can serve as an additional tool for modeling the count response variable with covariates. Data analysis results show that on some occasions, our model outperforms the existing zero-inflated regression models.
- [58] arXiv:2410.13522 (replaced) [pdf, other]
-
Title: Fair comparisons of causal parameters with many treatments and positivity violations
Subjects: Methodology (stat.ME); Applications (stat.AP)
Comparing outcomes across treatments is essential in medicine and public policy. To do so, researchers typically estimate a set of parameters, possibly counterfactual, with each targeting a different treatment. Treatment-specific means (TSMs) are commonly used, but their identification requires a positivity assumption -- that every subject has a non-zero probability of receiving each treatment. This assumption is often implausible, especially when treatment can take many values. Causal parameters based on dynamic stochastic interventions can be robust to positivity violations. However, comparing these parameters may be unfair because they may depend on outcomes under non-target treatments. To address this, and clarify when fair comparisons are possible, we propose a fairness criterion: if the conditional TSM for one treatment is greater than that for another, then the corresponding causal parameter should also be greater. We derive two intuitive properties equivalent to this criterion and show that only a mild positivity assumption is needed to identify fair parameters. We then provide examples that satisfy this criterion and are identifiable under the milder positivity assumption. These parameters are non-smooth, making standard nonparametric efficiency theory inapplicable, so we propose smooth approximations of them. We then develop doubly robust-style estimators that attain parametric convergence rates under nonparametric conditions. We illustrate our methods with an analysis of dialysis providers in New York State.
- [59] arXiv:2410.14055 (replaced) [pdf, html, other]
-
Title: Feedback Schrödinger Bridge Matching
Authors: Panagiotis Theodoropoulos, Nikolaos Komianos, Vincent Pacelli, Guan-Horng Liu, Evangelos A. Theodorou
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Recent advancements in diffusion bridges for distribution transport problems have heavily relied on matching frameworks, yet existing methods often face a trade-off between scalability and access to optimal pairings during training. Fully unsupervised methods make minimal assumptions but incur high computational costs, limiting their practicality. On the other hand, imposing full supervision of the matching process with optimal pairings improves scalability; however, it can be infeasible in many applications. To strike a balance between scalability and minimal supervision, we introduce Feedback Schrödinger Bridge Matching (FSBM), a novel semi-supervised matching framework that incorporates a small portion (less than 8% of the entire dataset) of pre-aligned pairs as state feedback to guide the transport map of non-coupled samples, thereby significantly improving efficiency. This is achieved by formulating a static Entropic Optimal Transport (EOT) problem with an additional term capturing the semi-supervised guidance. The generalized EOT objective is then recast into a dynamic formulation to leverage the scalability of matching frameworks. Extensive experiments demonstrate that FSBM accelerates training and enhances generalization by leveraging coupled pairs guidance, opening new avenues for training matching frameworks with partially aligned datasets.
- [60] arXiv:2410.16419 (replaced) [pdf, html, other]
-
Title: Data Augmentation of Multivariate Sensor Time Series using Autoregressive Models and Application to Failure Prognostics
Comments: PREPRINT of paper to appear at 2024 Conference of PHM Society
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
This work presents a novel data augmentation solution for non-stationary multivariate time series and its application to failure prognostics. The method extends previous work by the authors, which is based on time-varying autoregressive processes. It can be employed to extract key information from a limited number of samples and generate new synthetic samples in a way that potentially improves the performance of PHM solutions. This is especially valuable in situations of data scarcity, which are very common in PHM, especially for failure prognostics. The proposed approach is tested on the CMAPSS dataset, commonly employed for prognostics experiments and benchmarks. An AutoML approach from PHM literature is employed for automating the design of the prognostics solution. The empirical evaluation provides evidence that the proposed method can substantially improve the performance of PHM solutions.
- [61] arXiv:2410.16806 (replaced) [pdf, html, other]
-
Title: Simplified vine copula models: state of science and affairs
Subjects: Methodology (stat.ME)
Vine copula models have become highly popular practical tools for modeling multivariate dependencies. To maintain tractability, a commonly employed simplifying assumption is that conditional copulas remain unchanged by the conditioning variables. This assumption has sparked a somewhat polarizing debate within the copula community. The fact that much of this dispute occurs outside the public record has placed the field in an unfortunate position, impeding scientific progress. In this article, I will review what we know about the flexibility and limitations of simplified vine copula models, explore the broader implications, and offer my own, hopefully reconciling, perspective on the issue.
- [62] arXiv:2202.04146 (replaced) [pdf, html, other]
-
Title: A Neural Phillips Curve and a Deep Output Gap
Subjects: Econometrics (econ.EM); Applications (stat.AP); Machine Learning (stat.ML)
Many problems plague empirical Phillips curves (PCs). Among them is the hurdle that the two key components, inflation expectations and the output gap, are both unobserved. Traditional remedies include proxying for the absentees or extracting them via assumptions-heavy filtering procedures. I propose an alternative route: a Hemisphere Neural Network (HNN) whose architecture yields a final layer where components can be interpreted as latent states within a Neural PC. There are benefits. First, HNN conducts the supervised estimation of nonlinearities that arise when translating a high-dimensional set of observed regressors into latent states. Second, forecasts are economically interpretable. Among other findings, the contribution of real activity to inflation appears understated in traditional PCs. In contrast, HNN captures the 2021 upswing in inflation and attributes it to a large positive output gap starting from late 2020. The unique path of HNN's gap comes from dispensing with unemployment and GDP in favor of an amalgam of nonlinearly processed alternative tightness indicators.
- [63] arXiv:2305.16368 (replaced) [pdf, html, other]
-
Title: Neural incomplete factorization: learning preconditioners for the conjugate gradient method
Comments: 26 pages, 8 figures, accepted in Transactions on Machine Learning Research (TMLR)
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
The convergence of the conjugate gradient method for solving large-scale and sparse linear equation systems depends on the spectral properties of the system matrix, which can be improved by preconditioning. In this paper, we develop a computationally efficient data-driven approach to accelerate the generation of effective preconditioners. We, therefore, replace the typically hand-engineered preconditioners by the output of graph neural networks. Our method generates an incomplete factorization of the matrix and is, therefore, referred to as neural incomplete factorization (NeuralIF). Optimizing the condition number of the linear system directly is computationally infeasible. Instead, we utilize a stochastic approximation of the Frobenius loss which only requires matrix-vector multiplications for efficient training. At the core of our method is a novel message-passing block, inspired by sparse matrix theory, that aligns with the objective of finding a sparse factorization of the matrix. We evaluate our proposed method on both synthetic problem instances and on problems arising from the discretization of the Poisson equation on varying domains. Our experiments show that by using data-driven preconditioners within the conjugate gradient method we are able to speed up the convergence of the iterative procedure. The code is available at this https URL.
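To make the role of a preconditioner in this abstract concrete, the sketch below implements the textbook preconditioned conjugate gradient iteration for a symmetric positive definite system, with a simple diagonal (Jacobi) preconditioner standing in for a learned one. This is not the NeuralIF method: the function name `preconditioned_cg`, the callback `apply_preconditioner`, and the Jacobi choice are all assumptions made for illustration.

```python
import numpy as np

def preconditioned_cg(A, b, apply_preconditioner, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.

    apply_preconditioner(r) should return an approximation of A^{-1} r;
    a learned incomplete factorization would be plugged in here.
    Returns the approximate solution and the number of iterations used.
    """
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x                     # initial residual
    z = apply_preconditioner(r)       # preconditioned residual
    p = z.copy()                      # first search direction
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for k in range(max_iter):
        if np.linalg.norm(r) <= tol * b_norm:
            return x, k
        Ap = A @ p
        alpha = rz / (p @ Ap)         # step length along p
        x += alpha * p
        r -= alpha * Ap
        z = apply_preconditioner(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p     # new A-conjugate direction
        rz = rz_new
    return x, max_iter

# Usage with a Jacobi preconditioner (division by the diagonal of A).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x, iterations = preconditioned_cg(A, b, lambda r: r / np.diag(A))
```

Only matrix-vector products and applications of the preconditioner are needed, which is why a cheap but effective preconditioner can reduce the total solve time.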
- [64] arXiv:2305.16446 (replaced) [pdf, html, other]
-
Title: The Representation Jensen-Shannon Divergence
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
Quantifying the difference between probability distributions is crucial in machine learning. However, estimating statistical divergences from empirical samples is challenging due to unknown underlying distributions. This work proposes the representation Jensen-Shannon divergence (RJSD), a novel measure inspired by the traditional Jensen-Shannon divergence. Our approach embeds data into a reproducing kernel Hilbert space (RKHS), representing distributions through uncentered covariance operators. We then compute the Jensen-Shannon divergence between these operators, thereby establishing a proper divergence measure between probability distributions in the input space. We provide estimators based on kernel matrices and empirical covariance matrices using Fourier features. Theoretical analysis reveals that RJSD is a lower bound on the Jensen-Shannon divergence, enabling variational estimation. Additionally, we show that RJSD is a higher-order extension of the maximum mean discrepancy (MMD), providing a more sensitive measure of distributional differences. Our experimental results demonstrate RJSD's superiority in two-sample testing, distribution shift detection, and unsupervised domain adaptation, outperforming state-of-the-art techniques. RJSD's versatility and effectiveness make it a promising tool for machine learning research and applications.
- [65] arXiv:2307.02275 (replaced) [pdf, other]
-
Title: Convolutions and More as Einsum: A Tensor Network Perspective with Advances for Second-Order Methods
Comments: 10 pages main text + appendix, conference version
Journal-ref: Advances in Neural Information Processing Systems (NeurIPS) 2024
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Despite their simple intuition, convolutions are more tedious to analyze than dense layers, which complicates the transfer of theoretical and algorithmic ideas to convolutions. We simplify convolutions by viewing them as tensor networks (TNs) that allow reasoning about the underlying tensor multiplications by drawing diagrams, manipulating them to perform function transformations like differentiation, and efficiently evaluating them with einsum. To demonstrate their simplicity and expressiveness, we derive diagrams of various autodiff operations and popular curvature approximations with full hyper-parameter support, batching, channel groups, and generalization to any convolution dimension. Further, we provide convolution-specific transformations based on the connectivity pattern which allow diagrams to be simplified before evaluation. Finally, we probe performance. Our TN implementation accelerates a recently-proposed KFAC variant up to 4.5x while removing the standard implementation's memory overhead, and enables new hardware-efficient tensor dropout for approximate backpropagation.
- [66] arXiv:2311.00109 (replaced) [pdf, html, other]
-
Title: FairWASP: Fast and Optimal Fair Wasserstein Pre-processing
Comments: AAAI 2024, 15 pages, 4 figures, 1 table
Journal-ref: Proceedings of the AAAI Conference on Artificial Intelligence, 38(14), 16120-16128, 2024
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Recent years have seen a surge of machine learning approaches aimed at reducing disparities in model outputs across different subgroups. In many settings, training data may be used in multiple downstream applications by different users, which means it may be most effective to intervene on the training data itself. In this work, we present FairWASP, a novel pre-processing approach designed to reduce disparities in classification datasets without modifying the original data. FairWASP returns sample-level weights such that the reweighted dataset minimizes the Wasserstein distance to the original dataset while satisfying (an empirical version of) demographic parity, a popular fairness criterion. We show theoretically that integer weights are optimal, which means our method can be equivalently understood as duplicating or eliminating samples. FairWASP can therefore be used to construct datasets which can be fed into any classification method, not just methods which accept sample weights. Our work is based on reformulating the pre-processing task as a large-scale mixed-integer program (MIP), for which we propose a highly efficient algorithm based on the cutting plane method. Experiments demonstrate that our proposed optimization algorithm significantly outperforms state-of-the-art commercial solvers in solving both the MIP and its linear program relaxation. Further experiments highlight the competitive performance of FairWASP in reducing disparities while preserving accuracy in downstream classification settings.
- [67] arXiv:2312.00626 (replaced) [pdf, html, other]
-
Title: Forecasting trends in food security with real time data
Authors: Joschka Herteux, Christoph Räth, Giulia Martini, Amine Baha, Kyriacos Koupparis, Ilaria Lauzana, Duccio Piovani
Comments: 19 pages, 7 figures + supplementary material
Journal-ref: Commun Earth Environ 5, 611 (2024)
Subjects: Machine Learning (cs.LG); Physics and Society (physics.soc-ph); Machine Learning (stat.ML)
Early warning systems are an essential tool for effective humanitarian action. Advance warnings on impending disasters facilitate a timely and targeted response, which helps save lives and livelihoods. In this work we present a quantitative methodology to forecast levels of food consumption for 60 consecutive days, at the sub-national level, in four countries: Mali, Nigeria, Syria, and Yemen. The methodology is built on publicly available data from the World Food Programme's global hunger monitoring system, which collects, processes, and displays daily updates on key food security metrics, conflict, weather events, and other drivers of food insecurity. In this study we assessed the performance of various models including Autoregressive Integrated Moving Average (ARIMA), Extreme Gradient Boosting (XGBoost), Long Short Term Memory (LSTM) Network, Convolutional Neural Network (CNN), and Reservoir Computing (RC), by comparing their Root Mean Squared Error (RMSE) metrics. Our findings highlight Reservoir Computing as a particularly well-suited model in the field of food security given both its notable resistance to over-fitting on limited data samples and its efficient training capabilities. The methodology we introduce establishes the groundwork for a global, data-driven early warning system designed to anticipate and detect food insecurity.
- [68] arXiv:2312.12608 (replaced) [pdf, html, other]
-
Title: Rethinking Randomized Smoothing from the Perspective of Scalability
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Machine learning models have demonstrated remarkable success across diverse domains but remain vulnerable to adversarial attacks. Empirical defense mechanisms often fail because new attacks constantly emerge and render existing defenses obsolete, which has shifted the focus to certification-based defenses. Among these, randomized smoothing has emerged as a promising technique. This study reviews the theoretical foundations and empirical effectiveness of randomized smoothing and its derivatives in verifying machine learning classifiers from the perspective of scalability. We provide an in-depth exploration of the fundamental concepts underlying randomized smoothing, highlight its theoretical guarantees in certifying robustness against adversarial perturbations, and discuss the challenges of existing methodologies.
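For readers unfamiliar with the kind of guarantee involved, the sketch below computes the widely used L2 certificate for Gaussian randomized smoothing: with noise level sigma and a lower bound p_A on the top-class probability under noise, the certified radius is sigma times the standard normal quantile of p_A. This is an assumption about which certificate is meant, since the abstract does not state a formula; the simplified form shown bounds the runner-up probability by 1 - p_A, and the function name is hypothetical.

```python
from statistics import NormalDist


def certified_l2_radius(p_a_lower, sigma):
    """Certified L2 radius for a Gaussian-smoothed classifier.

    p_a_lower: lower confidence bound on the top-class probability
               under Gaussian input noise (assumed to be given).
    sigma:     standard deviation of the smoothing noise.
    """
    if not 0.0 < p_a_lower < 1.0:
        raise ValueError("p_a_lower must lie strictly between 0 and 1")
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    if p_a_lower <= 0.5:
        # No majority class can be certified; report a zero radius.
        return 0.0
    return sigma * NormalDist().inv_cdf(p_a_lower)


radius_weak = certified_l2_radius(0.90, sigma=0.5)
radius_strong = certified_l2_radius(0.99, sigma=0.5)
```

The radius grows with both the confidence bound and the noise level, which hints at the scalability tension: larger sigma certifies more but typically costs accuracy, and tight bounds on p_A need many noisy forward passes.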
- [69] arXiv:2401.03281 (replaced) [pdf, html, other]
-
Title: Statistical Response of ENSO Complexity to Initial Condition and Model Parameter Perturbations
Comments: This is the final revised version. 54 pages, 11 figures, 2 tables (1 in main text and 1 in the appendix). This Work has been published in AMS' Journal of Climate (https://doi.org/10.1175/JCLI-D-24-0017.1). For more info see this https URL. Code and data available upon contact with the corresponding author
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Applications (stat.AP)
Studying the response of a climate system to perturbations has practical significance. Standard methods in computing the trajectory-wise deviation caused by perturbations may suffer from the chaotic nature that makes the model error dominate the true response after a short lead time. Statistical response, which computes the return described by the statistics, provides a systematic way of reaching robust outcomes with an appropriate quantification of the uncertainty and extreme events. In this paper, information theory is applied to compute the statistical response and find the most sensitive perturbation direction of different El Niño-Southern Oscillation (ENSO) events to initial value and model parameter perturbations. Depending on the initial phase and the time horizon, different state variables contribute to the most sensitive perturbation direction. While initial perturbations in sea surface temperature (SST) and thermocline depth usually lead to the most significant response of SST at short- and long-range, respectively, initial adjustment of the zonal advection can be crucial to trigger strong statistical responses at medium-range around 5 to 7 months, especially at the transient phases between El Niño and La Niña. It is also shown that the response in the variance triggered by external random forcing perturbations, such as the wind bursts, often dominates the mean response, making the resulting most sensitive direction very different from the trajectory-wise methods. Finally, despite the strong non-Gaussian climatology distributions, using Gaussian approximations in the information theory is efficient and accurate for computing the statistical response, allowing the method to be applied to sophisticated operational systems.
- [70] arXiv:2402.00957 (replaced) [pdf, html, other]
-
Title: Credal Learning Theory
Comments: 30 pages, 2 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Statistical learning theory is the foundation of machine learning, providing theoretical bounds for the risk of models learned from a (single) training set, assumed to issue from an unknown probability distribution. In actual deployment, however, the data distribution may (and often does) vary, causing domain adaptation/generalization issues. In this paper we lay the foundations for a 'credal' theory of learning, using convex sets of probabilities (credal sets) to model the variability in the data-generating distribution. Such credal sets, we argue, may be inferred from a finite sample of training sets. Bounds are derived for the case of finite hypothesis spaces (both with and without assuming realizability), as well as infinite model spaces, which directly generalize classical results.
- [71] arXiv:2402.03994 (replaced) [pdf, html, other]
-
Title: Efficient Sketches for Training Data Attribution and Studying the Loss Landscape
Journal-ref: Neurips 2024
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The study of modern machine learning models often necessitates storing vast quantities of gradients or Hessian vector products (HVPs). Traditional sketching methods struggle to scale under these memory constraints. We present a novel framework for scalable gradient and HVP sketching, tailored for modern hardware. We provide theoretical guarantees and demonstrate the power of our methods in applications like training data attribution, Hessian spectrum analysis, and intrinsic dimension computation for pre-trained language models. Our work sheds new light on the behavior of pre-trained language models, challenging assumptions about their intrinsic dimensionality and Hessian properties.
- [72] arXiv:2404.12312 (replaced) [pdf, html, other]
-
Title: A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization
Comments: Submitted
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks. In particular, we consider the minimax optimization problem stemming from estimating linear functional equations defined by conditional expectations, where the objective functions are quadratic in the functional spaces. We address (i) the convergence of the stochastic gradient descent-ascent algorithm and (ii) the representation learning of the neural networks. We establish convergence under the mean-field regime by considering the continuous-time and infinite-width limit of the optimization dynamics. Under this regime, the stochastic gradient descent-ascent corresponds to a Wasserstein gradient flow over the space of probability measures defined over the space of neural network parameters. We prove that the Wasserstein gradient flow converges globally to a stationary point of the minimax objective at a $O(T^{-1} + \alpha^{-1})$ sublinear rate, and additionally finds the solution to the functional equation when the regularizer of the minimax objective is strongly convex. Here $T$ denotes the time and $\alpha$ is a scaling parameter of the neural networks. In terms of representation learning, our results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance. Finally, we apply our general results to concrete examples including policy evaluation, nonparametric instrumental variable regression, asset pricing, and adversarial Riesz representer estimation.
- [73] arXiv:2405.11881 (replaced) [pdf, html, other]
-
Title: Out-of-Distribution Detection with a Single Unconditional Diffusion Model
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Out-of-distribution (OOD) detection is a critical task in machine learning that seeks to identify abnormal samples. Traditionally, unsupervised methods utilize a deep generative model for OOD detection. However, such approaches require a new model to be trained for each inlier dataset. This paper explores whether a single model can perform OOD detection across diverse tasks. To that end, we introduce Diffusion Paths (DiffPath), which uses a single diffusion model originally trained to perform unconditional generation for OOD detection. We introduce a novel technique of measuring the rate-of-change and curvature of the diffusion paths connecting samples to the standard normal. Extensive experiments show that with a single model, DiffPath is competitive with prior work using individual models on a variety of OOD tasks involving different distributions. Our code is publicly available at this https URL.
- [74] arXiv:2405.15885 (replaced) [pdf, html, other]
-
Title: Diffusion Bridge Implicit Models
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Denoising diffusion bridge models (DDBMs) are a powerful variant of diffusion models for interpolating between two arbitrary paired distributions given as endpoints. Despite their promising performance in tasks like image translation, DDBMs require a computationally intensive sampling process that involves the simulation of a (stochastic) differential equation through hundreds of network evaluations. In this work, we take the first step in fast sampling of DDBMs without extra training, motivated by the well-established recipes in diffusion models. We generalize DDBMs via a class of non-Markovian diffusion bridges defined on the discretized timesteps concerning sampling, which share the same marginal distributions and training objectives, and give rise to generative processes ranging from stochastic to deterministic, resulting in diffusion bridge implicit models (DBIMs). DBIMs are not only up to 25$\times$ faster than the vanilla sampler of DDBMs but also induce a novel, simple, and insightful form of ordinary differential equation (ODE) which inspires high-order numerical solvers. Moreover, DBIMs maintain the generation diversity in a distinguished way, by using a booting noise in the initial sampling step, which enables faithful encoding, reconstruction, and semantic interpolation in image translation tasks. Code is available at this https URL.
- [75] arXiv:2405.20318 (replaced) [pdf, other]
-
Title: Analyzing Human Questioning Behavior and Causal Curiosity through Natural Queries
Authors: Roberto Ceraolo, Dmitrii Kharlapenko, Ahmad Khan, Amélie Reymond, Rada Mihalcea, Bernhard Schölkopf, Mrinmaya Sachan, Zhijing Jin
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
The recent development of Large Language Models (LLMs) has changed our role in interacting with them. Instead of primarily testing these models with questions we already know the answers to, we now use them to explore questions where the answers are unknown to us. This shift, which hasn't been fully addressed in existing datasets, highlights the growing need to understand naturally occurring human questions, which are more complex, open-ended, and reflective of real-world needs. To this end, we present NatQuest, a collection of 13,500 naturally occurring questions from three diverse sources: human-to-search-engine queries, human-to-human interactions, and human-to-LLM conversations. Our comprehensive collection enables a rich understanding of human curiosity across various domains and contexts. Our analysis reveals a significant presence of causal questions (up to 42%) within the dataset, for which we develop an iterative prompt improvement framework to identify all causal queries, and examine their unique linguistic properties, cognitive complexity, and source distribution. We also lay the groundwork to explore LLM performance on these questions and provide six efficient classification models to identify causal questions at scale for future work.
- [76] arXiv:2406.04562 (replaced) [pdf, html, other]
-
Title: A Unified View of Group Fairness Tradeoffs Using Partial Information Decomposition
Comments: Published as a conference paper at 2024 IEEE International Symposium on Information Theory (ISIT 2024)
Subjects: Information Theory (cs.IT); Computers and Society (cs.CY); Machine Learning (cs.LG); Machine Learning (stat.ML)
This paper introduces a novel information-theoretic perspective on the relationship between prominent group fairness notions in machine learning, namely statistical parity, equalized odds, and predictive parity. It is well known that simultaneous satisfiability of these three fairness notions is usually impossible, motivating practitioners to resort to approximate fairness solutions rather than stringent satisfiability of these definitions. However, a comprehensive analysis of their interrelations, particularly when they are not exactly satisfied, remains largely unexplored. Our main contribution lies in elucidating an exact relationship between these three measures of (un)fairness by leveraging a body of work in information theory called partial information decomposition (PID). In this work, we leverage PID to identify the granular regions where these three measures of (un)fairness overlap and where they disagree with each other leading to potential tradeoffs. We also include numerical simulations to complement our results.
- [77] arXiv:2407.07829 (replaced) [pdf, html, other]
-
Title: Disentangled Representation Learning with the Gromov-Monge Gap
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Learning disentangled representations from unlabelled data is a fundamental challenge in machine learning. Solving it may unlock other problems, such as generalization, interpretability, or fairness. Although remarkably challenging to solve in theory, disentanglement is often achieved in practice through prior matching. Furthermore, recent works have shown that prior matching approaches can be enhanced by leveraging geometrical considerations, e.g., by learning representations that preserve geometric features of the data, such as distances or angles between points. However, matching the prior while preserving geometric features is challenging, as a mapping that fully preserves these features while aligning the data distribution with the prior does not exist in general. To address these challenges, we introduce a novel approach to disentangled representation learning based on quadratic optimal transport. We formulate the problem using Gromov-Monge maps that transport one distribution onto another with minimal distortion of predefined geometric features, preserving them as much as can be achieved. To compute such maps, we propose the Gromov-Monge-Gap (GMG), a regularizer quantifying whether a map moves a reference distribution with minimal geometry distortion. We demonstrate the effectiveness of our approach for disentanglement across four standard benchmarks, outperforming other methods leveraging geometric considerations.
- [78] arXiv:2409.13728 (replaced) [pdf, other]
-
Title: Rule Extrapolation in Language Models: A Study of Compositional Generalization on OOD Prompts
Comments: Accepted as a spotlight poster at NeurIPS2024
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
LLMs show remarkable emergent abilities, such as inferring concepts from presumably out-of-distribution prompts, known as in-context learning. Though this success is often attributed to the Transformer architecture, our systematic understanding is limited. In complex real-world data sets, even defining what is out-of-distribution is not obvious. To better understand the OOD behaviour of autoregressive LLMs, we focus on formal languages, which are defined by the intersection of rules. We define a new scenario of OOD compositional generalization, termed rule extrapolation. Rule extrapolation describes OOD scenarios, where the prompt violates at least one rule. We evaluate rule extrapolation in formal languages with varying complexity in linear and recurrent architectures, the Transformer, and state space models to understand the architectures' influence on rule extrapolation. We also lay the first stones of a normative theory of rule extrapolation, inspired by the Solomonoff prior in algorithmic information theory.
- [79] arXiv:2409.18421 (replaced) [pdf, html, other]
-
Title: Moment varieties of the inverse Gaussian and gamma distributions are nondefective
Comments: 23 pages. Some small corrections and expository improvements. Comments welcome!
Subjects: Algebraic Geometry (math.AG); Statistics Theory (math.ST)
We show that the parameters of a $k$-mixture of inverse Gaussian or gamma distributions are algebraically identifiable from the first $3k-1$ moments, and rationally identifiable from the first $3k+2$ moments. Our proofs are based on Terracini's classification of defective surfaces, careful analysis of the intersection theory of moment varieties, and a recent result on sufficient conditions for rational identifiability of secant varieties by Massarenti--Mella.
- [80] arXiv:2410.10473 (replaced) [pdf, other]
-
Title: The Implicit Bias of Structured State Space Models Can Be Poisoned With Clean Labels
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Neural networks are powered by an implicit bias: a tendency of gradient descent to fit training data in a way that generalizes to unseen data. A recent class of neural network models gaining increasing popularity is structured state space models (SSMs), regarded as an efficient alternative to transformers. Prior work argued that the implicit bias of SSMs leads to generalization in a setting where data is generated by a low dimensional teacher. In this paper, we revisit the latter setting, and formally establish a phenomenon entirely undetected by prior work on the implicit bias of SSMs. Namely, we prove that while implicit bias leads to generalization under many choices of training data, there exist special examples whose inclusion in training completely distorts the implicit bias, to a point where generalization fails. This failure occurs despite the special training examples being labeled by the teacher, i.e. having clean labels! We empirically demonstrate the phenomenon, with SSMs trained independently and as part of non-linear neural networks. In the area of adversarial machine learning, disrupting generalization with cleanly labeled training examples is known as clean-label poisoning. Given the proliferation of SSMs, particularly in large language models, we believe significant efforts should be invested in further delineating their susceptibility to clean-label poisoning, and in developing methods for overcoming this susceptibility.
- [81] arXiv:2410.13714 (replaced) [pdf, html, other]
-
Title: Generation through the lens of learning theory
Comments: Fixed a bug in a proof
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We study generation through the lens of statistical learning theory. First, we abstract and formalize the results of Gold [1967], Angluin [1979, 1980], and Kleinberg and Mullainathan [2024] for language identification/generation in the limit in terms of a binary hypothesis class defined over an abstract instance space. Then, we formalize a different paradigm of generation studied by Kleinberg and Mullainathan [2024], which we call "uniform generation," and provide a characterization of which hypothesis classes are uniformly generatable. As is standard in statistical learning theory, our characterization is in terms of the finiteness of a new combinatorial dimension we call the Closure dimension. By doing so, we are able to compare generatability with predictability (captured via PAC and online learnability) and show that these two properties of hypothesis classes are incompatible: there are classes that are generatable but not predictable and vice versa.
- [82] arXiv:2410.13914 (replaced) [pdf, html, other]
-
Title: Exogenous Matching: Learning Good Proposals for Tractable Counterfactual Estimation
Comments: 51 pages, 15 figures
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We propose an importance sampling method for tractable and efficient estimation of counterfactual expressions in general settings, named Exogenous Matching. By minimizing a common upper bound of counterfactual estimators, we transform the variance minimization problem into a conditional distribution learning problem, enabling its integration with existing conditional distribution modeling approaches. We validate the theoretical results through experiments under various types and settings of Structural Causal Models (SCMs) and demonstrate that the method outperforms existing importance sampling methods on counterfactual estimation tasks. We also explore the impact of injecting structural prior knowledge (counterfactual Markov boundaries) on the results. Finally, we apply this method to identifiable proxy SCMs and demonstrate the unbiasedness of the estimates, empirically illustrating the applicability of the method to practical scenarios.
- [83] arXiv:2410.17998 (replaced) [pdf, html, other]
-
Title: Estimating the Spectral Moments of the Kernel Integral Operator from Finite Sample Matrices
Subjects: Machine Learning (cs.LG); Spectral Theory (math.SP); Statistics Theory (math.ST); Machine Learning (stat.ML)
Analyzing the structure of sampled features from an input data distribution is challenging when constrained by limited measurements in both the number of inputs and features. Traditional approaches often rely on the eigenvalue spectrum of the sample covariance matrix derived from finite measurement matrices; however, these spectra are sensitive to the size of the measurement matrix, leading to biased insights. In this paper, we introduce a novel algorithm that provides unbiased estimates of the spectral moments of the kernel integral operator in the limit of infinite inputs and features from finitely sampled measurement matrices. Our method, based on dynamic programming, is efficient and capable of estimating the moments of the operator spectrum. We demonstrate the accuracy of our estimator on radial basis function (RBF) kernels, highlighting its consistency with the theoretical spectra. Furthermore, we showcase the practical utility and robustness of our method in understanding the geometry of learned representations in neural networks.