431 results for au:Chen_J in:stat

Training-free Diffusion Model Alignment with Sampling Demons
Po-Hung Yeh, Kuang-Huei Lee, Jun-Cheng Chen
Oct 10 2024 cs.CV cs.AI cs.LG math.OC stat.ML arXiv:2410.05760v1

@misc{2410.05760, author = {Po-Hung Yeh and Kuang-Huei Lee and Jun-Cheng Chen}, title = {{T}raining-free {D}iffusion {M}odel {A}lignment with {S}ampling {D}emons}, year = {2024}, eprint = {2410.05760}, note = {arXiv:2410.05760v1} }
Aligning diffusion models with user preferences has been a key challenge. Existing methods for aligning diffusion models either require retraining or are limited to differentiable reward functions. To address these limitations, we propose a stochastic optimization approach, dubbed Demon, to guide the denoising process at inference time without backpropagation through reward functions or model retraining. Our approach works by controlling noise distribution in denoising steps to concentrate density on regions corresponding to high rewards through stochastic optimization. We provide comprehensive theoretical and empirical evidence to support and validate our approach, including experiments that use non-differentiable sources of rewards such as Visual-Language Model (VLM) APIs and human judgements. To the best of our knowledge, the proposed approach is the first inference-time, backpropagation-free preference alignment method for diffusion models. Our method can be easily integrated with existing diffusion models without further training. Our experiments show that the proposed approach significantly improves the average aesthetics scores for text-to-image generation.
Challenges and Possible Strategies to Address Them in Rare Disease Drug Development: A Statistical Perspective
Jie Chen, Lei Nie, Shiowjen Lee, Haitao Chu, Haijun Tian, Yan Wang, Weili He, Thomas Jemielita, Susan Gruber, Yang Song, Roy Tamura, Lu Tian, Yihua Zhao, Yong Chen, Mark van der Laan, Hana Lee
Oct 10 2024 stat.AP arXiv:2410.06585v1

@misc{2410.06585, author = {Jie Chen and Lei Nie and Shiowjen Lee and Haitao Chu and Haijun Tian and Yan Wang and Weili He and Thomas Jemielita and Susan Gruber and Yang Song and Roy Tamura and Lu Tian and Yihua Zhao and Yong Chen and Mark van der Laan and Hana Lee}, title = {{C}hallenges and {P}ossible {S}trategies to {A}ddress {T}hem in {R}are {D}isease {D}rug {D}evelopment: {A} {S}tatistical {P}erspective}, year = {2024}, eprint = {2410.06585}, note = {arXiv:2410.06585v1} }
Developing drugs for rare diseases presents unique challenges from a statistical perspective. These challenges may include slowly progressive diseases with unmet medical needs, poorly understood natural history, small population size, diversified phenotypes and genotypes within a disorder, and lack of appropriate surrogate endpoints to measure clinical benefits. The Real-World Evidence (RWE) Scientific Working Group of the American Statistical Association Biopharmaceutical Section has assembled a research team to assess the landscape including challenges and possible strategies to address these challenges and the role of real-world data (RWD) and RWE in rare disease drug development. This paper first reviews the current regulations by regulatory agencies worldwide and then discusses in more detail the challenges from a statistical perspective in the design, conduct, and analysis of rare disease clinical trials. After outlining an overall development pathway for rare disease drugs, corresponding strategies to address the aforementioned challenges are presented. Other considerations are also discussed for generating relevant evidence for regulatory decision-making on drugs for rare diseases. The accompanying paper discusses how RWD and RWE can be used to improve the efficiency of rare disease drug development.
Use of Real-World Data and Real-World Evidence in Rare Disease Drug Development: A Statistical Perspective
Jie Chen, Susan Gruber, Hana Lee, Haitao Chu, Shiowjen Lee, Haijun Tian, Yan Wang, Weili He, Thomas Jemielita, Yang Song, Roy Tamura, Lu Tian, Yihua Zhao, Yong Chen, Mark van der Laan, Lei Nie
Oct 10 2024 stat.AP arXiv:2410.06586v1

@misc{2410.06586, author = {Jie Chen and Susan Gruber and Hana Lee and Haitao Chu and Shiowjen Lee and Haijun Tian and Yan Wang and Weili He and Thomas Jemielita and Yang Song and Roy Tamura and Lu Tian and Yihua Zhao and Yong Chen and Mark van der Laan and Lei Nie}, title = {{U}se of {R}eal-{W}orld {D}ata and {R}eal-{W}orld {E}vidence in {R}are {D}isease {D}rug {D}evelopment: {A} {S}tatistical {P}erspective}, year = {2024}, eprint = {2410.06586}, note = {arXiv:2410.06586v1} }
Real-world data (RWD) and real-world evidence (RWE) have been increasingly used in medical product development and regulatory decision-making, especially for rare diseases. After outlining the challenges and possible strategies to address the challenges in rare disease drug development (see the accompanying paper), the Real-World Evidence (RWE) Scientific Working Group of the American Statistical Association Biopharmaceutical Section reviews the roles of RWD and RWE in clinical trials for drugs treating rare diseases. This paper summarizes relevant guidance documents and frameworks by selected regulatory agencies and the current practice on the use of RWD and RWE in natural history studies and the design, conduct, and analysis of rare disease clinical trials. A targeted learning roadmap for rare disease trials is described, followed by case studies on the use of RWD and RWE to support a natural history study and marketing applications in various settings.
Decentralized Clinical Trials in the Era of Real-World Evidence: A Statistical Perspective
Jie Chen, Junrui Di, Nadia Daizadeh, Ying Lu, Hongwei Wang, Yuan-Li Shen, Jennifer Kirk, Frank W. Rockhold, Herbert Pang, Jing Zhao, Weili He, Andrew Potter, Hana Lee
Oct 10 2024 stat.AP arXiv:2410.06591v1

@misc{2410.06591, author = {Jie Chen and Junrui Di and Nadia Daizadeh and Ying Lu and Hongwei Wang and Yuan-Li Shen and Jennifer Kirk and Frank W.~Rockhold and Herbert Pang and Jing Zhao and Weili He and Andrew Potter and Hana Lee}, title = {{D}ecentralized {C}linical {T}rials in the {E}ra of {R}eal-{W}orld {E}vidence: {A} {S}tatistical {P}erspective}, year = {2024}, eprint = {2410.06591}, note = {arXiv:2410.06591v1} }
There has been a growing trend that activities relating to clinical trials take place at locations other than traditional trial sites (hence decentralized clinical trials or DCTs), some of which are at settings of real-world clinical practice. Although DCTs offer numerous benefits, they also have implications for a number of issues relating to the design, conduct, and analysis of DCTs. The Real-World Evidence Scientific Working Group of the American Statistical Association Biopharmaceutical Section has been reviewing the field of DCTs and provides in this paper considerations for decentralized trials from a statistical perspective. This paper first discusses selected critical decentralized elements that may have statistical implications on the trial and then summarizes regulatory guidance, framework, and initiatives on DCTs. More discussions are presented by focusing on the design (including construction of estimand), implementation, statistical analysis plan (including missing data handling), and reporting of safety events. Some additional considerations (e.g., ethical considerations, technology infrastructure, study oversight, data security and privacy, and regulatory compliance) are also briefly discussed. This paper is intended to provide statistical considerations for decentralized trials of medical products to support regulatory decision-making.
On the Optimization and Generalization of Two-layer Transformers with Sign Gradient Descent
Bingrui Li, Wei Huang, Andi Han, Zhanpeng Zhou, Taiji Suzuki, Jun Zhu, Jianfei Chen
Oct 08 2024 cs.LG stat.ML arXiv:2410.04870v1

@misc{2410.04870, author = {Bingrui Li and Wei Huang and Andi Han and Zhanpeng Zhou and Taiji Suzuki and Jun Zhu and Jianfei Chen}, title = {{O}n the {O}ptimization and {G}eneralization of {T}wo-layer {T}ransformers with {S}ign {G}radient {D}escent}, year = {2024}, eprint = {2410.04870}, note = {arXiv:2410.04870v1} }
The Adam optimizer is widely used for transformer optimization in practice, which makes understanding the underlying optimization mechanisms an important problem. However, due to Adam's complexity, theoretical analysis of how it optimizes transformers remains a challenging task. Fortunately, Sign Gradient Descent (SignGD) serves as an effective surrogate for Adam. Despite its simplicity, theoretical understanding of how SignGD optimizes transformers still lags behind. In this work, we study how SignGD optimizes a two-layer transformer -- consisting of a softmax attention layer with trainable query-key parameterization followed by a linear layer -- on a linearly separable noisy dataset. We identify four stages in the training dynamics, each exhibiting intriguing behaviors. Based on the training dynamics, we prove the fast convergence but poor generalization of the learned transformer on the noisy dataset. We also show that Adam behaves similarly to SignGD in terms of both optimization and generalization in this setting. Additionally, we find that the poor generalization of SignGD is not solely due to data noise, suggesting that both SignGD and Adam require high-quality data for real-world tasks. Finally, experiments on synthetic and real-world datasets empirically support our theoretical results.
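As a rough illustration of the SignGD update mentioned in this abstract, the sketch below moves each parameter by a fixed step in the direction opposite to the sign of its gradient. It is a minimal toy on a quadratic objective, not the two-layer transformer setting of the paper; the names `signgd_step` and `grad_quadratic` and the learning rate are assumptions made for the example.

```python
def sign(x: float) -> float:
    """Return -1.0, 0.0, or 1.0 according to the sign of x."""
    return float((x > 0) - (x < 0))


def signgd_step(params, grads, lr):
    """One SignGD update: only the sign of each gradient entry is used."""
    return [p - lr * sign(g) for p, g in zip(params, grads)]


def grad_quadratic(params):
    """Gradient of the toy objective f(w) = sum(w_i ** 2) (assumed for illustration)."""
    return [2.0 * p for p in params]


# Three updates on the toy objective, starting from an arbitrary point.
w = [1.0, -0.5]
for _ in range(3):
    w = signgd_step(w, grad_quadratic(w), lr=0.25)
print(w)
```

Because the gradient magnitude is discarded, every coordinate with a nonzero gradient moves by exactly `lr` per step, regardless of how large that gradient is.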
Robust Traffic Forecasting against Spatial Shift over Years
Hongjun Wang, Jiyuan Chen, Tong Pan, Zheng Dong, Lingyu Zhang, Renhe Jiang, Xuan Song
Oct 02 2024 cs.LG cs.AI cs.DB stat.ML arXiv:2410.00373v1

@misc{2410.00373, author = {Hongjun Wang and Jiyuan Chen and Tong Pan and Zheng Dong and Lingyu Zhang and Renhe Jiang and Xuan Song}, title = {{R}obust {T}raffic {F}orecasting against {S}patial {S}hift over {Y}ears}, year = {2024}, eprint = {2410.00373}, note = {arXiv:2410.00373v1} }
Recent advancements in Spatiotemporal Graph Neural Networks (ST-GNNs) and Transformers have demonstrated promising potential for traffic forecasting by effectively capturing both temporal and spatial correlations. The generalization ability of spatiotemporal models has received considerable attention in recent scholarly discourse. However, no substantive datasets specifically addressing traffic out-of-distribution (OOD) scenarios have been proposed. Existing ST-OOD methods are either constrained to testing on extant data or necessitate manual modifications to the dataset. Consequently, the generalization capacity of current spatiotemporal models in OOD scenarios remains largely underexplored. In this paper, we investigate state-of-the-art models using newly proposed traffic OOD benchmarks and, surprisingly, find that these models experience a significant decline in performance. Through meticulous analysis, we attribute this decline to the models' inability to adapt to previously unobserved spatial relationships. To address this challenge, we propose a novel Mixture of Experts (MoE) framework, which learns a set of graph generators (i.e., graphons) during training and adaptively combines them to generate new graphs based on novel environmental conditions to handle spatial distribution shifts during testing. We further extend this concept to the Transformer architecture, achieving substantial improvements. Our method is both parsimonious and efficacious, and can be seamlessly integrated into any spatiotemporal model, outperforming current state-of-the-art approaches in addressing spatial dynamics.
A model-constrained Discontinuous Galerkin Network (DGNet) for Compressible Euler Equations with Out-of-Distribution Generalization
Hai Van Nguyen, Jau-Uei Chen, William Cole Nockolds, Wesley Lao, Tan Bui-Thanh
Sep 30 2024 stat.ML cs.LG stat.CO arXiv:2409.18371v1

@misc{2409.18371, author = {Hai Van Nguyen and Jau-Uei Chen and William Cole Nockolds and Wesley Lao and Tan Bui-Thanh}, title = {{A} model-constrained {D}iscontinuous {G}alerkin {N}etwork ({DGN}et) for {C}ompressible {E}uler {E}quations with {O}ut-of-{D}istribution {G}eneralization}, year = {2024}, eprint = {2409.18371}, note = {arXiv:2409.18371v1} }
Real-time accurate solutions of large-scale complex dynamical systems are critically needed for control, optimization, uncertainty quantification, and decision-making in practical engineering and science applications, particularly in digital twin contexts. In this work, we develop a model-constrained discontinuous Galerkin Network (DGNet) approach, an extension to our previous work [Model-constrained Tangent Slope Learning Approach for Dynamical Systems], for compressible Euler equations with out-of-distribution generalization. The core of DGNet is the synergy of several key strategies: (i) leveraging time integration schemes to capture temporal correlation and taking advantage of neural network speed for computation time reduction; (ii) employing a model-constrained approach to ensure the learned tangent slope satisfies governing equations; (iii) utilizing a GNN-inspired architecture where edges represent Riemann solver surrogate models and nodes represent volume integration correction surrogate models, enabling capturing discontinuity capacity, aliasing error reduction, and mesh discretization generalizability; (iv) implementing the input normalization technique that allows surrogate models to generalize across different initial conditions, boundary conditions, and solution orders; and (v) incorporating a data randomization technique that not only implicitly promotes agreement between surrogate models and true numerical models up to second-order derivatives, ensuring long-term stability and prediction capacity, but also serves as a data generation engine during training, leading to enhanced generalization on unseen data. To validate the effectiveness, stability, and generalizability of our novel DGNet approach, we present comprehensive numerical results for 1D and 2D compressible Euler equation problems.
Optimal Classification-based Anomaly Detection with Neural Networks: Theory and Practice
Tian-Yi Zhou, Matthew Lau, Jizhou Chen, Wenke Lee, Xiaoming Huo
Sep 16 2024 stat.ML cs.CR cs.LG math.ST stat.TH arXiv:2409.08521v1

@misc{2409.08521, author = {Tian-Yi Zhou and Matthew Lau and Jizhou Chen and Wenke Lee and Xiaoming Huo}, title = {{O}ptimal {C}lassification-based {A}nomaly {D}etection with {N}eural {N}etworks: {T}heory and {P}ractice}, year = {2024}, eprint = {2409.08521}, note = {arXiv:2409.08521v1} }
Anomaly detection is an important problem in many application areas, such as network security. Many deep learning methods for unsupervised anomaly detection produce good empirical performance but lack theoretical guarantees. By casting anomaly detection into a binary classification problem, we establish non-asymptotic upper bounds and a convergence rate on the excess risk on rectified linear unit (ReLU) neural networks trained on synthetic anomalies. Our convergence rate on the excess risk matches the minimax optimal rate in the literature. Furthermore, we provide lower and upper bounds on the number of synthetic anomalies that can attain this optimality. For practical implementation, we relax some conditions to improve the search for the empirical risk minimizer, which leads to competitive performance to other classification-based methods for anomaly detection. Overall, our work provides the first theoretical guarantees of unsupervised neural network-based anomaly detectors and empirical insights on how to design them well.
Gaussian mixture Taylor approximations of risk measures constrained by PDEs with Gaussian random field inputs
Dingcheng Luo, Joshua Chen, Peng Chen, Omar Ghattas
Aug 14 2024 math.NA cs.NA stat.CO arXiv:2408.06615v1

@misc{2408.06615, author = {Dingcheng Luo and Joshua Chen and Peng Chen and Omar Ghattas}, title = {{G}aussian mixture {T}aylor approximations of risk measures constrained by {PDE}s with {G}aussian random field inputs}, year = {2024}, eprint = {2408.06615}, note = {arXiv:2408.06615v1} }
This work considers the computation of risk measures for quantities of interest governed by PDEs with Gaussian random field parameters using Taylor approximations. While efficient, Taylor approximations are local to the point of expansion, and hence may degrade in accuracy when the variances of the input parameters are large. To address this challenge, we approximate the underlying Gaussian measure by a mixture of Gaussians with reduced variance in a dominant direction of parameter space. Taylor approximations are constructed at the means of each Gaussian mixture component, which are then combined to approximate the risk measures. The formulation is presented in the setting of infinite-dimensional Gaussian random parameters for risk measures including the mean, variance, and conditional value-at-risk. We also provide detailed analysis of the approximation errors arising from two sources: the Gaussian mixture approximation and the Taylor approximations. Numerical experiments are conducted for a semilinear advection-diffusion-reaction equation with a random diffusion coefficient field and for the Helmholtz equation with a random wave speed field. For these examples, the proposed approximation strategy can achieve less than $1\%$ relative error in estimating CVaR with only $\mathcal{O}(10)$ state PDE solves, which is comparable to a standard Monte Carlo estimate with $\mathcal{O}(10^4)$ samples, thus achieving significant reduction in computational cost. The proposed method can therefore serve as a way to rapidly and accurately estimate risk measures under limited computational budgets.
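For orientation, the sample-based baseline that this abstract compares against can be sketched as a tail average. The code below is one common empirical CVaR estimator, assuming larger values are worse; it is not the Gaussian mixture Taylor method of the paper, and the function name `empirical_cvar` and the rounding guard are choices made for this example. Other CVaR conventions exist and can give slightly different values on small samples.

```python
import math


def empirical_cvar(samples, alpha):
    """Average of the worst (1 - alpha) fraction of samples (larger = worse)."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples, reverse=True)
    # Round before taking the ceiling so floating-point noise such as
    # 3.0000000000000004 does not add an extra sample to the tail.
    tail_size = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return sum(ordered[:tail_size]) / tail_size


# Ten hypothetical loss values; the 80% CVaR averages the two largest.
losses = [float(i) for i in range(1, 11)]
print(empirical_cvar(losses, 0.8))
```

With `alpha = 0.0` the tail is the whole sample, so the estimator reduces to the ordinary mean.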
Optimal Estimation of Structured Covariance Operators
Omar Al-Ghattas, Jiaheng Chen, Daniel Sanz-Alonso, Nathan Waniorek
Aug 06 2024 math.ST math.PR stat.TH arXiv:2408.02109v1

@misc{2408.02109, author = {Omar Al-Ghattas and Jiaheng Chen and Daniel Sanz-Alonso and Nathan Waniorek}, title = {{O}ptimal {E}stimation of {S}tructured {C}ovariance {O}perators}, year = {2024}, eprint = {2408.02109}, note = {arXiv:2408.02109v1} }
This paper establishes optimal convergence rates for estimation of structured covariance operators of Gaussian processes. We study banded operators with kernels that decay rapidly off-the-diagonal and $L^q$-sparse operators with an unordered sparsity pattern. For these classes of operators, we find the minimax optimal rate of estimation in operator norm, identifying the fundamental dimension-free quantities that determine the sample complexity. In addition, we prove that tapering and thresholding estimators attain the optimal rate. The proof of the upper bound for tapering estimators requires novel techniques to circumvent the issue that discretization of a banded operator does not result, in general, in a banded covariance matrix. To derive lower bounds for banded and $L^q$-sparse classes, we introduce a general framework to lift theory from high-dimensional matrix estimation to the operator setting. Our work contributes to the growing literature on operator estimation and learning, building on ideas from high-dimensional statistics while also addressing new challenges that emerge in infinite dimension.
Potential weights and implicit causal designs in linear regression
Jiafeng Chen
Aug 01 2024 econ.EM stat.ME arXiv:2407.21119v1

@misc{2407.21119, author = {Jiafeng Chen}, title = {{P}otential weights and implicit causal designs in linear regression}, year = {2024}, eprint = {2407.21119}, note = {arXiv:2407.21119v1} }
When do linear regressions estimate causal effects in quasi-experiments? This paper provides a generic diagnostic that assesses whether a given linear regression specification on a given dataset admits a design-based interpretation. To do so, we define a notion of potential weights, which encode counterfactual decisions a given regression makes to unobserved potential outcomes. If the specification does admit such an interpretation, this diagnostic can find a vector of unit-level treatment assignment probabilities -- which we call an implicit design -- under which the regression estimates a causal effect. This diagnostic also finds the implicit causal effect estimand. Knowing the implicit design and estimand adds transparency, leads to further sanity checks, and opens the door to design-based statistical inference. When applied to regression specifications studied in the causal inference literature, our framework recovers and extends existing theoretical results. When applied to widely-used specifications not covered by existing causal inference literature, our framework generates new theoretical insights.
Byzantine-tolerant distributed learning of finite mixture models
Qiong Zhang, Jiahua Chen
Jul 22 2024 stat.ME cs.LG stat.ML arXiv:2407.13980v1

@misc{2407.13980, author = {Qiong Zhang and Jiahua Chen}, title = {{B}yzantine-tolerant distributed learning of finite mixture models}, year = {2024}, eprint = {2407.13980}, note = {arXiv:2407.13980v1} }
This paper proposes two split-and-conquer (SC) learning estimators for finite mixture models that are tolerant to Byzantine failures. In SC learning, individual machines obtain local estimates, which are then transmitted to a central server for aggregation. During this communication, the server may receive malicious or incorrect information from some local machines, a scenario known as Byzantine failures. While SC learning approaches have been devised to mitigate Byzantine failures in statistical models with Euclidean parameters, developing Byzantine-tolerant methods for finite mixture models with non-Euclidean parameters requires a distinct strategy. Our proposed distance-based methods are hyperparameter tuning free, unlike existing methods, and are resilient to Byzantine failures while achieving high statistical efficiency. We validate the effectiveness of our methods both theoretically and empirically via experiments on simulated and real data from machine learning applications for digit recognition. The code for the experiment can be found at https://github.com/SarahQiong/RobustSCGMM.
Adjusting for Participation Bias in Case-Control Genetic Association Studies for Rare Diseases
Le Wang, Zhengbang Li, Ben Fitzpatrick, Clarice Weinberg, Jinbo Chen
Jul 12 2024 stat.ME arXiv:2407.08382v1

@misc{2407.08382, author = {Le Wang and Zhengbang Li and Ben Fitzpatrick and Clarice Weinberg and Jinbo Chen}, title = {{A}djusting for {P}articipation {B}ias in {C}ase-{C}ontrol {G}enetic {A}ssociation {S}tudies for {R}are {D}iseases}, year = {2024}, eprint = {2407.08382}, note = {arXiv:2407.08382v1} }
Collection of genotype data in case-control genetic association studies may often be incomplete for reasons related to genes themselves. This non-ignorable missingness structure, if not appropriately accounted for, can result in participation bias in association analyses. To deal with this issue, Chen et al. (2016) proposed to collect additional genetic information from family members of individuals whose genotype data were not available, and developed a maximum likelihood method for bias correction. In this study, we develop an estimating equation approach to analyzing data collected from this design that allows adjustment of covariates. It jointly estimates odds ratio parameters for genetic association and missingness, where a logistic regression model is used to relate missingness to genotype and other covariates. Our method allows correlation between genotype and covariates while using genetic information from family members to provide information on the missing genotype data. In the estimating equation for genetic association parameters, we weight the contribution of each genotyped subject to the empirical likelihood score function by the inverse probability that the genotype data are available. We evaluate large and finite sample performance of our method via simulation studies and apply it to a family-based case-control study of breast cancer.
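The inverse-probability weighting idea in this abstract can be illustrated on a much simpler target than the paper's estimating equations. The sketch below computes a normalized inverse-probability-weighted mean, where each observed unit is weighted by one over its probability of being observed. The function name `ipw_mean`, the data, and the availability probabilities are hypothetical, and the probabilities are taken as known here rather than estimated from a logistic model.

```python
def ipw_mean(values, observed, probs):
    """Normalized inverse-probability-weighted mean over observed units.

    values[i] is used only when observed[i] is true; probs[i] is the
    (assumed known) probability that unit i is observed.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for value, is_observed, prob in zip(values, observed, probs):
        if is_observed:
            weight = 1.0 / prob
            weighted_sum += weight * value
            weight_total += weight
    if weight_total == 0.0:
        raise ValueError("no observed units")
    return weighted_sum / weight_total


# Hypothetical data: the third unit's value is missing.
values = [1.0, 3.0, None]
observed = [True, True, False]
probs = [0.5, 0.25, 0.5]
print(ipw_mean(values, observed, probs))
```

Units that were unlikely to be observed receive larger weights, so they stand in for similar units whose data are missing.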
MD tree: a model-diagnostic tree grown on loss landscape
Yefan Zhou, Jianlong Chen, Qinxue Cao, Konstantin Schürholt, Yaoqing Yang
Jun 26 2024 cs.LG stat.ML arXiv:2406.16988v1

@misc{2406.16988, author = {Yefan Zhou and Jianlong Chen and Qinxue Cao and Konstantin Schürholt and Yaoqing Yang}, title = {{MD} tree: a model-diagnostic tree grown on loss landscape}, year = {2024}, eprint = {2406.16988}, note = {arXiv:2406.16988v1} }
This paper considers "model diagnosis", which we formulate as a classification problem. Given a pre-trained neural network (NN), the goal is to predict the source of failure from a set of failure modes (such as a wrong hyperparameter, inadequate model size, and insufficient data) without knowing the training configuration of the pre-trained NN. The conventional diagnosis approach uses training and validation errors to determine whether the model is underfitting or overfitting. However, we show that rich information about NN performance is encoded in the optimization loss landscape, which provides more actionable insights than validation-based measurements. Therefore, we propose a diagnosis method called MD tree based on loss landscape metrics and experimentally demonstrate its advantage over classical validation-based approaches. We verify the effectiveness of MD tree in multiple practical scenarios: (1) use several models trained on one dataset to diagnose a model trained on another dataset, essentially a few-shot dataset transfer problem; (2) use small models (or models trained with small data) to diagnose big models (or models trained with big data), essentially a scale transfer problem. In a dataset transfer task, MD tree achieves an accuracy of 87.7%, outperforming validation-based approaches by 14.88%. Our code is available at https://github.com/YefanZhou/ModelDiagnosis.
TREE: Tree Regularization for Efficient Execution
Lena Schmid, Daniel Biebert, Christian Hakert, Kuan-Hsun Chen, Michel Lang, Markus Pauly, Jian-Jia Chen
Jun 19 2024 cs.LG stat.ML arXiv:2406.12531v1

@misc{2406.12531, author = {Lena Schmid and Daniel Biebert and Christian Hakert and Kuan-Hsun Chen and Michel Lang and Markus Pauly and Jian-Jia Chen}, title = {{TREE}: {T}ree {R}egularization for {E}fficient {E}xecution}, year = {2024}, eprint = {2406.12531}, note = {arXiv:2406.12531v1} }
The rise of machine learning methods on heavily resource constrained devices requires not only the choice of a suitable model architecture for the target platform, but also the optimization of the chosen model with regard to execution time consumption for inference in order to optimally utilize the available resources. Random forests and decision trees are shown to be a suitable model for such a scenario, since they are not only heavily tunable towards the total model size, but also offer a high potential for optimizing their executions according to the underlying memory architecture. In addition to the straightforward strategy of enforcing shorter paths through decision trees and hence reducing the execution time for inference, hardware-aware implementations can optimize the execution time in an orthogonal manner. One particular hardware-aware optimization is to lay out the memory of decision trees in such a way that higher-probability paths are less likely to be evicted from system caches. This works particularly well when splits within tree nodes are uneven and have a high probability to visit one of the child nodes. In this paper, we present a method to reduce path lengths by rewarding uneven probability distributions during the training of decision trees at the cost of a minimal accuracy degradation. Specifically, we regularize the impurity computation of the CART algorithm in order to favor not only low impurity, but also highly asymmetric distributions for the evaluation of split criteria and hence offer a high optimization potential for a memory architecture-aware implementation. We show that especially for binary classification data sets and data sets with many samples, this form of regularization can lead to a reduction of up to approximately four times in the execution time with a minimal accuracy degradation.
Constrained Design of a Binary Instrument in a Partially Linear Model
Tim Morrison, Minh Nguyen, Jonathan Chen, Michael Baiocchi, Art B. Owen
Jun 11 2024 stat.ME arXiv:2406.05592v2

@misc{2406.05592, author = {Tim Morrison and Minh Nguyen and Jonathan Chen and Michael Baiocchi and Art B.~Owen}, title = {{C}onstrained {D}esign of a {B}inary {I}nstrument in a {P}artially {L}inear {M}odel}, year = {2024}, eprint = {2406.05592}, note = {arXiv:2406.05592v2} }
We study the question of how best to assign an encouragement in a randomized encouragement study. In our setting, units arrive with covariates, receive a nudge toward treatment or control, acquire one of those statuses in a way that need not align with the nudge, and finally have a response observed. The nudge can be seen as a binary instrument that affects the response only via the treatment status. Our goal is to assign the nudge as a function of covariates in a way that best estimates the local average treatment effect (LATE). We assume a partially linear model, wherein the baseline model is non-parametric and the treatment term is linear in the covariates. Under this model, we outline a two-stage procedure to consistently estimate the LATE. Though the variance of the LATE is intractable, we derive a finite sample approximation and thus a design criterion to minimize. This criterion is convex, allowing for constraints that might arise for budgetary or ethical reasons. We prove conditions under which our solution asymptotically recovers the lowest true variance among all possible nudge propensities. We apply our method to a semi-synthetic example involving triage in an emergency department and find significant gains relative to a regression discontinuity design.
Towards Scalable Automated Alignment of LLMs: A Survey
Boxi Cao, Keming Lu, Xinyu Lu, Jiawei Chen, Mengjie Ren, Hao Xiang, Peilin Liu, Yaojie Lu, Ben He, Xianpei Han, Le Sun, Hongyu Lin, Bowen Yu
Jun 04 2024 cs.CL cs.AI stat.ML arXiv:2406.01252v3

@misc{2406.01252, author = {Boxi Cao and Keming Lu and Xinyu Lu and Jiawei Chen and Mengjie Ren and Hao Xiang and Peilin Liu and Yaojie Lu and Ben He and Xianpei Han and Le Sun and Hongyu Lin and Bowen Yu}, title = {{T}owards {S}calable {A}utomated {A}lignment of {LLM}s: {A} {S}urvey}, year = {2024}, eprint = {2406.01252}, note = {arXiv:2406.01252v3} }
Alignment is the most critical step in building large language models (LLMs) that meet human needs. With the rapid development of LLMs gradually surpassing human capabilities, traditional alignment methods based on human-annotation are increasingly unable to meet the scalability demands. Therefore, there is an urgent need to explore new sources of automated alignment signals and technical approaches. In this paper, we systematically review the recently emerging methods of automated alignment, attempting to explore how to achieve effective, scalable, automated alignment once the capabilities of LLMs exceed those of humans. Specifically, we categorize existing automated alignment methods into 4 major categories based on the sources of alignment signals and discuss the current status and potential development of each category. Additionally, we explore the underlying mechanisms that enable automated alignment and discuss the essential factors that make automated alignment technologies feasible and effective from the fundamental role of alignment.
Revisit, Extend, and Enhance Hessian-Free Influence Functions
Ziao Yang, Han Yue, Jian Chen, Hongfu Liu
May 29 2024 cs.LG stat.ML arXiv:2405.17490v2

@misc{2405.17490, author = {Ziao Yang and Han Yue and Jian Chen and Hongfu Liu}, title = {{R}evisit, {E}xtend, and {E}nhance {H}essian-{F}ree {I}nfluence {F}unctions}, year = {2024}, eprint = {2405.17490}, note = {arXiv:2405.17490v2} }
PDF
Influence functions serve as crucial tools for assessing sample influence in model interpretation, subset training set selection, noisy label detection, and more. By employing the first-order Taylor expansion, influence functions can estimate sample influence without the need for expensive model retraining. However, applying influence functions directly to deep models presents challenges, primarily due to the non-convex nature of the loss function and the large size of model parameters. This difficulty not only makes computing the inverse of the Hessian matrix costly but also renders it non-existent in some cases. Various approaches, including matrix decomposition, have been explored to expedite and approximate the inversion of the Hessian matrix, with the aim of making influence functions applicable to deep models. In this paper, we revisit a specific, albeit naive, yet effective approximation method known as TracIn. This method substitutes the inverse of the Hessian matrix with an identity matrix. We provide deeper insights into why this simple approximation method performs well. Furthermore, we extend its applications beyond measuring model utility to include considerations of fairness and robustness. Finally, we enhance TracIn through an ensemble strategy. To validate its effectiveness, we conduct experiments on synthetic data and extensive evaluations on noisy label detection, sample selection for large language model fine-tuning, and defense against adversarial attacks.
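As a rough illustration of the identity-for-inverse-Hessian substitution described in this abstract, the sketch below scores each training sample by the inner product of its loss gradient with the gradient of a test sample. It is not the authors' implementation: the logistic-regression model, the function names `logistic_grad` and `hessian_free_influence`, and the toy data are all assumptions made for the example.

```python
import numpy as np

def logistic_grad(w, x, y):
    """Gradient of the log-loss for one sample (x, y), with y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return (p - y) * x

def hessian_free_influence(w, x_train, y_train, x_test, y_test):
    """Gradient inner products: the inverse Hessian is replaced by the identity."""
    g_test = logistic_grad(w, x_test, y_test)
    return np.array([g_test @ logistic_grad(w, x, y)
                     for x, y in zip(x_train, y_train)])

# Hypothetical toy data: two training samples, one test sample.
w = np.array([1.0, -1.0])
x_train = np.array([[1.0, 0.0], [0.0, 1.0]])
y_train = np.array([1, 1])
scores = hessian_free_influence(w, x_train, y_train, np.array([1.0, 0.0]), 1)
print(scores)
```

Under this sign convention, a positive score means that a small gradient step on the training sample would also lower the loss on the test sample; a score of zero means the two gradients are orthogonal.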
Diffusion Bridge Implicit Models
Kaiwen Zheng, Guande He, Jianfei Chen, Fan Bao, Jun Zhu
May 28 2024 cs.LG stat.ML arXiv:2405.15885v1

@misc{2405.15885, author = {Kaiwen Zheng and Guande He and Jianfei Chen and Fan Bao and Jun Zhu}, title = {{D}iffusion {B}ridge {I}mplicit {M}odels}, year = {2024}, eprint = {2405.15885}, note = {arXiv:2405.15885v1} }
PDF
Denoising diffusion bridge models (DDBMs) are a powerful variant of diffusion models for interpolating between two arbitrary paired distributions given as endpoints. Despite their promising performance in tasks like image translation, DDBMs require a computationally intensive sampling process that involves the simulation of a (stochastic) differential equation through hundreds of network evaluations. In this work, we present diffusion bridge implicit models (DBIMs) for accelerated sampling of diffusion bridges without extra training. We generalize DDBMs via a class of non-Markovian diffusion bridges defined on the discretized timesteps concerning sampling, which share the same training objective as DDBMs. These generalized diffusion bridges give rise to generative processes ranging from stochastic to deterministic (i.e., an implicit probabilistic model) while being up to 25$\times$ faster than the vanilla sampler of DDBMs. Moreover, the deterministic sampling procedure yielded by DBIMs enables faithful encoding and reconstruction by a booting noise used in the initial sampling step, and allows us to perform semantically meaningful interpolation in image translation tasks by regarding the booting noise as the latent variable.
Transfer Learning Under High-Dimensional Graph Convolutional Regression Model for Node Classification
Jiachen Chen, Danyang Huang, Liyuan Wang, Kathryn L. Lunetta, Debarghya Mukherjee, Huimin Cheng
May 28 2024 stat.ML cs.LG stat.ME arXiv:2405.16672v1

@misc{2405.16672, author = {Jiachen Chen and Danyang Huang and Liyuan Wang and Kathryn L.~Lunetta and Debarghya Mukherjee and Huimin Cheng}, title = {{T}ransfer {L}earning {U}nder {H}igh-{D}imensional {G}raph {C}onvolutional {R}egression {M}odel for {N}ode {C}lassification}, year = {2024}, eprint = {2405.16672}, note = {arXiv:2405.16672v1} }
PDF
Node classification is a fundamental task, but obtaining node classification labels can be challenging and expensive in many real-world scenarios. Transfer learning has emerged as a promising solution to address this challenge by leveraging knowledge from source domains to enhance learning in a target domain. Existing transfer learning methods for node classification primarily focus on integrating Graph Convolutional Networks (GCNs) with various transfer learning techniques. While these approaches have shown promising results, they often suffer from a lack of theoretical guarantees, restrictive conditions, and high sensitivity to hyperparameter choices. To overcome these limitations, we propose a Graph Convolutional Multinomial Logistic Regression (GCR) model and a transfer learning method based on the GCR model, called Trans-GCR. We provide theoretical guarantees of the estimate obtained under the GCR model in high-dimensional settings. Moreover, Trans-GCR demonstrates superior empirical performance, has a low computational cost, and requires fewer hyperparameters than existing methods.
Considerations for Single-Arm Trials to Support Accelerated Approval of Oncology Drugs
Feinan Lu, Tao Wang, Ying Lu, Jie Chen
May 22 2024 stat.AP arXiv:2405.12437v1

@misc{2405.12437, author = {Feinan Lu and Tao Wang and Ying Lu and Jie Chen}, title = {{C}onsiderations for {S}ingle-{A}rm {T}rials to {S}upport {A}ccelerated {A}pproval of {O}ncology {D}rugs}, year = {2024}, eprint = {2405.12437}, note = {arXiv:2405.12437v1} }
PDF
In the last two decades, single-arm trials (SATs) have been effectively used to study anticancer therapies in well-defined patient populations using durable response rates as objective and interpretable clinical endpoints. With a growing trend of regulatory accelerated approval (AA) requiring randomized controlled trials (RCTs), some confusion has arisen about the role of SATs in AA. This paper is intended to elucidate conditions under which an SAT may be considered reasonable for AA. Specifically, the paper describes (1) two necessary conditions for designing an SAT, (2) three sufficient conditions that help either optimize the study design or interpret the study results, (3) four conditions that demonstrate substantial evidence of clinical benefits of the drug, and (4) a plan of a confirmatory RCT to verify the clinical benefits. Some further considerations are discussed to help design a scientifically sound SAT and communicate with regulatory agencies. Conditions presented in this paper may serve as a set of references for sponsors using SATs for regulatory decisions.
Neural Optimization with Adaptive Heuristics for Intelligent Marketing System
Changshuai Wei, Benjamin Zelditch, Joyce Chen, Andre Assuncao Silva T Ribeiro, Jingyi Kenneth Tay, Borja Ocejo Elizondo, Keerthi Selvaraj, Aman Gupta, Licurgo Benemann De Almeida
May 20 2024 stat.ME cs.AI cs.IR cs.LG math.OC arXiv:2405.10490v3

@misc{2405.10490, author = {Changshuai Wei and Benjamin Zelditch and Joyce Chen and Andre Assuncao Silva T Ribeiro and Jingyi Kenneth Tay and Borja Ocejo Elizondo and Keerthi Selvaraj and Aman Gupta and Licurgo Benemann De Almeida}, title = {{N}eural {O}ptimization with {A}daptive {H}euristics for {I}ntelligent {M}arketing {S}ystem}, year = {2024}, eprint = {2405.10490}, doi = {10.1145/3637528.3671591}, note = {arXiv:2405.10490v3} }
PDF
Computational marketing has become increasingly important in today's digital world, facing challenges such as massive heterogeneous data, multi-channel customer journeys, and limited marketing budgets. In this paper, we propose a general framework for marketing AI systems, the Neural Optimization with Adaptive Heuristics (NOAH) framework. NOAH is the first general framework for marketing optimization that considers both to-business (2B) and to-consumer (2C) products, as well as both owned and paid channels. We describe key modules of the NOAH framework, including prediction, optimization, and adaptive heuristics, providing examples for bidding and content optimization. We then detail the successful application of NOAH to LinkedIn's email marketing system, showcasing significant wins over the legacy ranking system. Additionally, we share details and insights that are broadly useful, particularly on: (i) addressing delayed feedback with lifetime value, (ii) performing large-scale linear programming with randomization, (iii) improving retrieval with audience expansion, (iv) reducing signal dilution in targeting tests, and (v) handling zero-inflated heavy-tail metrics in statistical testing.
Class-attribute Priors: Adapting Optimization to Heterogeneity and Fairness Objective
Xuechen Zhang, Mingchen Li, Jiasi Chen, Christos Thrampoulidis, Samet Oymak
Jan 26 2024 cs.LG cs.CY stat.ML arXiv:2401.14343v1

@misc{2401.14343, author = {Xuechen Zhang and Mingchen Li and Jiasi Chen and Christos Thrampoulidis and Samet Oymak}, title = {{C}lass-attribute {P}riors: {A}dapting {O}ptimization to {H}eterogeneity and {F}airness {O}bjective}, year = {2024}, eprint = {2401.14343}, note = {arXiv:2401.14343v1} }
PDF
Modern classification problems exhibit heterogeneities across individual classes: Each class may have unique attributes, such as sample size, label quality, or predictability (easy vs difficult), and variable importance at test-time. Without care, these heterogeneities impede the learning process, most notably, when optimizing fairness objectives. Confirming this, under a Gaussian mixture setting, we show that the optimal SVM classifier for balanced accuracy needs to be adaptive to the class attributes. This motivates us to propose CAP: An effective and general method that generates a class-specific learning strategy (e.g., a hyperparameter) based on the attributes of that class. This way, the optimization process better adapts to heterogeneities. CAP leads to substantial improvements over the naive approach of assigning separate hyperparameters to each class. We instantiate CAP for loss function design and post-hoc logit adjustment, with emphasis on label-imbalanced problems. We show that CAP is competitive with prior art and its flexibility unlocks clear benefits for fairness objectives beyond balanced accuracy. Finally, we evaluate CAP on problems with label noise as well as weighted test objectives to showcase how CAP can jointly adapt to different heterogeneities.
Gaussian Processes for Time Series with Lead-Lag Effects with applications to biology data
Wancen Mu, Jiawen Chen, Eric S. Davis, Kathleen Reed, Douglas Phanstiel, Michael I. Love, Didong Li
Jan 17 2024 stat.ME arXiv:2401.07400v2

@misc{2401.07400, author = {Wancen Mu and Jiawen Chen and Eric S.~Davis and Kathleen Reed and Douglas Phanstiel and Michael I.~Love and Didong Li}, title = {{G}aussian {P}rocesses for {T}ime {S}eries with {L}ead-{L}ag {E}ffects with applications to biology data}, year = {2024}, eprint = {2401.07400}, note = {arXiv:2401.07400v2} }
PDF
Investigating the relationship, particularly the lead-lag effect, between time series is a common question across various disciplines, especially when uncovering biological processes. However, analyzing time series presents several challenges. Firstly, due to technical reasons, the time points at which observations are made are not at uniform intervals. Secondly, some lead-lag effects are transient, necessitating time-lag estimation based on a limited number of time points. Thirdly, external factors also impact these time series, requiring a similarity metric to assess the lead-lag relationship. To counter these issues, we introduce a model grounded in the Gaussian process, affording the flexibility to estimate lead-lag effects for irregular time series. In addition, our method outputs dissimilarity scores, thereby broadening its applications to include tasks such as ranking or clustering multiple pair-wise time series when considering their strength of lead-lag effects with external factors. Crucially, we offer a series of theoretical proofs to substantiate the validity of our proposed kernels and the identifiability of kernel parameters. Our model demonstrates advances in various simulations and real-world applications, particularly in the study of dynamic chromatin interactions, compared to other leading methods.
Proximal Causal Inference With Text Data
Jacob M. Chen, Rohit Bhattacharya, Katherine A. Keith
Jan 15 2024 cs.CL cs.LG stat.ME arXiv:2401.06687v2

@misc{2401.06687, author = {Jacob M.~Chen and Rohit Bhattacharya and Katherine A.~Keith}, title = {{P}roximal {C}ausal {I}nference {W}ith {T}ext {D}ata}, year = {2024}, eprint = {2401.06687}, note = {arXiv:2401.06687v2} }
PDF
Recent text-based causal methods attempt to mitigate confounding bias by estimating proxies of confounding variables that are partially or imperfectly measured from unstructured text data. These approaches, however, assume analysts have supervised labels of the confounders given text for a subset of instances, a constraint that is sometimes infeasible due to data privacy or annotation costs. In this work, we address settings in which an important confounding variable is completely unobserved. We propose a new causal inference method that uses multiple instances of pre-treatment text data, infers two proxies from two zero-shot models on the separate instances, and applies these proxies in the proximal g-formula. We prove that our text-based proxy method satisfies identification conditions required by the proximal g-formula while other seemingly reasonable proposals do not. We evaluate our method in synthetic and semi-synthetic settings and find that it produces estimates with low bias. To address untestable assumptions associated with the proximal g-formula, we further propose an odds ratio falsification heuristic. This new combination of proximal causal inference and zero-shot classifiers expands the set of text-specific causal methods available to practitioners.
Exact Thresholds for Noisy Non-Adaptive Group Testing
Junren Chen, Jonathan Scarlett
Jan 11 2024 cs.IT cs.DM math.IT math.PR math.ST stat.TH arXiv:2401.04884v1

@misc{2401.04884, author = {Junren Chen and Jonathan Scarlett}, title = {{E}xact {T}hresholds for {N}oisy {N}on-{A}daptive {G}roup {T}esting}, year = {2024}, eprint = {2401.04884}, note = {arXiv:2401.04884v1} }
PDF
In recent years, the mathematical limits and algorithmic bounds for probabilistic group testing have become increasingly well-understood, with exact asymptotic thresholds now being known in general scaling regimes for the noiseless setting. In the noisy setting where each test outcome is flipped with constant probability, there have been similar developments, but the overall understanding has lagged significantly behind the noiseless setting. In this paper, we substantially narrow this gap by deriving exact asymptotic thresholds for the noisy setting under two widely-studied random test designs: i.i.d. Bernoulli and near-constant tests-per-item. These thresholds are established by combining components of an existing information-theoretic threshold decoder with a novel analysis of maximum-likelihood decoding (upper bounds), and deriving a novel set of impossibility results by analyzing certain failure events for optimal maximum-likelihood decoding (lower bounds). Our results show that existing algorithmic upper bounds for the noisy setting are strictly suboptimal, and leave open the interesting question of whether our thresholds can be attained using computationally efficient algorithms.
Uncertainty Quantification on Clinical Trial Outcome Prediction
Tianyi Chen, Yingzhou Lu, Nan Hao, Capucine Van Rechem, Jintai Chen, Tianfan Fu
Jan 09 2024 cs.LG stat.ML arXiv:2401.03482v2

@misc{2401.03482, author = {Tianyi Chen and Yingzhou Lu and Nan Hao and Capucine Van Rechem and Jintai Chen and Tianfan Fu}, title = {{U}ncertainty {Q}uantification on {C}linical {T}rial {O}utcome {P}rediction}, year = {2024}, eprint = {2401.03482}, note = {arXiv:2401.03482v2} }
PDF
The importance of uncertainty quantification is increasingly recognized in the diverse field of machine learning. Accurately assessing model prediction uncertainty can help provide deeper understanding and confidence for researchers and practitioners. This is especially critical in medical diagnosis and drug discovery areas, where reliable predictions directly impact research quality and patient health. In this paper, we propose incorporating uncertainty quantification into clinical trial outcome predictions. Our main goal is to enhance the model's ability to discern nuanced differences, thereby significantly improving its overall performance. We have adopted a selective classification approach to fulfill our objective, integrating it seamlessly with the Hierarchical Interaction Network (HINT), which is at the forefront of clinical trial prediction modeling. Selective classification, encompassing a spectrum of methods for uncertainty quantification, empowers the model to withhold decision-making in the face of samples marked by ambiguity or low confidence, thereby amplifying the accuracy of predictions for the instances it chooses to classify. A series of comprehensive experiments demonstrate that incorporating selective classification into clinical trial predictions markedly enhances the model's performance, as evidenced by significant upticks in pivotal metrics such as PR-AUC, F1, ROC-AUC, and overall accuracy. Specifically, the proposed method achieved 32.37\%, 21.43\%, and 13.27\% relative improvement on PR-AUC over the base model (HINT) in phase I, II, and III trial outcome prediction, respectively. When predicting phase III, our method reaches a PR-AUC score of 0.9022. These findings illustrate the robustness and prospective utility of this strategy within the area of clinical trial predictions, potentially setting a new benchmark in the field.
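To make the idea of withholding a decision on low-confidence samples concrete, here is a minimal sketch of one common form of selective classification: abstain whenever the top predicted probability falls below a threshold. This is a generic illustration, not the method integrated with HINT in the paper; the function names, the threshold of 0.7, and the toy probabilities are assumptions.

```python
import numpy as np

def selective_predict(probs, threshold):
    """Return predicted labels, using -1 to mark abstentions."""
    confidence = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    return np.where(confidence >= threshold, labels, -1)

def selective_accuracy(preds, y_true):
    """Coverage (fraction of samples classified) and accuracy on those samples."""
    kept = preds != -1
    coverage = kept.mean()
    accuracy = (preds[kept] == y_true[kept]).mean() if kept.any() else float("nan")
    return coverage, accuracy

# Hypothetical class probabilities for four samples.
probs = np.array([[0.90, 0.10], [0.55, 0.45], [0.20, 0.80], [0.48, 0.52]])
y_true = np.array([0, 1, 1, 0])
preds = selective_predict(probs, threshold=0.7)
coverage, accuracy = selective_accuracy(preds, y_true)
print(preds, coverage, accuracy)
```

Raising the threshold trades coverage for accuracy on the retained samples, which is the trade-off the abstract refers to when it reports gains on the instances the model chooses to classify.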
Semidefinite Relaxations of the Gromov-Wasserstein Distance
Junyu Chen, Binh T. Nguyen, Shang Hui Koh, Yong Sheng Soh
Dec 25 2023 math.OC stat.ML arXiv:2312.14572v3

@misc{2312.14572, author = {Junyu Chen and Binh T.~Nguyen and Shang Hui Koh and Yong Sheng Soh}, title = {{S}emidefinite {R}elaxations of the {G}romov-{W}asserstein {D}istance}, year = {2023}, eprint = {2312.14572}, note = {arXiv:2312.14572v3} }
PDF
The Gromov-Wasserstein (GW) distance is an extension of the optimal transport problem that allows one to match objects between incomparable spaces. At its core, the GW distance is specified as the solution of a non-convex quadratic program and is not known to be tractable to solve. In particular, existing solvers for the GW distance are only able to find locally optimal solutions. In this work, we propose a semi-definite programming (SDP) relaxation of the GW distance. The relaxation can be viewed as the Lagrangian dual of the GW distance augmented with constraints that relate to the linear and quadratic terms of transportation plans. In particular, our relaxation provides a tractable (polynomial-time) algorithm to compute globally optimal transportation plans (in some instances) together with an accompanying proof of global optimality. Our numerical experiments suggest that the proposed relaxation is strong in that it frequently computes the globally optimal solution. Our Python implementation is available at https://github.com/tbng/gwsdp.
A Bayesian Spatial Berkson error approach to estimate small area opioid mortality rates accounting for population-at-risk uncertainty
Emily N Peterson, Rachel C. Nethery, Jarvis T. Chen, Loni P. Tabb, Brent A. Coull, Frederic B. Piel, Lance A Waller
Dec 22 2023 stat.ME stat.AP arXiv:2312.13331v1

@misc{2312.13331, author = {Emily N Peterson and Rachel C.~Nethery and Jarvis T.~Chen and Loni P.~Tabb and Brent A.~Coull and Frederic B.~Piel and Lance A Waller}, title = {{A} {B}ayesian {S}patial {B}erkson error approach to estimate small area opioid mortality rates accounting for population-at-risk uncertainty}, year = {2023}, eprint = {2312.13331}, note = {arXiv:2312.13331v1} }
PDF
Monitoring small-area geographical population trends in opioid mortality has large-scale implications for informing preventative resource allocation. A common approach to obtain small area estimates of opioid mortality is to use a standard disease mapping approach in which population-at-risk estimates are treated as fixed and known. Assuming fixed populations ignores the uncertainty surrounding small area population estimates, which may bias risk estimates and under-estimate their associated uncertainties. We present a Bayesian Spatial Berkson Error (BSBE) model to incorporate population-at-risk uncertainty within a disease mapping model. We compare the BSBE approach to the naive approach (treating denominators as fixed) using simulation studies to illustrate potential bias resulting from this assumption. We show the application of the BSBE model to obtain 2020 opioid mortality risk estimates for 159 counties in GA accounting for population-at-risk uncertainty. Utilizing our proposed approach will help to inform interventions in opioid related public health responses, policies, and resource allocation. Additionally, we provide a general framework to improve the estimation and mapping of health indicators.
Enhancing Polynomial Chaos Expansion Based Surrogate Modeling using a Novel Probabilistic Transfer Learning Strategy
Wyatt Bridgman, Uma Balakrishnan, Reese Jones, Jiefu Chen, Xuqing Wu, Cosmin Safta, Yueqin Huang, Mohammad Khalil
Dec 11 2023 stat.ML cs.LG arXiv:2312.04648v1

@misc{2312.04648, author = {Wyatt Bridgman and Uma Balakrishnan and Reese Jones and Jiefu Chen and Xuqing Wu and Cosmin Safta and Yueqin Huang and Mohammad Khalil}, title = {{E}nhancing {P}olynomial {C}haos {E}xpansion {B}ased {S}urrogate {M}odeling using a {N}ovel {P}robabilistic {T}ransfer {L}earning {S}trategy}, year = {2023}, eprint = {2312.04648}, note = {arXiv:2312.04648v1} }
PDF
In the field of surrogate modeling, polynomial chaos expansion (PCE) allows practitioners to construct inexpensive yet accurate surrogates to be used in place of the expensive forward model simulations. For black-box simulations, non-intrusive PCE allows the construction of these surrogates using a set of simulation response evaluations. In this context, the PCE coefficients can be obtained using linear regression, which is also known as point collocation or stochastic response surfaces. Regression exhibits better scalability and can handle noisy function evaluations in contrast to other non-intrusive approaches, such as projection. However, since over-sampling is generally advisable for the linear regression approach, the simulation requirements become prohibitive for expensive forward models. We propose to leverage transfer learning whereby knowledge gained through similar PCE surrogate construction tasks (source domains) is transferred to a new surrogate-construction task (target domain) which has a limited number of forward model simulations (training data). The proposed transfer learning strategy determines how much, if any, information to transfer using new techniques inspired by Bayesian modeling and data assimilation. The strategy is scrutinized using numerical investigations and applied to an engineering problem from the oil and gas industry.
Target-agnostic Source-free Domain Adaptation for Regression Tasks
Tianlang He, Zhiqiu Xia, Jierun Chen, Haoliang Li, S.-H. Gary Chan
Dec 04 2023 cs.LG cs.AI stat.ML arXiv:2312.00540v1

@misc{2312.00540, author = {Tianlang He and Zhiqiu Xia and Jierun Chen and Haoliang Li and S.-H.~Gary Chan}, title = {{T}arget-agnostic {S}ource-free {D}omain {A}daptation for {R}egression {T}asks}, year = {2023}, eprint = {2312.00540}, note = {arXiv:2312.00540v1} }
PDF
Unsupervised domain adaptation (UDA) seeks to bridge the domain gap between the target and source using unlabeled target data. Source-free UDA removes the requirement for labeled source data at the target to preserve data privacy and storage. However, work on source-free UDA assumes knowledge of domain gap distribution, and hence is limited to either target-aware settings or classification tasks. To overcome this limitation, we propose TASFAR, a novel target-agnostic source-free domain adaptation approach for regression tasks. Using prediction confidence, TASFAR estimates a label density map as the target label distribution, which is then used to calibrate the source model on the target domain. We have conducted extensive experiments on four regression tasks with various domain gaps, namely, pedestrian dead reckoning for different users, image-based people counting in different scenes, housing-price prediction at different districts, and taxi-trip duration prediction from different departure points. TASFAR is shown to substantially outperform the state-of-the-art source-free UDA approaches, reducing errors by 22% on average across the four tasks, and to achieve accuracy comparable to source-based UDA without using source data.
Towards Aligned Canonical Correlation Analysis: Preliminary Formulation and Proof-of-Concept Results
Biqian Cheng, Evangelos E. Papalexakis, Jia Chen
Dec 04 2023 cs.LG stat.ML arXiv:2312.00296v2

@misc{2312.00296, author = {Biqian Cheng and Evangelos E.~Papalexakis and Jia Chen}, title = {{T}owards {A}ligned {C}anonical {C}orrelation {A}nalysis: {P}reliminary {F}ormulation and {P}roof-of-{C}oncept {R}esults}, year = {2023}, eprint = {2312.00296}, note = {arXiv:2312.00296v2} }
PDF
Canonical Correlation Analysis (CCA) has been widely applied to jointly embed multiple views of data in a maximally correlated latent space. However, the alignment between various data perspectives, which is required by traditional approaches, is unclear in many practical cases. In this work, we propose a new framework, Aligned Canonical Correlation Analysis (ACCA), to address this challenge by iteratively solving the alignment and multi-view embedding.
Batch effect correction with sample remeasurement in highly confounded case-control studies
Hanxuan Ye, Xianyang Zhang, Chen Wang, Ellen L. Goode, Jun Chen
Nov 07 2023 stat.ME arXiv:2311.03289v1

@misc{2311.03289, author = {Hanxuan Ye and Xianyang Zhang and Chen Wang and Ellen L.~Goode and Jun Chen}, title = {{B}atch effect correction with sample remeasurement in highly confounded case-control studies}, year = {2023}, eprint = {2311.03289}, note = {arXiv:2311.03289v1} }
PDF
Batch effects are pervasive in biomedical studies. One approach to address the batch effects is repeatedly measuring a subset of samples in each batch. These remeasured samples are used to estimate and correct the batch effects. However, rigorous statistical methods for batch effect correction with remeasured samples are severely under-developed. In this study, we developed a framework for batch effect correction using remeasured samples in highly confounded case-control studies. We provided theoretical analyses of the proposed procedure, evaluated its power characteristics, and provided a power calculation tool to aid in the study design. We found that the number of samples that need to be remeasured depends strongly on the between-batch correlation. When the correlation is high, remeasuring a small subset of samples can rescue most of the power.
Generative Learning of Continuous Data by Tensor Networks
Alex Meiburg, Jing Chen, Jacob Miller, Raphaëlle Tihon, Guillaume Rabusseau, Alejandro Perdomo-Ortiz
Nov 01 2023 cs.LG cond-mat.stat-mech quant-ph stat.ML arXiv:2310.20498v2

@misc{2310.20498, author = {Alex Meiburg and Jing Chen and Jacob Miller and Raphaëlle Tihon and Guillaume Rabusseau and Alejandro Perdomo-Ortiz}, title = {{G}enerative {L}earning of {C}ontinuous {D}ata by {T}ensor {N}etworks}, year = {2023}, eprint = {2310.20498}, note = {arXiv:2310.20498v2} }
PDF
Beyond their origin in modeling many-body quantum systems, tensor networks have emerged as a promising class of models for solving machine learning problems, notably in unsupervised generative learning. While possessing many desirable features arising from their quantum-inspired nature, tensor network generative models have previously been largely restricted to binary or categorical data, limiting their utility in real-world modeling problems. We overcome this by introducing a new family of tensor network generative models for continuous data, which are capable of learning from distributions containing continuous random variables. We develop our method in the setting of matrix product states, first deriving a universal expressivity theorem proving the ability of this model family to approximate any reasonably smooth probability density function with arbitrary precision. We then benchmark the performance of this model on several synthetic and real-world datasets, finding that the model learns and generalizes well on distributions of continuous and discrete variables. We develop methods for modeling different data domains, and introduce a trainable compression layer which is found to increase model performance given limited memory or computational resources. Overall, our methods give important theoretical and empirical evidence of the efficacy of quantum-inspired methods for the rapidly growing field of generative learning.
Robust nonparametric regression based on deep ReLU neural networks
Juntong Chen
Nov 01 2023 stat.ME arXiv:2310.20294v1

@misc{2310.20294, author = {Juntong Chen}, title = {{R}obust nonparametric regression based on deep {R}e{LU} neural networks}, year = {2023}, eprint = {2310.20294}, note = {arXiv:2310.20294v1} }
PDF
In this paper, we consider robust nonparametric regression using deep neural networks with ReLU activation function. While several existing theoretically justified methods are geared towards robustness against identical heavy-tailed noise distributions, the rise of adversarial attacks has emphasized the importance of safeguarding estimation procedures against systematic contamination. We approach this statistical issue by shifting our focus towards estimating conditional distributions. To address it robustly, we introduce a novel estimation procedure based on $\ell$-estimation. Under a mild model assumption, we establish general non-asymptotic risk bounds for the resulting estimators, showcasing their robustness against contamination, outliers, and model misspecification. We then delve into the application of our approach using deep ReLU neural networks. When the model is well-specified and the regression function belongs to an $\alpha$-Hölder class, employing $\ell$-type estimation on suitable networks enables the resulting estimators to achieve the minimax optimal rate of convergence. Additionally, we demonstrate that deep $\ell$-type estimators can circumvent the curse of dimensionality by assuming the regression function closely resembles the composition of several Hölder functions. To attain this, new deep fully-connected ReLU neural networks have been designed to approximate this composition class. This approximation result can be of independent interest.
MCRAGE: Synthetic Healthcare Data for Fairness
Keira Behal, Jiayi Chen, Caleb Fikes, Sophia Xiao
Oct 31 2023 stat.ML cs.LG arXiv:2310.18430v3

@misc{2310.18430, author = {Keira Behal and Jiayi Chen and Caleb Fikes and Sophia Xiao}, title = {{MCRAGE}: {S}ynthetic {H}ealthcare {D}ata for {F}airness}, year = {2023}, eprint = {2310.18430}, note = {arXiv:2310.18430v3} }
PDF
In the field of healthcare, electronic health records (EHR) serve as crucial training data for developing machine learning models for diagnosis, treatment, and the management of healthcare resources. However, medical datasets are often imbalanced in terms of sensitive attributes such as race/ethnicity, gender, and age. Machine learning models trained on class-imbalanced EHR datasets perform significantly worse in deployment for individuals of the minority classes compared to those from majority classes, which may lead to inequitable healthcare outcomes for minority groups. To address this challenge, we propose Minority Class Rebalancing through Augmentation by Generative modeling (MCRAGE), a novel approach to augment imbalanced datasets using samples generated by a deep generative model. The MCRAGE process involves training a Conditional Denoising Diffusion Probabilistic Model (CDDPM) capable of generating high-quality synthetic EHR samples from underrepresented classes. We use this synthetic data to augment the existing imbalanced dataset, resulting in a more balanced distribution across all classes, which can be used to train less biased downstream models. We measure the performance of MCRAGE versus alternative approaches using Accuracy, F1 score and AUROC of these downstream models. We provide theoretical justification for our method in terms of recent convergence results for DDPMs.
On the Identifiability and Interpretability of Gaussian Process Models
Jiawen Chen, Wancen Mu, Yun Li, Didong Li
Oct 27 2023 stat.ML cs.LG arXiv:2310.17023v1

@misc{2310.17023, author = {Jiawen Chen and Wancen Mu and Yun Li and Didong Li}, title = {{O}n the {I}dentifiability and {I}nterpretability of {G}aussian {P}rocess {M}odels}, year = {2023}, eprint = {2310.17023}, note = {arXiv:2310.17023v1} }
PDF
In this paper, we critically examine the prevalent practice of using additive mixtures of Matérn kernels in single-output Gaussian process (GP) models and explore the properties of multiplicative mixtures of Matérn kernels for multi-output GP models. For the single-output case, we derive a series of theoretical results showing that the smoothness of a mixture of Matérn kernels is determined by the least smooth component and that a GP with such a kernel is effectively equivalent to the least smooth kernel component. Furthermore, we demonstrate that none of the mixing weights or parameters within individual kernel components are identifiable. We then turn our attention to multi-output GP models and analyze the identifiability of the covariance matrix $A$ in the multiplicative kernel $K(x,y) = AK_0(x,y)$, where $K_0$ is a standard single output kernel such as Matérn. We show that $A$ is identifiable up to a multiplicative constant, suggesting that multiplicative mixtures are well suited for multi-output tasks. Our findings are supported by extensive simulations and real applications for both single- and multi-output settings. This work provides insight into kernel selection and interpretation for GP models, emphasizing the importance of choosing appropriate kernel structures for different tasks.
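The scale ambiguity mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a Kronecker layout for the multi-output covariance and a Matérn-3/2 base kernel with its own variance parameter; the function names `matern32` and `multi_output_cov` are hypothetical and are not taken from the paper. It only shows why $A$ can at best be pinned down up to a multiplicative constant when $K_0$ carries a free scale, not the paper's identifiability proof.

```python
import numpy as np

def matern32(x, y, variance=1.0, lengthscale=1.0):
    """Matern-3/2 kernel matrix between 1-D input arrays x and y."""
    r = np.abs(x[:, None] - y[None, :]) / lengthscale
    return variance * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def multi_output_cov(A, K0):
    """Covariance of a multi-output GP with kernel A * K0 (assumed Kronecker layout)."""
    return np.kron(A, K0)

x = np.linspace(0.0, 1.0, 5)          # illustrative inputs
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # illustrative output covariance
c = 4.0                               # arbitrary rescaling constant

# Rescaling A by c and the base-kernel variance by 1/c leaves the joint
# covariance unchanged, so the data cannot distinguish the two.
K_a = multi_output_cov(A, matern32(x, x, variance=1.0))
K_b = multi_output_cov(c * A, matern32(x, x, variance=1.0 / c))

print(np.allclose(K_a, K_b))
```

Fixing the variance of $K_0$ (for example, to 1) removes this particular ambiguity in the sketch; whether the paper adopts that normalization is not stated in the abstract.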
Covariance Operator Estimation: Sparsity, Lengthscale, and Ensemble Kalman Filters
Omar Al-Ghattas, Jiaheng Chen, Daniel Sanz-Alonso, Nathan Waniorek
Oct 27 2023 math.ST math.PR stat.TH arXiv:2310.16933v2

@misc{2310.16933, author = {Omar Al-Ghattas and Jiaheng Chen and Daniel Sanz-Alonso and Nathan Waniorek}, title = {{C}ovariance {O}perator {E}stimation: {S}parsity, {L}engthscale, and {E}nsemble {K}alman {F}ilters}, year = {2023}, eprint = {2310.16933}, note = {arXiv:2310.16933v2} }
PDF
This paper investigates covariance operator estimation via thresholding. For Gaussian random fields with approximately sparse covariance operators, we establish non-asymptotic bounds on the estimation error in terms of the sparsity level of the covariance and the expected supremum of the field. We prove that thresholded estimators enjoy an exponential improvement in sample complexity compared with the standard sample covariance estimator if the field has a small correlation lengthscale. As an application of the theory, we study thresholded estimation of covariance operators within ensemble Kalman filters.
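As a rough finite-dimensional analogue of the thresholding idea in the abstract above, the sketch below hard-thresholds a sample covariance matrix of a discretized Gaussian field with a short correlation lengthscale. The grid size, sample size, squared-exponential covariance, and threshold `tau` are all illustrative assumptions; the paper works with covariance operators and its own choice of threshold.

```python
import numpy as np

def sample_covariance(X):
    """Sample covariance of the rows of X (n samples by d coordinates)."""
    Xc = X - X.mean(axis=0, keepdims=True)
    C = Xc.T @ Xc / X.shape[0]
    return (C + C.T) / 2.0            # enforce exact symmetry

def hard_threshold(C, tau):
    """Keep entries with magnitude at least tau and set the rest to zero."""
    return np.where(np.abs(C) >= tau, C, 0.0)

rng = np.random.default_rng(0)
d, n = 40, 25                         # grid points and samples (assumed values)
lengthscale, tau = 1.0, 0.3           # assumed lengthscale and threshold

# Squared-exponential covariance on an integer grid: entries decay quickly
# away from the diagonal, so the matrix is approximately sparse.
grid = np.arange(d)
true_cov = np.exp(-((grid[:, None] - grid[None, :]) ** 2) / (2.0 * lengthscale**2))

X = rng.multivariate_normal(np.zeros(d), true_cov, size=n)
C_hat = sample_covariance(X)
C_thr = hard_threshold(C_hat, tau)

# Operator-norm errors of the two estimators for this single draw.
err_sample = np.linalg.norm(C_hat - true_cov, ord=2)
err_thresholded = np.linalg.norm(C_thr - true_cov, ord=2)
print(f"sample: {err_sample:.3f}, thresholded: {err_thresholded:.3f}")
```

The printed errors depend on the random draw and on `tau`, so a single run should not be read as evidence for the sample-complexity result stated in the abstract.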
Empirical limit theorems for Wiener chaos
Shuyang Bai, Jiemiao Chen
Oct 25 2023 math.PR math.ST stat.TH arXiv:2310.15462v4

@misc{2310.15462, author = {Shuyang Bai and Jiemiao Chen}, title = {{E}mpirical limit theorems for {W}iener chaos}, year = {2023}, eprint = {2310.15462}, doi = {10.1016/j.spl.2024.110222}, note = {arXiv:2310.15462v4} }
PDF
We consider empirical measures in a triangular array setup with underlying distributions varying as sample size grows. We study asymptotic properties of multiple integrals with respect to normalized empirical measures. Limit theorems involving series of multiple Wiener-Itô integrals are established.
A Unified Framework for Uniform Signal Recovery in Nonlinear Generative Compressed Sensing
Junren Chen, Jonathan Scarlett, Michael K. Ng, Zhaoqiang Liu
Oct 24 2023 eess.SP cs.IT cs.LG math.IT stat.ML arXiv:2310.03758v2

@misc{2310.03758, author = {Junren Chen and Jonathan Scarlett and Michael K.~Ng and Zhaoqiang Liu}, title = {{A} {U}nified {F}ramework for {U}niform {S}ignal {R}ecovery in {N}onlinear {G}enerative {C}ompressed {S}ensing}, year = {2023}, eprint = {2310.03758}, note = {arXiv:2310.03758v2} }
PDF
In generative compressed sensing (GCS), we want to recover a signal $\mathbf{x}^* \in \mathbb{R}^n$ from $m$ measurements ($m\ll n$) using a generative prior $\mathbf{x}^*\in G(\mathbb{B}_2^k(r))$, where $G$ is typically an $L$-Lipschitz continuous generative model and $\mathbb{B}_2^k(r)$ represents the radius-$r$ $\ell_2$-ball in $\mathbb{R}^k$. Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously. In this paper, we build a unified framework to derive uniform recovery guarantees for nonlinear GCS where the observation model is nonlinear and possibly discontinuous or unknown. Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples. Specifically, using a single realization of the sensing ensemble and generalized Lasso, all $\mathbf{x}^*\in G(\mathbb{B}_2^k(r))$ can be recovered up to an $\ell_2$-error at most $\epsilon$ using roughly $\tilde{O}({k}/{\epsilon^2})$ samples, with omitted logarithmic factors typically being dominated by $\log L$. Notably, this almost coincides with existing non-uniform guarantees up to logarithmic factors, hence the uniformity costs very little. As part of our technical contributions, we introduce the Lipschitz approximation to handle discontinuous observation models. We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy. Experimental results are presented to corroborate our theory.
Optimal Conditional Inference in Adaptive Experiments
Jiafeng Chen, Isaiah Andrews
Sep 22 2023 stat.ME cs.LG econ.EM math.ST stat.TH arXiv:2309.12162v1

@misc{2309.12162, author = {Jiafeng Chen and Isaiah Andrews}, title = {{O}ptimal {C}onditional {I}nference in {A}daptive {E}xperiments}, year = {2023}, eprint = {2309.12162}, note = {arXiv:2309.12162v1} }
PDF
We study batched bandit experiments and consider the problem of inference conditional on the realized stopping time, assignment probabilities, and target parameter, where all of these may be chosen adaptively using information up to the last batch of the experiment. Absent further restrictions on the experiment, we show that inference using only the results of the last batch is optimal. When the adaptive aspects of the experiment are known to be location-invariant, in the sense that they are unchanged when we shift all batch-arm means by a constant, we show that there is additional information in the data, captured by one additional linear function of the batch-arm means. In the more restrictive case where the stopping time, assignment probabilities, and target parameter are known to depend on the data only through a collection of polyhedral events, we derive computationally tractable and optimal conditional inference procedures.
Solving Quadratic Systems with Full-Rank Matrices Using Sparse or Generative Priors
Junren Chen, Shuai Huang, Michael K. Ng, Zhaoqiang Liu
Sep 19 2023 cs.IT cs.LG eess.SP math.IT stat.ML arXiv:2309.09032v1

@misc{2309.09032, author = {Junren Chen and Shuai Huang and Michael K.~Ng and Zhaoqiang Liu}, title = {{S}olving {Q}uadratic {S}ystems with {F}ull-{R}ank {M}atrices {U}sing {S}parse or {G}enerative {P}riors}, year = {2023}, eprint = {2309.09032}, note = {arXiv:2309.09032v1} }
PDF
The problem of recovering a signal $\boldsymbol{x} \in \mathbb{R}^n$ from a quadratic system $\{y_i=\boldsymbol{x}^\top\boldsymbol{A}_i\boldsymbol{x},\ i=1,\ldots,m\}$ with full-rank matrices $\boldsymbol{A}_i$ frequently arises in applications such as unassigned distance geometry and sub-wavelength imaging. With i.i.d. standard Gaussian matrices $\boldsymbol{A}_i$, this paper addresses the high-dimensional case where $m\ll n$ by incorporating prior knowledge of $\boldsymbol{x}$. First, we consider a $k$-sparse $\boldsymbol{x}$ and introduce the thresholded Wirtinger flow (TWF) algorithm that does not require the sparsity level $k$. TWF comprises two steps: the spectral initialization that identifies a point sufficiently close to $\boldsymbol{x}$ (up to a sign flip) when $m=O(k^2\log n)$, and the thresholded gradient descent (with a good initialization) that produces a sequence linearly converging to $\boldsymbol{x}$ with $m=O(k\log n)$ measurements. Second, we explore the generative prior, assuming that $\boldsymbol{x}$ lies in the range of an $L$-Lipschitz continuous generative model with $k$-dimensional inputs in an $\ell_2$-ball of radius $r$. We develop the projected gradient descent (PGD) algorithm that also comprises two steps: the projected power method that provides an initial vector with $O\big(\sqrt{\frac{k \log L}{m}}\big)$ $\ell_2$-error given $m=O(k\log(Lnr))$ measurements, and the projected gradient descent that refines the $\ell_2$-error to $O(\delta)$ at a geometric rate when $m=O(k\log\frac{Lrn}{\delta^2})$. Experimental results corroborate our theoretical findings and show that: (i) our approach for the sparse case notably outperforms the existing provable algorithm sparse power factorization; (ii) leveraging the generative prior allows for precise image recovery in the MNIST dataset from a small number of quadratic measurements.
Optimal Estimation under a Semiparametric Density Ratio Model
Archer Gong Zhang, Jiahua Chen
Sep 19 2023 stat.ME econ.EM math.ST stat.TH arXiv:2309.09103v1

@misc{2309.09103, author = {Archer Gong Zhang and Jiahua Chen}, title = {{O}ptimal {E}stimation under a {S}emiparametric {D}ensity {R}atio {M}odel}, year = {2023}, eprint = {2309.09103}, note = {arXiv:2309.09103v1} }
PDF
In many statistical and econometric applications, we gather individual samples from various interconnected populations that undeniably exhibit common latent structures. Utilizing a model that incorporates these latent structures for such data enhances the efficiency of inferences. Recently, many researchers have been adopting the semiparametric density ratio model (DRM) to address the presence of latent structures. The DRM enables estimation of each population distribution using pooled data, resulting in statistically more efficient estimations in contrast to nonparametric methods that analyze each sample in isolation. In this article, we investigate the limit of the efficiency improvement attainable through the DRM. We focus on situations where one population's sample size significantly exceeds those of the other populations. In such scenarios, we demonstrate that the DRM-based inferences for populations with smaller sample sizes achieve the highest attainable asymptotic efficiency as if a parametric model is assumed. The estimands we consider include the model parameters, distribution functions, and quantiles. We use simulation experiments to support the theoretical findings with a specific focus on quantile estimation. Additionally, we provide an analysis of real revenue data from U.S. collegiate sports to illustrate the efficacy of our contribution.
Anomaly Detection in Spatio-Temporal Data: Theory and Application
Ji Chen
Sep 19 2023 stat.ME stat.AP arXiv:2309.09878v1

@misc{2309.09878, author = {Ji Chen}, title = {{A}nomaly {D}etection in {S}patio-{T}emporal {D}ata: {T}heory and {A}pplication}, year = {2023}, eprint = {2309.09878}, note = {arXiv:2309.09878v1} }
PDF
This paper provides an overview of three notable approaches for detecting anomalies in spatio-temporal data. The three reviewed methods are drawn from the frameworks of multivariate statistical process control (SPC), scan statistics, and tensor decomposition. For each method, we first present its technical details and then apply it to a real-world dataset of 300 satellite images of solar activity. Our findings reveal that these methods possess distinct strengths. Specifically, scan statistics excel at identifying clustered anomalies, multivariate SPC is effective in detecting sparse anomalies, and tensor decomposition is adept at identifying anomalies exhibiting desirable patterns, such as temporal circularity. We emphasize the importance of customizing the selection of these methods based on the specific characteristics of the dataset and the analysis objectives.
Monotone Tree-Based GAMI Models by Adapting XGBoost
Linwei Hu, Soroush Aramideh, Jie Chen, Vijayan N. Nair
Sep 07 2023 stat.ML cs.LG arXiv:2309.02426v1

@misc{2309.02426, author = {Linwei Hu and Soroush Aramideh and Jie Chen and Vijayan N.~Nair}, title = {{M}onotone {T}ree-{B}ased {GAMI} {M}odels by {A}dapting {XGB}oost}, year = {2023}, eprint = {2309.02426}, note = {arXiv:2309.02426v1} }
PDF
Recent papers have used machine learning architecture to fit low-order functional ANOVA models with main effects and second-order interactions. These GAMI (GAM + Interaction) models are directly interpretable as the functional main effects and interactions can be easily plotted and visualized. Unfortunately, it is not easy to incorporate the monotonicity requirement into the existing GAMI models based on boosted trees, such as EBM (Lou et al. 2013) and GAMI-Lin-T (Hu et al. 2022). This paper considers models of the form $f(x)=\sum_{j,k}f_{j,k}(x_j, x_k)$ and develops monotone tree-based GAMI models, called monotone GAMI-Tree, by adapting the XGBoost algorithm. It is straightforward to fit a monotone model to $f(x)$ using the options in XGBoost. However, the fitted model is still a black box. We take a different approach: i) use a filtering technique to determine the important interactions, ii) fit a monotone XGBoost algorithm with the selected interactions, and finally iii) parse and purify the results to get a monotone GAMI model. Simulated datasets are used to demonstrate the behaviors of mono-GAMI-Tree and EBM, both of which use piecewise constant fits. Note that the monotonicity requirement is for the full model. Under certain situations, the main effects will also be monotone. But, as seen in the examples, the interactions will not be monotone.
A Parameter-Free Two-Bit Covariance Estimator with Improved Operator Norm Error Rate
Junren Chen, Michael K. Ng
Aug 31 2023 stat.ML cs.IT cs.LG math.IT arXiv:2308.16059v1

@misc{2308.16059, author = {Junren Chen and Michael K.~Ng}, title = {{A} {P}arameter-{F}ree {T}wo-{B}it {C}ovariance {E}stimator with {I}mproved {O}perator {N}orm {E}rror {R}ate}, year = {2023}, eprint = {2308.16059}, note = {arXiv:2308.16059v1} }
PDF
A covariance matrix estimator using two bits per entry was recently developed by Dirksen, Maly and Rauhut [Annals of Statistics, 50(6), pp. 3538-3562]. The estimator achieves a near-minimax rate for general sub-Gaussian distributions, but also suffers from two downsides: theoretically, there is an essential gap in operator norm error between their estimator and the sample covariance when the diagonal of the covariance matrix is dominated by only a few entries; practically, its performance heavily relies on the dithering scale, which needs to be tuned according to some unknown parameters. In this work, we propose a new 2-bit covariance matrix estimator that simultaneously addresses both issues. Unlike the sign quantizer associated with uniform dither in Dirksen et al., we adopt a triangular dither prior to a 2-bit quantizer inspired by the multi-bit uniform quantizer. By employing dithering scales varying across entries, our estimator enjoys an improved operator norm error rate that depends on the effective rank of the underlying covariance matrix rather than the ambient dimension, thus closing the theoretical gap. Moreover, our proposed method eliminates the need for any tuning parameter, as the dithering scales are entirely determined by the data. Experimental results under Gaussian samples are provided to showcase the impressive numerical performance of our estimator. Remarkably, by halving the dithering scales, our estimator oftentimes achieves operator norm errors less than twice those of the sample covariance.
Deep Generative Imputation Model for Missing Not At Random Data
Jialei Chen, Yuanbo Xu, Pengyang Wang, Yongjian Yang
Aug 17 2023 cs.LG stat.ML arXiv:2308.08158v1

@misc{2308.08158, author = {Jialei Chen and Yuanbo Xu and Pengyang Wang and Yongjian Yang}, title = {{D}eep {G}enerative {I}mputation {M}odel for {M}issing {N}ot {A}t {R}andom {D}ata}, year = {2023}, eprint = {2308.08158}, doi = {10.1145/3583780.3614835}, note = {arXiv:2308.08158v1} }
PDF
Data analysis usually suffers from the Missing Not At Random (MNAR) problem, where the cause of the missing values is not fully observed. Compared with the simpler Missing Completely At Random (MCAR) problem, it is closer to realistic scenarios but more complex and challenging. Existing statistical methods model the MNAR mechanism by different decompositions of the joint distribution of the complete data and the missing mask. However, we find empirically that directly incorporating these statistical methods into deep generative models is sub-optimal. Specifically, it would neglect the confidence of the reconstructed mask during the MNAR imputation process, which leads to insufficient information extraction and less-guaranteed imputation quality. In this paper, we revisit the MNAR problem from a novel perspective in which the complete data and the missing mask are two modalities of incomplete data on an equal footing. Along this line, we put forward a generative-model-specific joint probability decomposition method, the conjunction model, to represent the distributions of the two modalities in parallel and extract sufficient information from both the complete data and the missing mask. Taking a step further, we exploit a deep generative imputation model, namely GNR, to process the real-world missing mechanism in the latent space and concurrently impute the incomplete data and reconstruct the missing mask. The experimental results show that our GNR surpasses state-of-the-art MNAR baselines by significant margins (average improvements ranging from 9.9% to 18.8% in RMSE) and always gives better mask reconstruction accuracy, which makes the imputation more principled.
Gender Inclusive Methods in Studies of STEM Practitioners
Kaitlin Rasmussen, Jocelyne Chen, Rebecca L. Colquhoun, Sophia Frentz, Laurel Hiatt, Aiden James Kosciesza, Charlotte Olsen, Theo J. O'Neill, Vic Zamloot, Beckett E. Strauss
Aug 01 2023 stat.AP arXiv:2307.15802v1

@misc{2307.15802, author = {Kaitlin Rasmussen and Jocelyne Chen and Rebecca L.~Colquhoun and Sophia Frentz and Laurel Hiatt and Aiden James Kosciesza and Charlotte Olsen and Theo J.~O'Neill and Vic Zamloot and Beckett E.~Strauss}, title = {{G}ender {I}nclusive {M}ethods in {S}tudies of {STEM} {P}ractitioners}, year = {2023}, eprint = {2307.15802}, note = {arXiv:2307.15802v1} }
PDF
Gender inequity is one of the biggest challenges facing the STEM workforce. While there are many studies that look into gender disparities within STEM and academia, the majority of these have been designed and executed by those unfamiliar with research in sociology and gender studies. They adopt a normative view of gender as a binary choice of 'male' or 'female,' leaving individuals whose genders do not fit within that model out of such research entirely. This especially impacts those experiencing multiple axes of marginalization, such as race, disability, and socioeconomic status. For STEM fields to recruit and retain members of historically excluded groups, a new paradigm must be developed. Here, we collate a new dataset of the methods used in 119 past studies of gender equity, and recommend better survey practices and institutional policies based on a more complex and accurate approach to gender. We find that problematic approaches to gender in surveys can be classified into 5 main themes - treating gender as white, observable, discrete, as a statistic, and as inconsequential. We recommend allowing self-reporting of gender and never automating gender assignment within research. This work identifies the key areas of development for studies of gender-based inclusion within STEM, and provides recommended solutions to support the methodological uplift required for this work to be both scientifically sound and fully inclusive.
Towards Generalizable Reinforcement Learning for Trade Execution
Chuheng Zhang, Yitong Duan, Xiaoyu Chen, Jianyu Chen, Jian Li, Li Zhao
Jul 24 2023 q-fin.TR cs.LG stat.ML arXiv:2307.11685v1

@misc{2307.11685, author = {Chuheng Zhang and Yitong Duan and Xiaoyu Chen and Jianyu Chen and Jian Li and Li Zhao}, title = {{T}owards {G}eneralizable {R}einforcement {L}earning for {T}rade {E}xecution}, year = {2023}, eprint = {2307.11685}, note = {arXiv:2307.11685v1} }
PDF
Optimized trade execution is to sell (or buy) a given amount of assets in a given time with the lowest possible trading cost. Recently, reinforcement learning (RL) has been applied to optimized trade execution to learn smarter policies from market data. However, we find that many existing RL methods exhibit considerable overfitting which prevents them from real deployment. In this paper, we provide an extensive study on the overfitting problem in optimized trade execution. First, we model the optimized trade execution as offline RL with dynamic context (ORDC), where the context represents market variables that cannot be influenced by the trading policy and are collected in an offline manner. Under this framework, we derive the generalization bound and find that the overfitting issue is caused by large context space and limited context samples in the offline setting. Accordingly, we propose to learn compact representations for context to address the overfitting problem, either by leveraging prior knowledge or in an end-to-end manner. To evaluate our algorithms, we also implement a carefully designed simulator based on historical limit order book (LOB) data to provide a high-fidelity benchmark for different algorithms. Our experiments on the high-fidelity simulator demonstrate that our algorithms can effectively alleviate overfitting and achieve better performance.
Considerations for Master Protocols Using External Controls
Jie Chen, Xiaoyun Li, Chengxing Lu, Sammy Yuan, Godwin Yung, Jingjing Ye, Hong Tian, Jianchang Lin
Jul 12 2023 stat.AP arXiv:2307.05050v2

@misc{2307.05050, author = {Jie Chen and Xiaoyun Li and Chengxing Lu and Sammy Yuan and Godwin Yung and Jingjing Ye and Hong Tian and Jianchang Lin}, title = {{C}onsiderations for {M}aster {P}rotocols {U}sing {E}xternal {C}ontrols}, year = {2023}, eprint = {2307.05050}, note = {arXiv:2307.05050v2} }
PDF
There has been an increasing use of master protocols in oncology clinical trials because of their efficiency and flexibility to accelerate cancer drug development. Depending on the study objective and design, a master protocol trial can be a basket trial, an umbrella trial, a platform trial, or any other form of trial in which multiple investigational products and/or subpopulations are studied under a single protocol. Master protocols can use external data and evidence (e.g., external controls) for treatment effect estimation, which can further improve the efficiency of master protocol trials. This paper provides an overview of different types of external controls and their unique features when used in master protocols. Some key considerations in master protocols with external controls are discussed, including construction of estimands, assessment of fit-for-use real-world data, and considerations for different types of master protocols. Similarities and differences between regular randomized controlled trials and master protocols when using external controls are discussed. A targeted learning-based causal roadmap is presented which comprises three key steps: (1) define a target statistical estimand that aligns with the causal estimand for the study objective, (2) use an efficient estimator to estimate the target statistical estimand and its uncertainty, and (3) evaluate the impact of causal assumptions on the study conclusion by performing sensitivity analyses. Two illustrative examples for master protocols using external controls are discussed for their merits and possible improvement in causal effect estimation.
A Direct Approach to Simultaneous Tests of Superiority and Noninferiority with Multiple Endpoints
Wenfeng Chen, Naiqing Zhao, Guoyou Qin, Jie Chen
Jul 04 2023 stat.ME stat.AP arXiv:2307.00189v2

@misc{2307.00189, author = {Wenfeng Chen and Naiqing Zhao and Guoyou Qin and Jie Chen}, title = {{A} {D}irect {A}pproach to {S}imultaneous {T}ests of {S}uperiority and {N}oninferiority with {M}ultiple {E}ndpoints}, year = {2023}, eprint = {2307.00189}, note = {arXiv:2307.00189v2} }
PDF
Simultaneous tests of superiority and non-inferiority hypotheses on multiple endpoints are often performed in clinical trials to demonstrate that a new treatment is superior to a control on at least one endpoint and non-inferior on the remaining endpoints. Existing methods tackle this problem by testing the superiority and non-inferiority hypotheses separately, controlling the Type I error rate of each at level $\alpha$. In this paper we propose a unified approach to testing the superiority and non-inferiority hypotheses simultaneously. The proposed approach is based on the UI-IU test and the least favorable configurations of the combined superiority and non-inferiority hypotheses, which leads to the solution of an adjusted significance level $\alpha'$ for marginal tests that controls the overall Type I error rate at pre-defined $\alpha$. Simulations show that the proposed approach maintains a higher power than existing methods in the settings under investigation. Since the adjusted significance level $\alpha'$ is obtained by controlling the Type I error rate at $\alpha$, one can easily construct the exact $100(1 - \alpha)\%$ simultaneous confidence intervals for treatment effects on all endpoints. The proposed approach is illustrated with two real examples.
Estimands in Real-World Evidence Studies
Jie Chen, Daniel Scharfstein, Hongwei Wang, Binbing Yu, Yang Song, Weili He, John Scott, Xiwu Lin, Hana Lee
Jul 04 2023 stat.AP arXiv:2307.00190v1

@misc{2307.00190, author = {Jie Chen and Daniel Scharfstein and Hongwei Wang and Binbing Yu and Yang Song and Weili He and John Scott and Xiwu Lin and Hana Lee}, title = {{E}stimands in {R}eal-{W}orld {E}vidence {S}tudies}, year = {2023}, eprint = {2307.00190}, note = {arXiv:2307.00190v1} }
PDF
A Real-World Evidence (RWE) Scientific Working Group (SWG) of the American Statistical Association Biopharmaceutical Section (ASA BIOP) has been reviewing statistical considerations for the generation of RWE to support regulatory decision-making. As part of the effort, the working group is addressing estimands in RWE studies. Constructing the right estimand -- the target of estimation -- which reflects the research question and the study objective, is one of the key components in formulating a clinical study. ICH E9(R1) describes statistical principles for constructing estimands in clinical trials with a focus on five attributes -- population, treatment, endpoints, intercurrent events, and population-level summary. However, defining estimands for clinical studies using real-world data (RWD), i.e., RWE studies, requires additional considerations due to, for example, heterogeneity of study population, complexity of treatment regimes, different types and patterns of intercurrent events, and complexities in choosing study endpoints. This paper reviews the essential components of estimands and causal inference framework, discusses considerations in constructing estimands for RWE studies, highlights similarities and differences in traditional clinical trial and RWE study estimands, and provides a roadmap for choosing appropriate estimands for RWE studies.
Data Structures for Density Estimation
Anders Aamand, Alexandr Andoni, Justin Y. Chen, Piotr Indyk, Shyam Narayanan, Sandeep Silwal
Jun 21 2023 cs.DS cs.LG stat.ML arXiv:2306.11312v1

@misc{2306.11312, author = {Anders Aamand and Alexandr Andoni and Justin Y.~Chen and Piotr Indyk and Shyam Narayanan and Sandeep Silwal}, title = {{D}ata {S}tructures for {D}ensity {E}stimation}, year = {2023}, eprint = {2306.11312}, note = {arXiv:2306.11312v1} }
PDF
We study statistical/computational tradeoffs for the following density estimation problem: given $k$ distributions $v_1, \ldots, v_k$ over a discrete domain of size $n$, and sampling access to a distribution $p$, identify $v_i$ that is "close" to $p$. Our main result is the first data structure that, given a sublinear (in $n$) number of samples from $p$, identifies $v_i$ in time sublinear in $k$. We also give an improved version of the algorithm of Acharya et al. (2018) that reports $v_i$ in time linear in $k$. The experimental evaluation of the latter algorithm shows that it achieves a significant reduction in the number of operations needed to achieve a given accuracy compared to prior work.
A Gromov--Wasserstein Geometric View of Spectrum-Preserving Graph Coarsening
Yifan Chen, Rentian Yao, Yun Yang, Jie Chen
Jun 16 2023 cs.LG cs.AI stat.CO stat.ML arXiv:2306.08854v1

@misc{2306.08854, author = {Yifan Chen and Rentian Yao and Yun Yang and Jie Chen}, title = {{A} {G}romov--{W}asserstein {G}eometric {V}iew of {S}pectrum-{P}reserving {G}raph {C}oarsening}, year = {2023}, eprint = {2306.08854}, note = {arXiv:2306.08854v1} }
PDF
Graph coarsening is a technique for solving large-scale graph problems by working on a smaller version of the original graph, and possibly interpolating the results back to the original graph. It has a long history in scientific computing and has recently gained popularity in machine learning, particularly in methods that preserve the graph spectrum. This work studies graph coarsening from a different perspective, developing a theory for preserving graph distances and proposing a method to achieve this. The geometric approach is useful when working with a collection of graphs, such as in graph classification and regression. In this study, we consider a graph as an element on a metric space equipped with the Gromov--Wasserstein (GW) distance, and bound the difference between the distance of two graphs and their coarsened versions. Minimizing this difference can be done using the popular weighted kernel $K$-means method, which improves existing spectrum-preserving methods with the proper choice of the kernel. The study includes a set of experiments to support the theory and method, including approximating the GW distance, preserving the graph spectrum, classifying graphs using spectral information, and performing regression using graph convolutional networks. Code is available at https://github.com/ychen-stat-ml/GW-Graph-Coarsening .
Learning under Selective Labels with Data from Heterogeneous Decision-makers: An Instrumental Variable Approach
Jian Chen, Zhehao Li, Xiaojie Mao
Jun 14 2023 stat.ML cs.LG arXiv:2306.07566v2

@misc{2306.07566, author = {Jian Chen and Zhehao Li and Xiaojie Mao}, title = {{L}earning under {S}elective {L}abels with {D}ata from {H}eterogeneous {D}ecision-makers: {A}n {I}nstrumental {V}ariable {A}pproach}, year = {2023}, eprint = {2306.07566}, note = {arXiv:2306.07566v2} }
PDF
We study the problem of learning with selectively labeled data, which arises when outcomes are only partially labeled due to historical decision-making. The labeled data distribution may substantially differ from the full population, especially when the historical decisions and the target outcome can be simultaneously affected by some unobserved factors. Consequently, learning with only the labeled data may lead to severely biased results when deployed to the full population. Our paper tackles this challenge by exploiting the fact that in many applications the historical decisions were made by a set of heterogeneous decision-makers. In particular, we analyze this setup in a principled instrumental variable (IV) framework. We establish conditions for the full-population risk of any given prediction rule to be point-identified from the observed data and provide sharp risk bounds when the point identification fails. We further propose a weighted learning approach that learns prediction rules robust to the label selection bias in both identification settings. Finally, we apply our proposed approach to a semi-synthetic financial dataset and demonstrate its superior performance in the presence of selection bias.
Causal Inference With Outcome-Dependent Missingness And Self-Censoring
Jacob M Chen, Daniel Malinsky, Rohit Bhattacharya
Jun 12 2023 stat.ME arXiv:2306.05511v1

@misc{2306.05511, author = {Jacob M Chen and Daniel Malinsky and Rohit Bhattacharya}, title = {{C}ausal {I}nference {W}ith {O}utcome-{D}ependent {M}issingness {A}nd {S}elf-{C}ensoring}, year = {2023}, eprint = {2306.05511}, note = {arXiv:2306.05511v1} }
PDF
We consider missingness in the context of causal inference when the outcome of interest may be missing. If the outcome directly affects its own missingness status, i.e., it is "self-censoring", this may lead to severely biased causal effect estimates. Miao et al. [2015] proposed the shadow variable method to correct for bias due to self-censoring; however, verifying the required model assumptions can be difficult. Here, we propose a test based on a randomized incentive variable offered to encourage reporting of the outcome that can be used to verify identification assumptions that are sufficient to correct for both self-censoring and confounding bias. Concretely, the test confirms whether a given set of pre-treatment covariates is sufficient to block all backdoor paths between the treatment and outcome as well as all paths between the treatment and missingness indicator after conditioning on the outcome. We show that under these conditions, the causal effect is identified by using the treatment as a shadow variable, and it leads to an intuitive inverse probability weighting estimator that uses a product of the treatment and response weights. We evaluate the efficacy of our test and downstream estimator via simulations.
On the Linear Convergence of Policy Gradient under Hadamard Parameterization
Jiacai Liu, Jinchi Chen, Ke Wei
Jun 01 2023 math.OC cs.LG stat.ML arXiv:2305.19575v2

@misc{2305.19575, author = {Jiacai Liu and Jinchi Chen and Ke Wei}, title = {{O}n the {L}inear {C}onvergence of {P}olicy {G}radient under {H}adamard {P}arameterization}, year = {2023}, eprint = {2305.19575}, note = {arXiv:2305.19575v2} }
PDF
The convergence of deterministic policy gradient under the Hadamard parameterization is studied in the tabular setting and the linear convergence of the algorithm is established. To this end, we first show that the error decreases at an $O(\frac{1}{k})$ rate for all the iterations. Based on this result, we further show that the algorithm has a faster local linear convergence rate after $k_0$ iterations, where $k_0$ is a constant that only depends on the MDP problem and the initialization. To show the local linear convergence of the algorithm, we have indeed established the contraction of the sub-optimal probability $b_s^k$ (i.e., the probability of the output policy $\pi^k$ on non-optimal actions) when $k\ge k_0$.
GC-Flow: A Graph-Based Flow Network for Effective Clustering
Tianchun Wang, Farzaneh Mirzazadeh, Xiang Zhang, Jie Chen
May 30 2023 cs.LG stat.ML arXiv:2305.17284v1

@misc{2305.17284, author = {Tianchun Wang and Farzaneh Mirzazadeh and Xiang Zhang and Jie Chen}, title = {{GC}-{F}low: {A} {G}raph-{B}ased {F}low {N}etwork for {E}ffective {C}lustering}, year = {2023}, eprint = {2305.17284}, note = {arXiv:2305.17284v1} }
PDF
Graph convolutional networks (GCNs) are \emph{discriminative} models that directly model the class posterior $p(y|\mathbf{x})$ for semi-supervised classification of graph data. While being effective, as a representation learning approach, the node representations extracted from a GCN often miss useful information for effective clustering, because the objectives are different. In this work, we design normalizing flows that replace GCN layers, leading to a \emph{generative} model that models both the class conditional likelihood $p(\mathbf{x}|y)$ and the class prior $p(y)$. The resulting neural network, GC-Flow, retains the graph convolution operations while being equipped with a Gaussian mixture representation space. It enjoys two benefits: it not only maintains the predictive power of GCN, but also produces well-separated clusters, due to the structuring of the representation space. We demonstrate these benefits on a variety of benchmark data sets. Moreover, we show that additional parameterization, such as that on the adjacency matrix used for graph convolutions, yields additional improvement in clustering.
Interpretable Machine Learning based on Functional ANOVA Framework: Algorithms and Comparisons
Linwei Hu, Vijayan N. Nair, Agus Sudjianto, Aijun Zhang, Jie Chen
May 26 2023 stat.ML cs.LG arXiv:2305.15670v1

@misc{2305.15670, author = {Linwei Hu and Vijayan N.~Nair and Agus Sudjianto and Aijun Zhang and Jie Chen}, title = {{I}nterpretable {M}achine {L}earning based on {F}unctional {ANOVA} {F}ramework: {A}lgorithms and {C}omparisons}, year = {2023}, eprint = {2305.15670}, note = {arXiv:2305.15670v1} }
PDF
In the early days of machine learning (ML), the emphasis was on developing complex algorithms to achieve best predictive performance. To understand and explain the model results, one had to rely on post hoc explainability techniques, which are known to have limitations. Recently, with the recognition that interpretability is just as important, researchers are compromising on small increases in predictive performance to develop algorithms that are inherently interpretable. While doing so, the ML community has rediscovered the use of low-order functional ANOVA (fANOVA) models that have been known in the statistical literature for some time. This paper starts with a description of challenges with post hoc explainability and reviews the fANOVA framework with a focus on main effects and second-order interactions. This is followed by an overview of two recently developed techniques: Explainable Boosting Machines or EBM (Lou et al., 2013) and GAMI-Net (Yang et al., 2021b). The paper proposes a new algorithm, called GAMI-Lin-T, that also uses trees like EBM, but it does linear fits instead of piecewise constants within the partitions. There are many other differences, including the development of a new interaction filtering algorithm. Finally, the paper uses simulated and real datasets to compare selected ML algorithms. The results show that GAMI-Lin-T and GAMI-Net have comparable performances, and both are generally better than EBM.
The Structurally Complex with Additive Parent Causality (SCARY) Dataset
Jarry Chen, Haytham M. Fayek
Apr 28 2023 stat.ML cs.LG arXiv:2304.14109v1

@misc{2304.14109, author = {Jarry Chen and Haytham M.~Fayek}, title = {{T}he {S}tructurally {C}omplex with {A}dditive {P}arent {C}ausality ({SCARY}) {D}ataset}, year = {2023}, eprint = {2304.14109}, note = {arXiv:2304.14109v1} }
PDF
Causal datasets play a critical role in advancing the field of causality. However, existing datasets often lack the complexity of real-world issues such as selection bias, unfaithful data, and confounding. To address this gap, we propose a new synthetic causal dataset, the Structurally Complex with Additive paRent causalitY (SCARY) dataset, which includes the following features. The dataset comprises 40 scenarios, each generated with three different seeds, allowing researchers to leverage relevant subsets of the dataset. Additionally, we use two different data generation mechanisms for generating the causal relationship between parents and child nodes, including linear and mixed causal mechanisms with multiple sub-types. Our dataset generator is inspired by the Causal Discovery Toolbox and generates only additive models. The dataset has a Varsortability of 0.5. Our SCARY dataset provides a valuable resource for researchers to explore causal discovery under more realistic scenarios. The dataset is available at https://github.com/JayJayc/SCARY.
Fair Grading Algorithms for Randomized Exams
Jiale Chen, Jason Hartline, Onno Zoeter
Apr 14 2023 stat.ML cs.GT arXiv:2304.06254v1

@misc{2304.06254, author = {Jiale Chen and Jason Hartline and Onno Zoeter}, title = {{F}air {G}rading {A}lgorithms for {R}andomized {E}xams}, year = {2023}, eprint = {2304.06254}, note = {arXiv:2304.06254v1} }
PDF
This paper studies grading algorithms for randomized exams. In a randomized exam, each student is asked a small number of random questions from a large question bank. The predominant grading rule is simple averaging, i.e., calculating grades by averaging scores on the questions each student is asked, which is fair ex-ante, over the randomized questions, but not fair ex-post, on the realized questions. The fair grading problem is to estimate the average grade of each student on the full question bank. The maximum-likelihood estimator for the Bradley-Terry-Luce model on the bipartite student-question graph is shown to be consistent with high probability when the number of questions asked to each student is at least the cubed-logarithm of the number of students. In an empirical study on exam data and in simulations, our algorithm based on the maximum-likelihood estimator significantly outperforms simple averaging in prediction accuracy and ex-post fairness even with a small class and exam size.
Data-driven multinomial random forest
Junhao Chen, Xueli Wang
Apr 11 2023 stat.ML cs.LG arXiv:2304.04240v1

@misc{2304.04240, author = {Junhao Chen and Xueli Wang}, title = {{D}ata-driven multinomial random forest}, year = {2023}, eprint = {2304.04240}, note = {arXiv:2304.04240v1} }
PDF
In this article, we strengthen the proof methods of some previously weakly consistent variants of random forests into strongly consistent proof methods, and improve the data utilization of these variants, in order to obtain better theoretical properties and experimental performance. In addition, based on the multinomial random forest (MRF) and Bernoulli random forest (BRF), we propose a data-driven multinomial random forest (DMRF) algorithm, which has lower complexity than MRF and higher complexity than BRF while satisfying strong consistency. It has better performance in classification and regression problems than previous RF variants that only satisfy weak consistency, and in most cases even surpasses standard random forest. To the best of our knowledge, DMRF is currently the best strongly consistent RF variant with low algorithmic complexity.
Coskewness under dependence uncertainty
Carole Bernard, Jinghui Chen, Ludger Ruschendorf, Steven Vanduffel
Mar 31 2023 math.ST math.PR q-fin.PM q-fin.ST stat.TH arXiv:2303.17266v1

@misc{2303.17266, author = {Carole Bernard and Jinghui Chen and Ludger Ruschendorf and Steven Vanduffel}, title = {{C}oskewness under dependence uncertainty}, year = {2023}, eprint = {2303.17266}, note = {arXiv:2303.17266v1} }
PDF
We study the impact of dependence uncertainty on the expectation of the product of $d$ random variables, $\mathbb{E}(X_1X_2\cdots X_d)$ when $X_i \sim F_i$ for all $i$. Under some conditions on the $F_i$, explicit sharp bounds are obtained and a numerical method is provided to approximate them for arbitrary choices of the $F_i$. The results are applied to assess the impact of dependence uncertainty on coskewness. In this regard, we introduce a novel notion of "standardized rank coskewness," which is invariant under strictly increasing transformations and takes values in $[-1,\ 1]$.
Global Consistency of Empirical Likelihood
Haodi Liang, Jiahua Chen
Mar 30 2023 math.ST stat.TH arXiv:2303.16410v1

@misc{2303.16410, author = {Haodi Liang and Jiahua Chen}, title = {{G}lobal {C}onsistency of {E}mpirical {L}ikelihood}, year = {2023}, eprint = {2303.16410}, note = {arXiv:2303.16410v1} }
PDF
This paper develops several interesting, significant, and interconnected approaches to nonparametric or semi-parametric statistical inferences. The overwhelmingly favoured maximum likelihood estimator (MLE) under parametric model is renowned for its strong consistency and optimality generally credited to Cramer. These properties, however, falter when the model is not regular or not completely accurate. In addition, their applicability is limited to local maxima close to the unknown true parameter value. One must therefore ascertain that the global maximum of the likelihood is strongly consistent under generic conditions (Wald, 1949). Global consistency is also a vital research problem in the context of empirical likelihood (Owen, 2001). The EL is a ground-breaking platform for nonparametric statistical inference. A subsequent milestone is achieved by placing estimating functions under the EL umbrella (Qin and Lawless, 1994). The resulting profile EL function possesses many nice properties of parametric likelihood but also shares the same shortcomings. These properties cannot be utilized unless we know the local maximum at hand is close to the unknown true parameter value. To overcome this obstacle, we first put forward a clean set of conditions under which the global maximum is consistent. We then develop a global maximum test to ascertain if the local maximum at hand is in fact a global maximum. Furthermore, we invent a global maximum remedy to ensure global consistency by expanding the set of estimating functions under EL. Our simulation experiments on many examples from the literature firmly establish that the proposed approaches work as predicted. Our approaches also provide superior solutions to problems of their parametric counterparts investigated by DeHaan (1981), Veall (1991), and Gan and Jiang (1999).
Functional-Coefficient Quantile Regression for Panel Data with Latent Group Structure
Xiaorong Yang, Jia Chen, Degui Li, Runze Li
Mar 24 2023 econ.EM stat.ME arXiv:2303.13218v1

@misc{2303.13218, author = {Xiaorong Yang and Jia Chen and Degui Li and Runze Li}, title = {{F}unctional-{C}oefficient {Q}uantile {R}egression for {P}anel {D}ata with {L}atent {G}roup {S}tructure}, year = {2023}, eprint = {2303.13218}, note = {arXiv:2303.13218v1} }
PDF
This paper considers estimating functional-coefficient models in panel quantile regression with individual effects, allowing the cross-sectional and temporal dependence for large panel observations. A latent group structure is imposed on the heterogeneous quantile regression models so that the number of nonparametric functional coefficients to be estimated can be reduced considerably. With the preliminary local linear quantile estimates of the subject-specific functional coefficients, a classic agglomerative clustering algorithm is used to estimate the unknown group structure and an easy-to-implement ratio criterion is proposed to determine the group number. The estimated group number and structure are shown to be consistent. Furthermore, a post-grouping local linear smoothing method is introduced to estimate the group-specific functional coefficients, and the relevant asymptotic normal distribution theory is derived with a normalisation rate comparable to that in the literature. The developed methodologies and theory are verified through a simulation study and showcased with an application to house price data from UK local authority districts, which reveals different homogeneity structures at different quantile levels.
A New Covariate Selection Strategy for High Dimensional Data in Causal Effect Estimation with Multivariate Treatments
Juan Chen, Yingchun Zhou
Mar 20 2023 stat.ME arXiv:2303.09766v1

@misc{2303.09766, author = {Juan Chen and Yingchun Zhou}, title = {{A} {N}ew {C}ovariate {S}election {S}trategy for {H}igh {D}imensional {D}ata in {C}ausal {E}ffect {E}stimation with {M}ultivariate {T}reatments}, year = {2023}, eprint = {2303.09766}, note = {arXiv:2303.09766v1} }
PDF
Selection of covariates is crucial in the estimation of average treatment effects given observational data with high or even ultra-high dimensional pretreatment variables. Existing methods for this problem typically assume sparse linear models for both outcome and univariate treatment, and cannot handle situations with ultra-high dimensional covariates. In this paper, we propose a new covariate selection strategy called double screening prior adaptive lasso (DSPAL) to select confounders and predictors of the outcome for multivariate treatments, which combines the adaptive lasso method with the marginal conditional (in)dependence prior information to select target covariates, in order to eliminate confounding bias and improve statistical efficiency. The distinctive features of our proposal are that it can be applied to high-dimensional or even ultra-high dimensional covariates for multivariate treatments, and can deal with the cases of both parametric and nonparametric outcome models, which makes it more robust compared to other methods. Our theoretical analyses show that the proposed procedure enjoys the sure screening property, the ranking consistency property and the variable selection consistency. Through a simulation study, we demonstrate that the proposed approach selects all confounders and predictors consistently and estimates the multivariate treatment effects with smaller bias and mean squared error compared to several alternatives under various scenarios. In real data analysis, the method is applied to estimate the causal effect of a three-dimensional continuous environmental treatment on cholesterol level and enlightening results are obtained.
GLASU: A Communication-Efficient Algorithm for Federated Learning with Vertically Distributed Graph Data
Xinwei Zhang, Mingyi Hong, Jie Chen
Mar 17 2023 cs.LG stat.ML arXiv:2303.09531v1

@misc{2303.09531, author = {Xinwei Zhang and Mingyi Hong and Jie Chen}, title = {{GLASU}: {A} {C}ommunication-{E}fficient {A}lgorithm for {F}ederated {L}earning with {V}ertically {D}istributed {G}raph {D}ata}, year = {2023}, eprint = {2303.09531}, note = {arXiv:2303.09531v1} }
PDF
Vertical federated learning (VFL) is a distributed learning paradigm, where computing clients collectively train a model based on the partial features of the same set of samples they possess. Current research on VFL focuses on the case when samples are independent, but it rarely addresses an emerging scenario when samples are interrelated through a graph. For graph-structured data, graph neural networks (GNNs) are competitive machine learning models, but a naive implementation in the VFL setting causes a significant communication overhead. Moreover, the analysis of the training is faced with a challenge caused by the biased stochastic gradients. In this paper, we propose a model splitting method that splits a backbone GNN across the clients and the server and a communication-efficient algorithm, GLASU, to train such a model. GLASU adopts lazy aggregation and stale updates to skip aggregation when evaluating the model and skip feature exchanges during training, greatly reducing communication. We offer a theoretical analysis and conduct extensive numerical experiments on real-world datasets, showing that the proposed algorithm effectively trains a GNN model, whose performance matches that of the backbone GNN when trained in a centralized manner.
Mean-variance constrained priors have finite maximum Bayes risk in the normal location model
Jiafeng Chen
Mar 16 2023 math.ST econ.EM econ.TH stat.TH arXiv:2303.08653v1

@misc{2303.08653, author = {Jiafeng Chen}, title = {{M}ean-variance constrained priors have finite maximum {B}ayes risk in the normal location model}, year = {2023}, eprint = {2303.08653}, note = {arXiv:2303.08653v1} }
PDF
Consider a normal location model $X \mid \theta \sim N(\theta, \sigma^2)$ with known $\sigma^2$. Suppose $\theta \sim G_0$, where the prior $G_0$ has zero mean and unit variance. Let $G_1$ be a possibly misspecified prior with zero mean and unit variance. We show that the squared error Bayes risk of the posterior mean under $G_1$ is bounded, uniformly over $G_0, G_1, \sigma^2 > 0$.
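As a hedged illustration of the quantity being bounded (a textbook special case, not the paper's argument), assume both priors are standard normal, $G_0 = G_1 = N(0, 1)$. The posterior mean and its squared error Bayes risk then have closed forms, and the risk stays below 1 for every $\sigma^2 > 0$:

```latex
% Special case assumed here for illustration: G_0 = G_1 = N(0, 1).
\begin{align*}
  \theta \sim N(0, 1), \quad X \mid \theta \sim N(\theta, \sigma^2)
  \;&\Longrightarrow\;
  \mathbb{E}[\theta \mid X] = \frac{X}{1 + \sigma^2}, \\
  \mathbb{E}\bigl[(\mathbb{E}[\theta \mid X] - \theta)^2\bigr]
  &= \operatorname{Var}(\theta \mid X)
   = \frac{\sigma^2}{1 + \sigma^2} < 1 .
\end{align*}
```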
Weighted Euclidean balancing for a matrix exposure in estimating causal effect
Juan Chen, Yingchun Zhou
Mar 14 2023 stat.ME arXiv:2303.06812v1

@misc{2303.06812, author = {Juan Chen and Yingchun Zhou}, title = {{W}eighted {E}uclidean balancing for a matrix exposure in estimating causal effect}, year = {2023}, eprint = {2303.06812}, note = {arXiv:2303.06812v1} }
PDF
In many scientific fields such as biology, psychology and sociology, there is an increasing interest in estimating the causal effect of a matrix exposure on an outcome. Covariate balancing is crucial in causal inference and both exact balancing and approximate balancing methods have been proposed in the past decades. However, due to the large number of constraints, it is difficult to achieve exact balance or to select the threshold parameters for approximate balancing methods when the treatment is a matrix. To meet these challenges, we propose the weighted Euclidean balancing method, which approximately balances covariates from an overall perspective. This method is also applicable to scenarios with high-dimensional covariates. Both parametric and nonparametric methods are proposed to estimate the causal effect of matrix treatment and theoretical properties of the two estimators are provided. Furthermore, the simulation results show that the proposed method outperforms other methods in various cases. Finally, the method is applied to investigating the causal relationship between children's participation in various training courses and their IQ. The results show that the duration of attending hands-on practice courses for children aged 6-9 has a significantly positive impact on children's IQ.
Quantized Low-Rank Multivariate Regression with Random Dithering
Junren Chen, Yueqi Wang, Michael K. Ng
Feb 23 2023 stat.ML cs.LG eess.SP arXiv:2302.11197v3

@misc{2302.11197, author = {Junren Chen and Yueqi Wang and Michael K.~Ng}, title = {{Q}uantized {L}ow-{R}ank {M}ultivariate {R}egression with {R}andom {D}ithering}, year = {2023}, eprint = {2302.11197}, note = {arXiv:2302.11197v3} }
PDF
Low-rank multivariate regression (LRMR) is an important statistical learning model that combines highly correlated tasks as a multiresponse regression problem with a low-rank prior on the coefficient matrix. In this paper, we study quantized LRMR, a practical setting where the responses and/or the covariates are discretized to finite precision. We focus on the estimation of the underlying coefficient matrix. To make possible a consistent estimator that can achieve arbitrarily small error, we employ uniform quantization with random dithering, i.e., we add appropriate random noise to the data before quantization. Specifically, uniform dither and triangular dither are used for responses and covariates, respectively. Based on the quantized data, we propose the constrained Lasso and regularized Lasso estimators, and derive the non-asymptotic error bounds. With the aid of dithering, the estimators achieve the minimax optimal rate, while quantization only slightly worsens the multiplicative factor in the error rate. Moreover, we extend our results to a low-rank regression model with matrix responses. We corroborate and demonstrate our theoretical results via simulations on synthetic data or image restoration.
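The dithering idea in this abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' code: it assumes a midpoint uniform quantizer with bin width `delta`, a uniform dither on `[-delta/2, delta/2]`, and a triangular dither formed as the sum of two such independent uniforms. All function names are hypothetical. It shows that the dithered quantizer is unbiased on average, while the plain quantizer always returns the same bin midpoint.

```python
import math
import random


def uniform_quantize(x: float, delta: float) -> float:
    """Map x to the midpoint of its length-delta bin (assumed quantizer form)."""
    return delta * (math.floor(x / delta) + 0.5)


def dithered_quantize(x: float, delta: float, rng: random.Random,
                      triangular: bool = False) -> float:
    """Add random dither to x, then quantize.

    Uniform dither: tau ~ U[-delta/2, delta/2].
    Triangular dither (assumed form): sum of two independent uniform dithers.
    """
    tau = rng.uniform(-delta / 2, delta / 2)
    if triangular:
        tau += rng.uniform(-delta / 2, delta / 2)
    return uniform_quantize(x + tau, delta)


def mean_dithered(x: float, delta: float, n: int,
                  triangular: bool = False, seed: int = 0) -> float:
    """Average n dithered quantizations of the same value x."""
    rng = random.Random(seed)
    total = sum(dithered_quantize(x, delta, rng, triangular) for _ in range(n))
    return total / n


if __name__ == "__main__":
    x, delta = 0.3, 1.0
    # Without dither the output is always the bin midpoint, so the error never averages out.
    print("plain quantizer:", uniform_quantize(x, delta))
    # With dither the average over many draws approaches x itself.
    print("uniform dither mean:", mean_dithered(x, delta, 20000))
    print("triangular dither mean:", mean_dithered(x, delta, 20000, triangular=True))
```

Averaging is used here only to make the unbiasedness visible; an estimator built on quantized data would see one dithered sample per observation.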