-
Collapsible Kernel Machine Regression for Exposomic Analyses
Authors:
Glen McGee,
Brent A. Coull,
Ander Wilson
Abstract:
An important goal of environmental epidemiology is to quantify the complex health risks posed by a wide array of environmental exposures. In analyses focusing on a smaller number of exposures within a mixture, flexible models like Bayesian kernel machine regression (BKMR) are appealing because they allow for non-linear and non-additive associations among mixture components. However, this flexibility comes at the cost of low power and difficult interpretation, particularly in exposomic analyses when the number of exposures is large. We propose a flexible framework that allows for separate selection of additive and non-additive effects, unifying additive models and kernel machine regression. The proposed approach yields increased power and simpler interpretation when there is little evidence of interaction. Further, it allows users to specify separate priors for additive and non-additive effects, and allows for tests of non-additive interaction. We extend the approach to the class of multiple index models, in which the special case of kernel machine-distributed lag models are nested. We apply the method to motivating data from a subcohort of the Human Early Life Exposome (HELIX) study containing 65 mixture components grouped into 13 distinct exposure classes.
Submitted 26 September, 2024;
originally announced September 2024.
-
A Bayesian Spatial Berkson error approach to estimate small area opioid mortality rates accounting for population-at-risk uncertainty
Authors:
Emily N Peterson,
Rachel C. Nethery,
Jarvis T. Chen,
Loni P. Tabb,
Brent A. Coull,
Frederic B. Piel,
Lance A Waller
Abstract:
Monitoring small-area geographical population trends in opioid mortality has large-scale implications for informing preventative resource allocation. A common approach to obtain small area estimates of opioid mortality is to use a standard disease mapping approach in which population-at-risk estimates are treated as fixed and known. Assuming fixed populations ignores the uncertainty surrounding small area population estimates, which may bias risk estimates and under-estimate their associated uncertainties. We present a Bayesian Spatial Berkson Error (BSBE) model to incorporate population-at-risk uncertainty within a disease mapping model. We compare the BSBE approach to the naive approach (treating denominators as fixed) using simulation studies to illustrate potential bias resulting from this assumption. We show the application of the BSBE model to obtain 2020 opioid mortality risk estimates for 159 counties in Georgia accounting for population-at-risk uncertainty. Utilizing our proposed approach will help to inform interventions in opioid related public health responses, policies, and resource allocation. Additionally, we provide a general framework to improve the estimation and mapping of health indicators.
Submitted 20 December, 2023;
originally announced December 2023.
-
Impacts of Census Differential Privacy for Small-Area Disease Mapping to Monitor Health Inequities
Authors:
Yanran Li,
Brent A. Coull,
Nancy Krieger,
Emily Peterson,
Lance A. Waller,
Jarvis T. Chen,
Rachel C. Nethery
Abstract:
The US Census Bureau will implement a new privacy-preserving disclosure avoidance system (DAS), which includes application of differential privacy, on the public-release 2020 census data. There are concerns that the DAS may bias small-area and demographically-stratified population counts, which play a critical role in public health research and policy, serving as denominators in estimation of disease/mortality rates. Employing three DAS demonstration products, we quantify errors attributable to reliance on DAS-protected denominators in standard small-area disease mapping models for characterizing health inequities. We conduct simulation studies and real data analyses of inequities in premature mortality at the census tract level in Massachusetts. Results show that overall patterns of inequity by racialized group and economic deprivation level are not compromised by the DAS. While early versions of DAS induce errors in mortality rate estimation that are larger for Black than for non-Hispanic white populations, this issue is ameliorated in newer DAS versions.
Submitted 29 March, 2023; v1 submitted 9 September, 2022;
originally announced September 2022.
-
Towards a Unified Framework for Uncertainty-aware Nonlinear Variable Selection with Theoretical Guarantees
Authors:
Wenying Deng,
Beau Coker,
Rajarshi Mukherjee,
Jeremiah Zhe Liu,
Brent A. Coull
Abstract:
We develop a simple and unified framework for nonlinear variable selection that incorporates uncertainty in the prediction function and is compatible with a wide range of machine learning models (e.g., tree ensembles, kernel methods, and neural networks). In particular, for a learned nonlinear model $f(\mathbf{x})$, we consider quantifying the importance of an input variable $\mathbf{x}^j$ using the integrated partial derivative $\Psi_j = \Vert \frac{\partial}{\partial \mathbf{x}^j} f(\mathbf{x})\Vert^2_{P_\mathcal{X}}$. We then (1) provide a principled approach for quantifying variable selection uncertainty by deriving its posterior distribution, and (2) show that the approach is generalizable even to non-differentiable models such as tree ensembles. Rigorous Bayesian nonparametric theorems are derived to guarantee the posterior consistency and asymptotic uncertainty of the proposed approach. Extensive simulations and experiments on healthcare benchmark datasets confirm that the proposed algorithm outperforms existing classic and recent variable selection methods.
Submitted 27 May, 2022; v1 submitted 14 April, 2022;
originally announced April 2022.
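The integrated partial derivative named in the abstract above can be illustrated with a minimal sketch. This is not the authors' algorithm: it only computes an empirical point estimate of the squared-derivative norm by averaging central finite differences over a sample, and it does not derive the posterior distribution the abstract describes. The names `integrated_partial_derivative` and `toy_model`, the step size `eps`, and the simulated inputs are all assumptions made for illustration.

```python
import numpy as np

def integrated_partial_derivative(f, X, j, eps=1e-5):
    """Empirical estimate of Psi_j: the mean squared partial derivative
    of f with respect to input j, averaged over the rows of X.
    Derivatives are approximated by central finite differences."""
    X = np.asarray(X, dtype=float)
    step = np.zeros(X.shape[1])
    step[j] = eps
    grad_j = (f(X + step) - f(X - step)) / (2.0 * eps)
    return float(np.mean(grad_j ** 2))

def toy_model(X):
    """Hypothetical fitted model: linear in x0, nonlinear in x1, ignores x2."""
    return 3.0 * X[:, 0] + np.sin(X[:, 1])

rng = np.random.default_rng(0)
X_sample = rng.normal(size=(500, 3))  # stand-in for draws from P_X

psi = [integrated_partial_derivative(toy_model, X_sample, j) for j in range(3)]
print(psi)
```

In this toy setting the estimate for the ignored input is zero, which is the property that makes the quantity usable as a variable-importance score; a finite-difference approximation of this kind is only meaningful for smooth models.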
-
Integrating Biological Knowledge in Kernel-Based Analyses of Environmental Mixtures and Health
Authors:
Glen McGee,
Ander Wilson,
Brent A Coull,
Thomas F Webster
Abstract:
A key goal of environmental health research is to assess the risk posed by mixtures of pollutants. As epidemiologic studies of mixtures can be expensive to conduct, it behooves researchers to incorporate prior knowledge about mixtures into their analyses. This work extends the Bayesian multiple index model (BMIM), which assumes the exposure-response function is a non-parametric function of a set of linear combinations of pollutants formed with a set of exposure-specific weights. The framework is attractive because it combines the flexibility of response-surface methods with the interpretability of linear index models. We propose three strategies to incorporate prior toxicological knowledge into construction of indices in a BMIM: (a) constraining index weights, (b) structuring index weights by exposure transformations, and (c) placing informative priors on the index weights. We propose a novel prior specification that combines spike-and-slab variable selection with informative Dirichlet distribution based on relative potency factors often derived from previous toxicological studies. In simulations we show that the proposed priors improve inferences when prior information is correct and can protect against misspecification suffered by naive toxicological models when prior information is incorrect. Moreover, different strategies may be mixed-and-matched for different indices to suit available information (or lack thereof). We demonstrate the proposed methods on an analysis of data from the National Health and Nutrition Examination Survey and incorporate prior information on relative chemical potencies obtained from toxic equivalency factors available in the literature.
Submitted 31 March, 2022;
originally announced April 2022.
-
Multivariate cluster point process to quantify and explore multi-entity configurations: Application to biofilm image data
Authors:
Suman Majumder,
Brent A. Coull,
Jessica L. Mark Welch,
Patrick J. La Riviere,
Floyd E. Dewhirst,
Jacqueline R. Starr,
Kyu Ha Lee
Abstract:
Clusters of similar or dissimilar objects are encountered in many fields. Frequently used approaches treat the central object of each cluster as latent. Yet, often objects of one or more types cluster around objects of another type. Such arrangements are common in biomedical images of cells, in which nearby cell types likely interact. Quantifying spatial relationships may elucidate biological mechanisms. Parent-offspring statistical frameworks can be usefully applied even when central objects (parents) differ from peripheral ones (offspring). We propose the novel multivariate cluster point process (MCPP) to quantify multi-object (e.g., multi-cellular) arrangements. Unlike commonly used approaches, the MCPP exploits locations of the central parent object in clusters. It accounts for possibly multilayered, multivariate clustering. The model formulation requires specification of which object types function as cluster centers and which reside peripherally. If such information is unknown, the relative roles of object types may be explored by comparing fit of different models via the deviance information criterion (DIC). In simulated data, we compared DIC of a series of models; the MCPP correctly identified simulated relationships. It also produced more accurate and precise parameter estimates than the classical univariate Neyman-Scott process model. We also used the MCPP to quantify proposed configurations and explore new ones in human dental plaque biofilm image data. MCPP models quantified simultaneous clustering of Streptococcus and Porphyromonas around Corynebacterium and of Pasteurellaceae around Streptococcus and successfully captured hypothesized structures for all taxa. Further exploration suggested the presence of clustering between Fusobacterium and Leptotrichia, a previously unreported relationship.
Submitted 8 October, 2024; v1 submitted 8 February, 2022;
originally announced February 2022.
-
A Bayesian hierarchical small-area population model accounting for data source specific methodologies from American Community Survey, Population Estimates Program, and Decennial Census data
Authors:
Emily N Peterson,
Rachel C Nethery,
Tullia Padellini,
Jarvis T Chen,
Brent A Coull,
Frederic B Piel,
Jon Wakefield,
Marta Blangiardo,
Lance A Waller
Abstract:
Small area estimates of population are necessary for many epidemiological studies, yet their quality and accuracy are often not assessed. In the United States, small area estimates of population counts are published by the United States Census Bureau (USCB) in the form of the Decennial census counts, Intercensal population projections (PEP), and American Community Survey (ACS) estimates. Although there are significant relationships between these data sources, there are important contrasts in data collection and processing methodologies, such that each set of estimates may be subject to different sources and magnitudes of error. Additionally, these data sources do not report identical small area population counts due to post-survey adjustments specific to each data source. Resulting small area disease/mortality rates may differ depending on which data source is used for population counts (denominator data). To accurately capture annual small area population counts, and associated uncertainties, we present a Bayesian population model (B-Pop), which fuses information from all three USCB sources, accounting for data source specific methodologies and associated errors. The main features of our framework are: 1) a single model integrating multiple data sources, 2) accounting for data source specific data generating mechanisms, and specifically accounting for data source specific errors, and 3) prediction of estimates for years without USCB reported data. We focus our study on the 159 counties of Georgia, and produce estimates for years 2005-2021.
Submitted 17 December, 2021;
originally announced December 2021.
-
Heterogeneous Distributed Lag Models to Estimate Personalized Effects of Maternal Exposures to Air Pollution
Authors:
Daniel Mork,
Marianthi-Anna Kioumourtzoglou,
Marc Weisskopf,
Brent A Coull,
Ander Wilson
Abstract:
Children's health studies support an association between maternal environmental exposures and children's birth outcomes. A common goal is to identify critical windows of susceptibility--periods during gestation with increased association between maternal exposures and a future outcome. The timing of the critical windows and magnitude of the associations are likely heterogeneous across different levels of individual, family, and neighborhood characteristics. Using an administrative Colorado birth cohort we estimate the individualized relationship between weekly exposures to fine particulate matter (PM$_{2.5}$) during gestation and birth weight. To achieve this goal, we propose a statistical learning method combining distributed lag models and Bayesian additive regression trees to estimate critical windows at the individual level and identify characteristics that induce heterogeneity from a high-dimensional set of potential modifying factors. We find evidence of heterogeneity in the PM$_{2.5}$-birth weight relationship, with some mother-child dyads showing a 3 times larger decrease in birth weight for an IQR increase in exposure (5.9 to 8.5 $\mu$g/m$^3$ PM$_{2.5}$) compared to the population average. Specifically, we find increased susceptibility for non-Hispanic mothers who are younger, have higher body mass index, or have lower educational attainment. Our case study is the first precision health study of critical windows.
Submitted 30 June, 2023; v1 submitted 28 September, 2021;
originally announced September 2021.
-
Bayesian Multiple Index Models for Environmental Mixtures
Authors:
Glen McGee,
Ander Wilson,
Thomas F. Webster,
Brent A. Coull
Abstract:
An important goal of environmental health research is to assess the risk posed by mixtures of environmental exposures. Two popular classes of models for mixtures analyses are response-surface methods and exposure-index methods. Response-surface methods estimate high-dimensional surfaces and are thus highly flexible but difficult to interpret. In contrast, exposure-index methods decompose coefficients from a linear model into an overall mixture effect and individual index weights; these models yield easily interpretable effect estimates and efficient inferences when model assumptions hold, but, like most parsimonious models, incur bias when these assumptions do not hold. In this paper we propose a Bayesian multiple index model framework that combines the strengths of each, allowing for non-linear and non-additive relationships between exposure indices and a health outcome, while reducing the dimensionality of the exposure vector and estimating index weights with variable selection. This framework contains response-surface and exposure-index models as special cases, thereby unifying the two analysis strategies. This unification increases the range of models possible for analyzing environmental mixtures and health, allowing one to select an appropriate analysis from a spectrum of models varying in flexibility and interpretability. In an analysis of the association between telomere length and 18 organic pollutants in the National Health and Nutrition Examination Survey (NHANES), the proposed approach fits the data as well as more complex response-surface methods and yields more interpretable results.
Submitted 13 January, 2021;
originally announced January 2021.
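The model structure described in the abstract above can be written out as a worked equation. The notation here is assumed for illustration and is not taken from the listing: $y_i$ is the outcome, $\mathbf{x}_{im}$ is the vector of exposures assigned to index $m$, $\boldsymbol{\theta}_m$ holds the index weights, $\mathbf{z}_i$ holds covariates, and $h$ is an unknown, possibly non-linear and non-additive function.

```latex
$$
y_i \;=\; h\!\left(\mathbf{x}_{i1}^{\top}\boldsymbol{\theta}_1,\;\dots,\;\mathbf{x}_{iM}^{\top}\boldsymbol{\theta}_M\right)
\;+\; \mathbf{z}_i^{\top}\boldsymbol{\gamma} \;+\; \epsilon_i,
\qquad i = 1,\dots,n .
$$
```

Under this assumed notation, placing every exposure in a single index ($M = 1$) gives an exposure-index style model, while giving each exposure its own index gives a response-surface style model, which is one way to read the abstract's claim that both classes arise as special cases.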
-
Reflection on modern methods: Good practices for applied statistical learning in epidemiology
Authors:
Yanelli Nunez,
Elizabeth A. Gibson,
Eva M. Tanner,
Chris Gennings,
Brent A. Coull,
Jeff A. Goldsmith,
Marianthi-Anna Kioumourtzoglou
Abstract:
Statistical learning (SL) includes methods that extract knowledge from complex data. SL methods beyond generalized linear models are being increasingly implemented in public health research and epidemiology because they can perform better in instances with complex or high-dimensional data, settings in which traditional statistical methods fail. These novel methods, however, often include random sampling, which may induce variability in results. Best practices in data science can help to ensure robustness. As a case study, we included four SL models that have been applied previously to analyze the relationship between environmental mixtures and health outcomes. We ran each model across 100 initializing values for random number generation, or "seeds," and assessed variability in resulting estimation and inference. All methods exhibited some seed-dependent variability in results. The degree of variability differed across methods and exposure of interest. Any SL method reliant on a random seed will exhibit some degree of seed sensitivity. We recommend that researchers repeat their analysis with various seeds as a sensitivity analysis when implementing these methods to enhance interpretability and robustness of results.
Submitted 2 October, 2020; v1 submitted 12 June, 2020;
originally announced June 2020.
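The seed-sensitivity check recommended in the abstract above can be sketched as follows. This is a minimal illustration, not one of the four SL models studied in the paper: the estimator `stochastic_slope` (a bootstrap-averaged regression slope), the simulated data, and the choice of summary statistics are all assumptions. Only the idea of refitting across 100 seeds and summarizing the spread comes from the abstract.

```python
import numpy as np

def stochastic_slope(x, y, seed, n_boot=50):
    """Hypothetical seed-dependent estimator: the average of
    least-squares slopes over n_boot bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(x)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample rows with replacement
        xc = x[idx] - x[idx].mean()
        slopes[b] = xc @ (y[idx] - y[idx].mean()) / (xc @ xc)
    return float(slopes.mean())

# Simulated data, held fixed so that only the analysis seed varies.
data_rng = np.random.default_rng(2020)
x = data_rng.normal(size=200)
y = 2.0 * x + data_rng.normal(size=200)

# Repeat the same analysis under 100 different seeds.
estimates = np.array([stochastic_slope(x, y, s) for s in range(100)])

summary = {
    "mean": float(estimates.mean()),
    "sd": float(estimates.std(ddof=1)),
    "min": float(estimates.min()),
    "max": float(estimates.max()),
}
print(summary)
```

Reporting the spread across seeds alongside the point estimate shows how much of a result is attributable to the random number stream rather than to the data.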
-
Function-on-Function Regression for the Identification of Epigenetic Regions Exhibiting Windows of Susceptibility to Environmental Exposures
Authors:
Michele Zemplenyi,
Mark J. Meyer,
Andres Cardenas,
Marie-France Hivert,
Sheryl L. Rifas-Shiman,
Heike Gibson,
Itai Kloog,
Joel Schwartz,
Emily Oken,
Dawn L. DeMeo,
Diane R. Gold,
Brent A. Coull
Abstract:
The ability to identify time periods when individuals are most susceptible to exposures, as well as the biological mechanisms through which these exposures act, is of great public health interest. Growing evidence supports an association between prenatal exposure to air pollution and epigenetic marks, such as DNA methylation, but the timing and gene-specific effects of these epigenetic changes are not well understood. Here, we present the first study that aims to identify prenatal windows of susceptibility to air pollution exposures in cord blood DNA methylation. In particular, we propose a function-on-function regression model that leverages data from nearby DNA methylation probes to identify epigenetic regions that exhibit windows of susceptibility to ambient particulate matter less than 2.5 microns (PM$_{2.5}$). By incorporating the covariance structure among both the multivariate DNA methylation outcome and the time-varying exposure under study, this framework yields greater power to detect windows of susceptibility and greater control of false discoveries than methods that model probes independently. We compare our method to a distributed lag model approach that models DNA methylation in a probe-by-probe manner, both in simulation and by application to motivating data from the Project Viva birth cohort. In two epigenetic regions selected based on prior studies of air pollution effects on epigenome-wide methylation, we identify windows of susceptibility to PM$_{2.5}$ exposure near the beginning and middle of the third trimester of pregnancy.
Submitted 13 December, 2019;
originally announced December 2019.
-
On the Interplay Between Exposure Misclassification and Informative Cluster Size
Authors:
Glen McGee,
Marianthi-Anna Kioumourtzoglou,
Marc G. Weisskopf,
Sebastien Haneuse,
Brent A. Coull
Abstract:
In this paper we study the impact of exposure misclassification when cluster size is potentially informative (i.e., related to outcomes) and when misclassification is differential by cluster size. First, we show that misclassification in an exposure related to cluster size can induce informativeness when cluster size would otherwise be non-informative. Second, we show that misclassification that is differential by informative cluster size can not only attenuate estimates of exposure effects but even inflate or reverse the sign of estimates. To correct for bias in estimating marginal parameters, we propose two frameworks: (i) an observed likelihood approach for joint marginalized models of cluster size and outcomes and (ii) an expected estimating equations approach. Although we focus on estimating marginal parameters, a corollary is that the observed likelihood approach permits valid inference for conditional parameters as well. Using data from the Nurses' Health Study II, we compare the results of the proposed correction methods when applied to motivating data on the multigenerational effect of in-utero diethylstilbestrol exposure on attention-deficit/hyperactivity disorder in 106,198 children of 47,450 nurses.
Submitted 16 October, 2019;
originally announced October 2019.
-
Bayesian Wavelet-packet Historical Functional Linear Models
Authors:
Mark J. Meyer,
Elizabeth J. Malloy,
Brent A. Coull
Abstract:
Historical Functional Linear Models (HFLM) quantify associations between a functional predictor and functional outcome where the predictor is an exposure variable that occurs before, or at least concurrently with, the outcome. Current work on the HFLM is largely limited to frequentist estimation techniques that employ spline-based basis representations. In this work, we propose a novel use of the discrete wavelet-packet transformation, which has not previously been used in functional models, to estimate historical relationships in a fully Bayesian model. Since inference has not been an emphasis of the existing work on HFLMs, we also employ two established Bayesian inference procedures in this historical functional setting. We investigate the operating characteristics of our wavelet-packet HFLM, as well as the two inference procedures, in simulation and use the model to analyze data on the impact of lagged exposure to particulate matter finer than 2.5 $\mu$m on heart rate variability in a cohort of journeyman boilermakers over the course of a day's shift.
Submitted 5 June, 2019;
originally announced June 2019.
-
Kernel Machine and Distributed Lag Models for Assessing Windows of Susceptibility to Environmental Mixtures in Children's Health Studies
Authors:
Ander Wilson,
Hsiao-Hsien Leon Hsu,
Yueh-Hsiu Mathilda Chiu,
Robert O. Wright,
Rosalind J. Wright,
Brent A. Coull
Abstract:
Exposures to environmental chemicals during gestation can alter health status later in life. Most studies of maternal exposure to chemicals during pregnancy have focused on a single chemical exposure observed at high temporal resolution. Recent research has turned to focus on exposure to mixtures of multiple chemicals, generally observed at a single time point. We consider statistical methods for analyzing data on chemical mixtures that are observed at a high temporal resolution. As motivation, we analyze the association between exposure to four ambient air pollutants observed weekly throughout gestation and birth weight in a Boston-area prospective birth cohort. To explore patterns in the data, we first apply methods for analyzing data on (1) a single chemical observed at high temporal resolution, and (2) a mixture measured at a single point in time. We highlight the shortcomings of these approaches for temporally-resolved data on exposure to chemical mixtures. Second, we propose a novel method, a Bayesian kernel machine regression distributed lag model (BKMR-DLM), that simultaneously accounts for nonlinear associations and interactions among time-varying measures of exposure to mixtures. BKMR-DLM uses a functional weight for each exposure that parameterizes the window of susceptibility corresponding to that exposure within a kernel machine framework that captures non-linear and interaction effects of the multivariate exposure on the outcome. In a simulation study, we show that the proposed method can better estimate the exposure-response function and, in high signal settings, can identify critical windows in time during which exposure has an increased association with the outcome. Applying the proposed method to the Boston birth cohort data, we find evidence of a negative association between organic carbon and birth weight and that nitrate modifies the organic carbon, ...
Submitted 21 September, 2021; v1 submitted 28 April, 2019;
originally announced April 2019.
-
A Cross-validated Ensemble Approach to Robust Hypothesis Testing of Continuous Nonlinear Interactions: Application to Nutrition-Environment Studies
Authors:
Jeremiah Zhe Liu,
Jane Lee,
Pi-i Debby Lin,
Linda Valeri,
David C. Christiani,
David C. Bellinger,
Robert O. Wright,
Maitreyi M. Mazumdar,
Brent A. Coull
Abstract:
Gene-environment and nutrition-environment studies often involve testing of high-dimensional interactions between two sets of variables, each having potentially complex nonlinear main effects on an outcome. Construction of a valid and powerful hypothesis test for such an interaction is challenging, due to the difficulty in constructing an efficient and unbiased estimator for the complex, nonlinear main effects. In this work we address this problem by proposing a Cross-validated Ensemble of Kernels (CVEK) that learns the space of appropriate functions for the main effects using a cross-validated ensemble approach. With a carefully chosen library of base kernels, CVEK flexibly estimates the form of the main-effect functions from the data, and encourages test power by guarding against over-fitting under the alternative. The method is motivated by a study on the interaction between metal exposures in utero and maternal nutrition on children's neurodevelopment in rural Bangladesh. The proposed tests identified evidence of an interaction between minerals and vitamins intake and arsenic and manganese exposures. Results suggest that the detrimental effects of these metals are most pronounced at low intake levels of the nutrients, suggesting nutritional interventions in pregnant women could mitigate the adverse impacts of in utero metal exposures on children's neurodevelopment.
Submitted 24 April, 2019;
originally announced April 2019.
-
Adaptive Ensemble Learning of Spatiotemporal Processes with Calibrated Predictive Uncertainty: A Bayesian Nonparametric Approach
Authors:
Jeremiah Zhe Liu,
John Paisley,
Marianthi-Anna Kioumourtzoglou,
Brent A. Coull
Abstract:
Ensemble learning is a mainstay in modern data science practice. Conventional ensemble algorithms assign to base models a set of deterministic, constant model weights that (1) do not fully account for individual models' varying accuracy across data subgroups, nor (2) provide uncertainty estimates for the ensemble prediction. These shortcomings can yield predictions that are precise but biased, which can negatively impact the performance of the algorithm in real-world applications. In this work, we present an adaptive, probabilistic approach to ensemble learning using a transformed Gaussian process as a prior for the ensemble weights. Given input features, our method optimally combines base models based on their predictive accuracy in the feature space, and provides interpretable estimates of the uncertainty associated with both model selection, as reflected by the ensemble weights, and the overall ensemble predictions. Furthermore, to ensure that this quantification of the model uncertainty is accurate, we propose additional machinery to non-parametrically model the ensemble's predictive cumulative density function (CDF) so that it is consistent with the empirical distribution of the data. We apply the proposed method to data simulated from a nonlinear regression model, and to generate a spatial prediction model and associated prediction uncertainties for fine particle levels in eastern Massachusetts, USA.
Submitted 31 March, 2019;
originally announced April 2019.
-
Bayesian data fusion for unmeasured confounding
Authors:
Leah Comment,
Brent A. Coull,
Corwin Zigler,
Linda Valeri
Abstract:
Bayesian causal inference offers a principled approach to policy evaluation of proposed interventions on mediators or time-varying exposures. We outline a general approach to the estimation of causal quantities for settings with time-varying confounding, such as exposure-induced mediator-outcome confounders. We further extend this approach to propose two Bayesian data fusion (BDF) methods for unmeasured confounding. Using informative priors on quantities relating to the confounding bias parameters, our methods incorporate data from an external source where the confounder is measured in order to make inferences about causal estimands in the main study population. We present results from a simulation study comparing our data fusion methods to two common frequentist correction methods for unmeasured confounding bias in the mediation setting. We also demonstrate our method with an investigation of the role of stage at cancer diagnosis in contributing to Black-White colorectal cancer survival disparities.
Submitted 27 February, 2019;
originally announced February 2019.
-
Ordinal Probit Functional Outcome Regression with Application to Computer-Use Behavior in Rhesus Monkeys
Authors:
Mark J. Meyer,
Jeffrey S. Morris,
Regina Paxton Gazes,
Brent A. Coull
Abstract:
Research in functional regression has made great strides in expanding to non-Gaussian functional outcomes, but exploration of ordinal functional outcomes remains limited. Motivated by a study of computer-use behavior in rhesus macaques (Macaca mulatta), we introduce the Ordinal Probit Functional Outcome Regression model (OPFOR). OPFOR models can be fit using one of several basis functions including penalized B-splines, wavelets, and O'Sullivan splines -- the last of which typically performs best. Simulation using a variety of underlying covariance patterns shows that the model performs reasonably well in estimation under multiple basis functions with near nominal coverage for joint credible intervals. Finally, in application, we use Bayesian model selection criteria adapted to functional outcome regression to best characterize the relation between several demographic factors of interest and the monkeys' computer use over the course of a year. In comparison with a standard ordinal longitudinal analysis, OPFOR outperforms a cumulative-link mixed-effects model in simulation and provides additional and more nuanced information on the nature of the monkeys' computer-use behavior.
Submitted 18 March, 2021; v1 submitted 23 January, 2019;
originally announced January 2019.
-
Adaptive and Calibrated Ensemble Learning with Dependent Tail-free Process
Authors:
Jeremiah Zhe Liu,
John Paisley,
Marianthi-Anna Kioumourtzoglou,
Brent A. Coull
Abstract:
Ensemble learning is a mainstay in modern data science practice. Conventional ensemble algorithms assign to base models a set of deterministic, constant model weights that (1) do not fully account for variations in base model accuracy across subgroups, nor (2) provide uncertainty estimates for the ensemble prediction, which could result in mis-calibrated (i.e. precise but biased) predictions that could in turn negatively impact the algorithm performance in real-world applications. In this work, we present an adaptive, probabilistic approach to ensemble learning using a dependent tail-free process as the ensemble weight prior. Given input feature $\mathbf{x}$, our method optimally combines base models based on their predictive accuracy in the feature space $\mathbf{x} \in \mathcal{X}$, and provides interpretable uncertainty estimates both in model selection and in ensemble prediction. To encourage scalable and calibrated inference, we derive a structured variational inference algorithm that jointly minimizes the KL objective and the model's calibration score (i.e. Continuous Ranked Probability Score (CRPS)). We illustrate the utility of our method on both a synthetic nonlinear function regression task, and on the real-world application of spatio-temporal integration of particle pollution prediction models in New England.
Submitted 19 December, 2018; v1 submitted 8 December, 2018;
originally announced December 2018.
-
The Role of Body Mass Index at Diagnosis on Black-White Disparities in Colorectal Cancer Survival: A Density Regression Mediation Approach
Authors:
Katrina L. Devick,
Linda Valeri,
Jarvis Chen,
Alejandro Jara,
Marie-Abèle Bind,
Brent A. Coull
Abstract:
The study of racial/ethnic inequalities in health is important to reduce the uneven burden of disease. In the case of colorectal cancer (CRC), disparities in survival among non-Hispanic Whites and Blacks are well documented, and mechanisms leading to these disparities need to be studied formally. It has also been established that body mass index (BMI) is a risk factor for developing CRC, and recent literature shows BMI at diagnosis of CRC is associated with survival. Since BMI varies by racial/ethnic group, a question that arises is whether disparities in BMI are partially responsible for observed racial/ethnic disparities in CRC survival. This paper presents new methodology to quantify the impact of the hypothetical intervention that matches the BMI distribution in the Black population to a potentially complex distributional form observed in the White population on racial/ethnic disparities in survival. We perform a simulation that shows our proposed Bayesian density regression approach performs as well as or better than current methodology allowing for a shift in the mean of the distribution only, and that standard practice of categorizing BMI leads to large biases. When applied to motivating data from the Cancer Care Outcomes Research and Surveillance (CanCORS) Consortium, our approach suggests the proposed intervention is potentially beneficial for elderly and low income Black patients, yet harmful for young and high income Black populations.
Submitted 16 November, 2018;
originally announced December 2018.
-
CVEK: Robust Estimation and Testing for Nonlinear Effects using Kernel Machine Ensemble
Authors:
Wenying Deng,
Jeremiah Zhe Liu,
Erin Lake,
Brent A. Coull
Abstract:
The R package CVEK introduces a suite of flexible machine learning models and robust hypothesis tests for learning the joint nonlinear effects of multiple covariates in limited samples. It implements the Cross-validated Ensemble of Kernels (CVEK) (Liu and Coull 2017), an ensemble-based kernel machine learning method that adaptively learns the joint nonlinear effect of multiple covariates from data, and provides powerful hypothesis tests for both main effects of features and interactions among features. The R package CVEK provides a flexible, easy-to-use implementation of CVEK, and offers a wide range of choices for the kernel family (for instance, polynomial, radial basis functions, Matérn, neural network, and others), model selection criteria, ensembling method (averaging, exponential weighting, cross-validated stacking), and the type of hypothesis test (asymptotic or parametric bootstrap). Through extensive simulations we demonstrate the validity and robustness of this approach, and provide practical guidelines on how to design an estimation strategy for optimal performance in different data scenarios.
Submitted 18 December, 2020; v1 submitted 26 November, 2018;
originally announced November 2018.
-
Bayesian kernel machine regression-causal mediation analysis
Authors:
Katrina L. Devick,
Jennifer F. Bobb,
Maitreyi Mazumdar,
Birgit Claus Henn,
David C. Bellinger,
David C. Christiani,
Robert O. Wright,
Paige L. Williams,
Brent A. Coull,
Linda Valeri
Abstract:
Greater understanding of the pathways through which an environmental mixture operates is important to design effective interventions. We present new methodology to estimate natural direct and indirect effects and controlled direct effects of a complex mixture exposure on an outcome through a mediator variable. We implement Bayesian Kernel Machine Regression (BKMR) to allow for all possible interactions and nonlinear effects of (1) the co-exposures on the mediator, (2) the co-exposures and mediator on the outcome, and (3) selected covariates on the mediator and/or outcome. From the posterior predictive distributions of the mediator and outcome, we simulate counterfactuals to obtain posterior samples, estimates, and credible intervals of the mediation effects. Our simulation study demonstrates that when the exposure-mediator and exposure-mediator-outcome relationships are complex, BKMR--Causal Mediation Analysis performs better than current mediation methods. We applied our methodology to quantify the contribution of birth length as a mediator between in utero co-exposure to arsenic, manganese and lead, and children's neurodevelopmental scores, in a prospective birth cohort in Bangladesh. Among younger children, we found a negative (adverse) association between the metal mixture and neurodevelopment. We also found evidence that birth length mediates the effect of exposure to the metal mixture on neurodevelopment for younger children. If birth length were fixed to its $75^{th}$ percentile value, the harmful effect of the metal mixture on neurodevelopment would be attenuated, suggesting nutritional interventions to help increase fetal growth, and thus birth length, could potentially block the harmful effect of the metal mixture on neurodevelopment.
Submitted 21 December, 2021; v1 submitted 26 November, 2018;
originally announced November 2018.
-
A Variational Inference Algorithm for BKMR in the Cross-Sectional Setting
Authors:
Raphael Small,
Brent A. Coull
Abstract:
The identification of pollutant effects is an important task in environmental health. Bayesian kernel machine regression (BKMR) is a standard tool for inference of individual-level pollutant health-effects, and we present a mean field Variational Inference (VI) algorithm for quick inference when only a single response per individual is recorded. Using simulation studies in the case of informative priors, we show that VI, although fast, produces anti-conservative credible intervals of covariate effects and conservative credible intervals for pollutant effects. To correct the coverage probabilities of covariate effects, we propose a simple Generalized Least Squares (GLS) approach that induces conservative credible intervals. We also explore using BKMR with flat priors and find that, while slower than the case with informative priors, this approach yields uncorrected credible intervals for covariate effects with coverage probabilities that are much closer to the nominal 95% level. We further note that fitting BKMR by VI provides a remarkable improvement in speed over existing MCMC methods.
Submitted 6 November, 2018;
originally announced November 2018.
-
Estimating the health effects of environmental mixtures using Bayesian semiparametric regression and sparsity inducing priors
Authors:
Joseph Antonelli,
Maitreyi Mazumdar,
David Bellinger,
David C. Christiani,
Robert Wright,
Brent A. Coull
Abstract:
Humans are routinely exposed to mixtures of chemical and other environmental factors, making the quantification of health effects associated with environmental mixtures a critical goal for establishing environmental policy sufficiently protective of human health. The quantification of the effects of exposure to an environmental mixture poses several statistical challenges. It is often the case that exposures to multiple pollutants interact with each other to affect an outcome. Further, the exposure-response relationship between an outcome and some exposures, such as some metals, can exhibit complex, nonlinear forms, since some exposures can be beneficial and detrimental at different ranges of exposure. To estimate the health effects of complex mixtures we propose a flexible Bayesian approach that allows exposures to interact with each other and have nonlinear relationships with the outcome. We induce sparsity using multivariate spike and slab priors to determine which exposures are associated with the outcome, and which exposures interact with each other. The proposed approach is interpretable, as we can use the posterior probabilities of inclusion into the model to identify pollutants that interact with each other. We illustrate our approach's ability to estimate complex functions using simulated data, and apply our method to two studies to determine which environmental pollutants adversely affect health.
Submitted 29 October, 2019; v1 submitted 30 November, 2017;
originally announced November 2017.
-
Bayesian Variable Selection for Multivariate Zero-Inflated Models: Application to Microbiome Count Data
Authors:
Kyu Ha Lee,
Brent A. Coull,
Anna-Barbara Moscicki,
Bruce J. Paster,
Jacqueline R. Starr
Abstract:
Microorganisms play critical roles in human health and disease. It is well known that microbes live in diverse communities in which they interact synergistically or antagonistically. Thus for estimating microbial associations with clinical covariates, multivariate statistical models are preferred. Multivariate models allow one to estimate and exploit complex interdependencies among multiple taxa, yielding more powerful tests of exposure or treatment effects than application of taxon-specific univariate analyses. In addition, the analysis of microbial count data requires special attention because data commonly exhibit zero inflation. To meet these needs, we developed a Bayesian variable selection model for multivariate count data with excess zeros that incorporates information on the covariance structure of the outcomes (counts for multiple taxa), while estimating associations with the mean levels of these outcomes. Although there has been a great deal of effort in zero-inflated models for longitudinal data, little attention has been given to high-dimensional multivariate zero-inflated data modeled via a general correlation structure. Through simulation, we compared performance of the proposed method to that of existing univariate approaches, for both the binary and count parts of the model. When outcomes were correlated the proposed variable selection method maintained type I error while boosting the ability to identify true associations in the binary component of the model. For the count part of the model, in some scenarios the univariate method had higher power than the multivariate approach. This higher power was at a cost of a highly inflated false discovery rate not observed with the proposed multivariate method. We applied the approach to oral microbiome data from the Pediatric HIV/AIDS Cohort Oral Health Study and identified five species (of 44) associated with HIV infection.
Submitted 20 May, 2018; v1 submitted 31 October, 2017;
originally announced November 2017.
-
Bayesian Distributed Lag Interaction Models to Identify Perinatal Windows of Vulnerability in Children's Health
Authors:
Ander Wilson,
Yueh-Hsiu Mathilda Chiu,
Hsiao-Hsien Leon Hsu,
Robert O. Wright,
Rosalind J. Wright,
Brent A. Coull
Abstract:
Epidemiological research supports an association between maternal exposure to air pollution during pregnancy and adverse children's health outcomes. Advances in exposure assessment and statistics allow for estimation of both critical windows of vulnerability and exposure effect heterogeneity. Simultaneous estimation of windows of vulnerability and effect heterogeneity can be accomplished by fitting a distributed lag model (DLM) stratified by subgroup. However, this can provide an incomplete picture of how effects vary across subgroups because it does not allow for subgroups to have the same window but different within-window effects or to have different windows but the same within-window effect. Because the timing of some developmental processes is common across subpopulations of infants while for others the timing differs across subgroups, both scenarios are important to consider when evaluating health risks of prenatal exposures. We propose a new approach that partitions the DLM into a constrained functional predictor that estimates windows of vulnerability and a scalar effect representing the within-window effect directly. The proposed method allows for heterogeneity in only the window, only the within-window effect, or both. In a simulation study we show that a model assuming a shared component across groups results in lower bias and mean squared error for the estimated windows and effects when that component is in fact constant across groups. We apply the proposed method to estimate windows of vulnerability in the association between prenatal exposures to fine particulate matter and each of birth weight and asthma incidence, and estimate how these associations vary by sex and maternal obesity status, in a Boston-area prospective pre-birth cohort study.
Submitted 17 December, 2016;
originally announced December 2016.
-
General Design Bayesian Generalized Linear Mixed Models
Authors:
Y. Zhao,
J. Staudenmayer,
B. A. Coull,
M. P. Wand
Abstract:
Linear mixed models are able to handle an extraordinary range of complications in regression-type analyses. Their most common use is to account for within-subject correlation in longitudinal data analysis. They are also the standard vehicle for smoothing spatial count data. However, when treated in full generality, mixed models can also handle spline-type smoothing and closely approximate kriging. This allows for nonparametric regression models (e.g., additive models and varying coefficient models) to be handled within the mixed model framework. The key is to allow the random effects design matrix to have general structure; hence our label general design. For continuous response data, particularly when Gaussianity of the response is reasonably assumed, computation is now quite mature and supported by the R, SAS and S-PLUS packages. Such is not the case for binary and count responses, where generalized linear mixed models (GLMMs) are required, but are hindered by the presence of intractable multivariate integrals. Software known to us supports special cases of the GLMM (e.g., PROC NLMIXED in SAS or glmmML in R) or relies on the sometimes crude Laplace-type approximation of integrals (e.g., the SAS macro glimmix or glmmPQL in R). This paper describes the fitting of general design generalized linear mixed models. A Bayesian approach is taken and Markov chain Monte Carlo (MCMC) is used for estimation and inference. In this generalized setting, MCMC requires sampling from nonstandard distributions. In this article, we demonstrate that the MCMC package WinBUGS facilitates sound fitting of general design Bayesian generalized linear mixed models in practice.
Submitted 20 June, 2006;
originally announced June 2006.