A Stochastic Model-Based Control Methodology for Glycemic Management in the Intensive Care Unit

Authors: Melike Sirlanci, George Hripcsak, Cecilia C. Low Wang, J. N. Stroh, Yanran Wang, Tellen D. Bennett, Andrew M. Stuart, David J. Albers

Abstract: Intensive care unit (ICU) patients exhibit erratic blood glucose (BG) fluctuations, including hypoglycemic and hyperglycemic episodes, and require exogenous insulin delivery to keep their BG in healthy ranges. Glycemic control via glycemic management (GM) is associated with reduced mortality and morbidity in the ICU, but GM increases the cognitive load on clinicians. The availability of robust, accurate, and actionable clinical decision support (CDS) tools reduces this burden and assists in the decision-making process to improve health outcomes. Clinicians currently follow GM protocol flow charts for patient intravenous insulin delivery rate computations. We present a mechanistic model-based control algorithm that predicts the optimal intravenous insulin rate to keep BG within a target range; the goal is to develop this approach for eventual use within CDS systems. In this control framework, we employed a stochastic model representing BG dynamics in the ICU setting and used the linear quadratic Gaussian control methodology to develop a controller. We designed two experiments, one using virtual (simulated) patients and one using a real-world retrospective dataset. Using these, we evaluate the safety and efficacy of this model-based glycemic control methodology. The presented controller avoids hypoglycemia and hyperglycemia in virtual patients, maintaining BG levels in the target range more consistently than two existing GM protocols. Moreover, this methodology could theoretically prevent a large proportion of hypoglycemic and hyperglycemic events recorded in a real-world retrospective dataset.

Submitted 3 July, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

Comments: 26 pages, 4 figures, 5 tables

MSC Class: 49-11

ACM Class: I.6.3

arXiv:2403.14563 [pdf, other]

Evaluating the impact of instrumental variables in propensity score models using synthetic and negative control experiments

Authors: Yuxi Tian, Nicole Pratt, Laura L Hester, George Hripcsak, Martijn J Schuemie, Marc A Suchard

Abstract: In pharmacoepidemiology research, instrumental variables (IVs) are variables that strongly predict treatment but have no causal effect on the outcome of interest except through the treatment. There remain concerns about the inclusion of IVs in propensity score (PS) models amplifying estimation bias and reducing precision. Some PS modeling approaches attempt to address the potential effects of IVs, including selecting only covariates for the PS model that are strongly associated with the outcome of interest, thus screening out IVs. We conduct a study utilizing simulations and negative control experiments to evaluate the effect of IVs on PS model performance and to uncover best PS practices for real-world studies. We find that simulated IVs have a weak effect on bias and precision in both simulations and negative control experiments based on real-world data. In simulation experiments, PS methods that utilize outcome data, including the high-dimensional propensity score, produce the least estimation bias. However, in real-world settings underlying causal structures are unknown, and negative control experiments can illustrate a PS model's ability to minimize systematic bias. We find that large-scale, regularized regression-based PS models in this case provide the most centered negative control distributions, suggesting superior performance in real-world scenarios.

Submitted 21 March, 2024; originally announced March 2024.

arXiv:2402.04400 [pdf, other]

CEHR-GPT: Generating Electronic Health Records with Chronological Patient Timelines

Authors: Chao Pang, Xinzhuo Jiang, Nishanth Parameshwar Pavinkurve, Krishna S. Kalluri, Elise L. Minto, Jason Patterson, Linying Zhang, George Hripcsak, Gamze Gürsoy, Noémie Elhadad, Karthik Natarajan

Abstract: Synthetic Electronic Health Records (EHR) have emerged as a pivotal tool in advancing healthcare applications and machine learning models, particularly for researchers without direct access to healthcare data. Although existing methods, like rule-based approaches and generative adversarial networks (GANs), generate synthetic data that resembles real-world EHR data, these methods often use a tabular format, disregarding temporal dependencies in patient histories and limiting data replication. Recently, there has been a growing interest in leveraging Generative Pre-trained Transformers (GPT) for EHR data. This enables applications like disease progression analysis, population estimation, counterfactual reasoning, and synthetic data generation. In this work, we focus on synthetic data generation and demonstrate the capability of training a GPT model using a particular patient representation derived from CEHR-BERT, enabling us to generate patient sequences that can be seamlessly converted to the Observational Medical Outcomes Partnership (OMOP) data format.

Submitted 5 May, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

arXiv:2307.05727 [pdf]

An Open-Source Knowledge Graph Ecosystem for the Life Sciences

Authors: Tiffany J. Callahan, Ignacio J. Tripodi, Adrianne L. Stefanski, Luca Cappelletti, Sanya B. Taneja, Jordan M. Wyrwa, Elena Casiraghi, Nicolas A. Matentzoglu, Justin Reese, Jonathan C. Silverstein, Charles Tapley Hoyt, Richard D. Boyce, Scott A. Malec, Deepak R. Unni, Marcin P. Joachimiak, Peter N. Robinson, Christopher J. Mungall, Emanuele Cavalleri, Tommaso Fontana, Giorgio Valentini, Marco Mesiti, Lucas A. Gillenwater, Brook Santangelo, Nicole A. Vasilevsky, Robert Hoehndorf, et al. (7 additional authors not shown)

Abstract: Translational research requires data at multiple scales of biological organization. Advancements in sequencing and multi-omics technologies have increased the availability of these data, but researchers face significant integration challenges. Knowledge graphs (KGs) are used to model complex phenomena, and methods exist to construct them automatically. However, tackling complex biomedical integration problems requires flexibility in the way knowledge is modeled. Moreover, existing KG construction methods provide robust tooling at the cost of fixed or limited choices among knowledge representation models. PheKnowLator (Phenotype Knowledge Translator) is a semantic ecosystem for automating the FAIR (Findable, Accessible, Interoperable, and Reusable) construction of ontologically grounded KGs with fully customizable knowledge representation. The ecosystem includes KG construction resources (e.g., data preparation APIs), analysis tools (e.g., SPARQL endpoints and abstraction algorithms), and benchmarks (e.g., prebuilt KGs and embeddings). We evaluated the ecosystem by systematically comparing it to existing open-source KG construction methods and by analyzing its computational performance when used to construct 12 large-scale KGs. With flexible knowledge representation, PheKnowLator enables fully customizable KGs without compromising performance or usability.

Submitted 30 January, 2024; v1 submitted 11 July, 2023; originally announced July 2023.

arXiv:2305.12034 [pdf, other]

Bayesian Safety Surveillance with Adaptive Bias Correction

Authors: Fan Bu, Martijn J. Schuemie, Akihiko Nishimura, Louisa H. Smith, Kristin Kostka, Thomas Falconer, Jody-Ann McLeggon, Patrick B. Ryan, George Hripcsak, Marc A. Suchard

Abstract: Post-market safety surveillance is an integral part of mass vaccination programs. Typically relying on sequential analysis of real-world health data as they accrue, safety surveillance is challenged by the difficulty of sequential multiple testing and by biases induced by residual confounding. The current standard approach based on the maximized sequential probability ratio test (MaxSPRT) fails to satisfactorily address these practical challenges and it remains a rigid framework that requires pre-specification of the surveillance schedule. We develop an alternative Bayesian surveillance procedure that addresses both challenges using a more flexible framework. We adopt a joint statistical modeling approach to sequentially estimate the effect of vaccine exposure on the adverse event of interest and correct for estimation bias by simultaneously analyzing a large set of negative control outcomes through a Bayesian hierarchical model. We then compute a posterior probability of the alternative hypothesis via Markov chain Monte Carlo sampling and use it for sequential detection of safety signals. Through an empirical evaluation using six US observational healthcare databases covering more than 360 million patients, we benchmark the proposed procedure against MaxSPRT on testing errors and estimation accuracy, under two epidemiological designs, the historical comparator and the self-controlled case series. We demonstrate that our procedure substantially reduces Type 1 error rates, maintains high statistical power, delivers fast signal detection, and provides considerably more accurate estimation. As an effort to promote open science, we present all empirical results in an R ShinyApp and provide full implementation of our method in the R package EvidenceSynthesis.

Submitted 19 May, 2023; originally announced May 2023.

arXiv:2305.06513 [pdf, other]

Interpretable Forecasting of Physiology in the ICU Using Constrained Data Assimilation and Electronic Health Record Data

Authors: David Albers, Melike Sirlanci, Matthew Levine, Jan Claassen, Caroline Der Nigoghossian, George Hripcsak

Abstract: Prediction of physiologic states is important in medical practice because interventions are guided by predicted impacts of interventions. But prediction is difficult in medicine because the generating system is complex and difficult to understand from data alone, and the data are sparse relative to the complexity of the generating processes due to human costs of data collection. Computational machinery can potentially make prediction more accurate, but working within the constraints of realistic clinical data makes robust inference difficult because the data are sparse, noisy, and nonstationary. This paper focuses on prediction given sparse, nonstationary electronic health record data in the intensive care unit (ICU) using data assimilation (DA), a broad collection of methods that pairs mechanistic models with inference machinery such as the Kalman filter. We find that making inference with sparse clinical data accurate and robust requires advancements beyond standard DA methods combined with additional machine learning methods. Specifically, we show that combining the newly developed constrained ensemble Kalman filter with machine learning methods can produce substantial gains in robustness and accuracy while minimizing the data requirements. We also identify limitations of Kalman filtering methods that lead to new problems to be overcome to make inference feasible in clinical settings using realistic clinical data.

Submitted 10 May, 2023; originally announced May 2023.

arXiv:2211.11183 [pdf, other]

Causal Fairness Assessment of Treatment Allocation with Electronic Health Records

Authors: Linying Zhang, Lauren R. Richter, Yixin Wang, Anna Ostropolets, Noemie Elhadad, David M. Blei, George Hripcsak

Abstract: Healthcare continues to grapple with the persistent issue of treatment disparities, sparking concerns regarding the equitable allocation of treatments in clinical practice. While various fairness metrics have emerged to assess fairness in decision-making processes, a growing focus has been on causality-based fairness concepts due to their capacity to mitigate confounding effects and reason about bias. However, the application of causal fairness notions in evaluating the fairness of clinical decision-making with electronic health record (EHR) data remains an understudied domain. This study aims to address the methodological gap in assessing causal fairness of treatment allocation with electronic health records data. We propose a causal fairness algorithm to assess fairness in clinical decision-making. Our algorithm accounts for the heterogeneity of patient populations and identifies potential unfairness in treatment allocation by conditioning on patients who have the same likelihood to benefit from the treatment. We apply this framework to a patient cohort with coronary artery disease derived from an EHR database to evaluate the fairness of treatment decisions. In addition, we investigate the impact of social determinants of health on the assessment of causal fairness of treatment allocation.

Submitted 7 January, 2024; v1 submitted 21 November, 2022; originally announced November 2022.

arXiv:2209.04732 [pdf]

Ontologizing Health Systems Data at Scale: Making Translational Discovery a Reality

Authors: Tiffany J. Callahan, Adrianne L. Stefanski, Jordan M. Wyrwa, Chenjie Zeng, Anna Ostropolets, Juan M. Banda, William A. Baumgartner Jr., Richard D. Boyce, Elena Casiraghi, Ben D. Coleman, Janine H. Collins, Sara J. Deakyne-Davies, James A. Feinstein, Melissa A. Haendel, Asiyah Y. Lin, Blake Martin, Nicolas A. Matentzoglu, Daniella Meeker, Justin Reese, Jessica Sinclair, Sanya B. Taneja, Katy E. Trinkley, Nicole A. Vasilevsky, Andrew Williams, Xingman A. Zhang, et al. (7 additional authors not shown)

Abstract: Background: Common data models solve many challenges of standardizing electronic health record (EHR) data, but are unable to semantically integrate all the resources needed for deep phenotyping. Open Biological and Biomedical Ontology (OBO) Foundry ontologies provide computable representations of biological knowledge and enable the integration of heterogeneous data. However, mapping EHR data to OBO ontologies requires significant manual curation and domain expertise. Objective: We introduce OMOP2OBO, an algorithm for mapping Observational Medical Outcomes Partnership (OMOP) vocabularies to OBO ontologies. Results: Using OMOP2OBO, we produced mappings for 92,367 conditions, 8,611 drug ingredients, and 10,673 measurement results, which covered 68-99% of concepts used in clinical practice when examined across 24 hospitals. When used to phenotype rare disease patients, the mappings helped systematically identify undiagnosed patients who might benefit from genetic testing. Conclusions: By aligning OMOP vocabularies to OBO ontologies, our algorithm presents new opportunities to advance EHR-based deep phenotyping.

Submitted 30 January, 2023; v1 submitted 10 September, 2022; originally announced September 2022.

Comments: Supplementary Material is included at the end of the manuscript

ACM Class: J.3

arXiv:2110.12235 [pdf, other]

doi 10.1016/j.jbi.2022.104204

Adjusting for indirectly measured confounding using large-scale propensity scores

Authors: Linying Zhang, Yixin Wang, Martijn Schuemie, David Blei, George Hripcsak

Abstract: Confounding remains one of the major challenges to causal inference with observational data. This problem is paramount in medicine, where we would like to answer causal questions from large observational datasets like electronic health records (EHRs) and administrative claims. Modern medical data typically contain tens of thousands of covariates. Such a large set carries hope that many of the confounders are directly measured, and further hope that others are indirectly measured through their correlation with measured covariates. How can we exploit these large sets of covariates for causal inference? To help answer this question, this paper examines the performance of the large-scale propensity score (LSPS) approach on causal analysis of medical data. We demonstrate that LSPS may adjust for indirectly measured confounders by including tens of thousands of covariates that may be correlated with them. We present conditions under which LSPS removes bias due to indirectly measured confounders, and we show that LSPS may avoid bias when inadvertently adjusting for variables (like colliders) that otherwise can induce bias. We demonstrate the performance of LSPS with both simulated medical data and real medical data.

Submitted 8 January, 2024; v1 submitted 23 October, 2021; originally announced October 2021.

arXiv:2007.09309 [pdf, other]

doi 10.1063/5.0027682

Delay-Induced Uncertainty for a Paradigmatic Glucose-Insulin Model

Authors: Bhargav Karamched, George Hripcsak, Dave Albers, William Ott

Abstract: Medical practice in the intensive care unit is based on the supposition that physiological systems such as the human glucose-insulin system are predictable. We demonstrate that delay within the glucose-insulin system can induce sustained temporal chaos, rendering the system unpredictable. Specifically, we exhibit such chaos for the Ultradian glucose-insulin model. This well-validated, finite-dimensional model represents feedback delay as a three-stage filter. Using the theory of rank one maps from smooth dynamical systems, we precisely explain the nature of the resulting delay-induced uncertainty (DIU). We develop a recipe one may use to diagnose DIU in a general oscillatory dynamical system. For infinite-dimensional delay systems, no analog of the theory of rank one maps exists. Nevertheless, we show that the geometric principles encoded in our DIU recipe apply to such systems by exhibiting sustained temporal chaos for a linear shear flow. Our results are potentially broadly applicable because delay is ubiquitous throughout mathematical physiology.

Submitted 14 April, 2021; v1 submitted 17 July, 2020; originally announced July 2020.

Comments: 19 pages; 9 figures

MSC Class: 92C50; 92C30; 37N25; 37D25; 37D45; 37G35

arXiv:2003.06541 [pdf, ps, other]

Using Data Assimilation of Mechanistic Models to Estimate Glucose and Insulin Metabolism

Authors: Jami J. Mulgrave, Matthew E. Levine, David J. Albers, Joon Ha, Arthur Sherman, George Hripcsak

Abstract: Motivation: There is a growing need to integrate mechanistic models of biological processes with computational methods in healthcare in order to improve prediction. We apply data assimilation in the context of Type 2 diabetes to understand parameters associated with the disease. Results: The data assimilation method captures how well patients improve glucose tolerance after their surgery. Data assimilation has the potential to improve phenotyping in Type 2 diabetes.

Submitted 13 March, 2020; originally announced March 2020.

arXiv:2003.06002 [pdf, other]

Bayesian Posterior Interval Calibration to Improve the Interpretability of Observational Studies

Authors: Jami J. Mulgrave, David Madigan, George Hripcsak

Abstract: Observational healthcare data offer the potential to estimate causal effects of medical products on a large scale. However, the confidence intervals and p-values produced by observational studies only account for random error and fail to account for systematic error. As a consequence, operating characteristics such as confidence interval coverage and Type I error rates often deviate sharply from their nominal values and render interpretation impossible. While there is longstanding awareness of systematic error in observational studies, analytic approaches to empirically account for systematic error are relatively new. Several authors have proposed approaches using negative controls (also known as "falsification hypotheses") and positive controls. The basic idea is to adjust confidence intervals and p-values in light of the bias (if any) detected in the analyses of the negative and positive controls. In this work, we propose a Bayesian statistical procedure for posterior interval calibration that uses negative and positive controls. We show that the posterior interval calibration procedure restores nominal characteristics, such as 95% coverage of the true effect size by the 95% posterior interval.

Submitted 1 May, 2024; v1 submitted 12 March, 2020; originally announced March 2020.

arXiv:1904.02098 [pdf, other]

The Medical Deconfounder: Assessing Treatment Effects with Electronic Health Records

Authors: Linying Zhang, Yixin Wang, Anna Ostropolets, Jami J. Mulgrave, David M. Blei, George Hripcsak

Abstract: The treatment effects of medications play a key role in guiding medical prescriptions. They are usually assessed with randomized controlled trials (RCTs), which are expensive. Recently, large-scale electronic health records (EHRs) have become available, opening up new opportunities for more cost-effective assessments. However, assessing a treatment effect from EHRs is challenging: it is biased by unobserved confounders, unmeasured variables that affect both patients' medical prescriptions and their outcomes, e.g., the patients' socioeconomic status. To adjust for unobserved confounders, we develop the medical deconfounder, a machine learning algorithm that unbiasedly estimates treatment effects from EHRs. The medical deconfounder first constructs a substitute confounder by modeling which medications were prescribed to each patient; this substitute confounder is guaranteed to capture all multi-medication confounders, observed or unobserved (arXiv:1805.06826). It then uses this substitute confounder to adjust for the confounding bias in the analysis. We validate the medical deconfounder on two simulated and two real medical data sets. Compared to classical approaches, the medical deconfounder produces closer-to-truth treatment effect estimates; it also identifies effective medications that are more consistent with the findings in the medical literature.

Submitted 17 August, 2019; v1 submitted 3 April, 2019; originally announced April 2019.

arXiv:1902.01978 [pdf, other]

The Parameter Houlihan: a solution to high-throughput identifiability indeterminacy for brutally ill-posed problems

Authors: DJ Albers, M Levine, L Mamykina, G Hripcsak

Abstract: One way to interject knowledge into clinically impactful forecasting is to use data assimilation, a nonlinear regression that projects data onto a mechanistic physiologic model, instead of a set of functions, such as neural networks. Such regressions have an advantage of being useful with particularly sparse, non-stationary clinical data. However, physiological models are often nonlinear and can have many parameters, leading to potential problems with parameter identifiability, or the ability to find a unique set of parameters that minimize forecasting error. The identifiability problems can be minimized or eliminated by reducing the number of parameters estimated, but reducing the number of estimated parameters also reduces the flexibility of the model and hence increases forecasting error. We propose a method, the parameter Houlihan, that combines traditional machine learning techniques with data assimilation, to select the right set of model parameters to minimize forecasting error while reducing identifiability problems. The method worked well: the data assimilation-based glucose forecasts and estimates for our cohort using the Houlihan-selected parameter sets generally also minimize forecasting errors compared to other parameter selection methods such as by-hand parameter selection. Nevertheless, the forecast with the lowest forecast error does not always accurately represent physiology, but further advancements of the algorithm provide a path for improving physiologic fidelity as well. Our hope is that this methodology represents a first step toward combining machine learning with data assimilation and provides a lower-threshold entry point for using data assimilation with clinical data by helping select the right parameters to estimate.

Submitted 5 February, 2019; originally announced February 2019.

arXiv:1811.06183 [pdf]

Characterizing Design Patterns of EHR-Driven Phenotype Extraction Algorithms

Authors: Yizhen Zhong, Luke Rasmussen, Yu Deng, Jennifer Pacheco, Maureen Smith, Justin Starren, Wei-Qi Wei, Peter Speltz, Joshua Denny, Nephi Walton, George Hripcsak, Christopher G Chute, Yuan Luo

Abstract: The automatic development of phenotype algorithms from Electronic Health Record data with machine learning (ML) techniques is of great interest given that the current practice is very time-consuming and resource-intensive. The extraction of design patterns from phenotype algorithms is essential to understand their rationale and standard, with great potential to automate the development process. In this pilot study, we perform network visualization on the design patterns and their associations with phenotypes and sites. We classify design patterns using the fragments from previously annotated phenotype algorithms as the ground truth. The classification performance is used as a proxy for coherence at the attribution level. The bag-of-words representation with knowledge-based features performed well in the classification task (macro-F1 score of 0.79). Good classification accuracy with simple features demonstrated the attribution coherence and the feasibility of automatic identification of design patterns. Our results point to both the feasibility and challenges of automatic identification of phenotyping design patterns, which would power the automatic development of phenotype algorithms.

Submitted 15 November, 2018; originally announced November 2018.

Comments: 4 pages, accepted by IEEE BIBM 2018 as short paper

arXiv:1803.10791 [pdf]

A systematic approach to improving the reliability and scale of evidence from health care data

Authors: Martijn J. Schuemie, Patrick B. Ryan, George Hripcsak, David Madigan, Marc A. Suchard

Abstract: Concerns over reproducibility in science extend to research using existing healthcare data; many observational studies investigating the same topic produce conflicting results, even when using the same data. To address this problem, we propose a paradigm shift. The current paradigm centers on generating one estimate at a time using a unique study design with unknown reliability and publishing (or not) one estimate at a time. The new paradigm advocates for high-throughput observational studies using consistent and standardized methods, allowing evaluation, calibration, and unbiased dissemination to generate a more reliable and complete evidence base. We demonstrate this new paradigm by comparing all depression treatments for a set of outcomes, producing 17,718 hazard ratios, each using methodology on par with state-of-the-art studies. We furthermore include control hypotheses to evaluate and calibrate our evidence generation process. Results show good transitivity and consistency between databases, and agree with four out of the five findings from clinical trials. The distribution of effect size estimates reported in literature reveals an absence of small or null effects, with a sharp cutoff at p = 0.05. No such phenomena were observed in our results, suggesting more complete and more reliable evidence.

Submitted 28 March, 2018; originally announced March 2018.

Comments: 24 pages, 6 figures, 2 tables, 28 pages supplementary materials

arXiv:1801.08929 [pdf]

Methodological variations in lagged regression for detecting physiologic drug effects in EHR data

Authors: Matthew E. Levine, David J. Albers, George Hripcsak

Abstract: We studied how lagged linear regression can be used to detect the physiologic effects of drugs from data in the electronic health record (EHR). We systematically examined the effect of methodological variations ((i) time series construction, (ii) temporal parameterization, (iii) intra-subject normalization, (iv) differencing (lagged rates of change achieved by taking differences between consecutive measurements), (v) explanatory variables, and (vi) regression models) on performance of lagged linear methods in this context. We generated two gold standards (one knowledge-base derived, one expert-curated) for expected pairwise relationships between 7 drugs and 4 labs, and evaluated how the 64 unique combinations of methodological perturbations reproduce gold standards. Our 28 cohorts included patients in the Columbia University Medical Center/NewYork-Presbyterian Hospital clinical database. The most accurate methods achieved AUROC of 0.794 for knowledge-base derived gold standard (95% CI [0.741, 0.847]) and 0.705 for expert-curated gold standard (95% CI [0.629, 0.781]). We observed a 0.633 mean AUROC (95% CI [0.610, 0.657], expert-curated gold standard) across all methods that re-parameterize time according to sequence and use either a joint autoregressive model with differencing or an independent lag model without differencing. The complement of this set of methods achieved a mean AUROC close to 0.5, indicating the importance of these choices.
We conclude that time-series analysis of EHR data will likely rely on some of the beneficial pre-processing and modeling methodologies identified, and will certainly benefit from continued careful analysis of methodological perturbations. This study found that methodological variations, such as pre-processing and representations, significantly affect results, exposing the importance of evaluating these components when comparing machine-learning methods.

Submitted 26 January, 2018; originally announced January 2018.

arXiv:1709.00163 [pdf, other]

Offline and online data assimilation for real-time blood glucose forecasting in type 2 diabetes

Authors: Matthew E Levine, George Hripcsak, Lena Mamykina, Andrew Stuart, David J Albers

Abstract: We evaluate the benefits of combining different offline and online data assimilation methodologies to improve personalized blood glucose prediction with type 2 diabetes self-monitoring data. We collect self-monitoring data (nutritional reports and pre- and post-prandial glucose measurements) from 4 individuals with diabetes and 2 individuals without diabetes. We write online to refer to methods that update state and parameters sequentially as nutrition and glucose data are received, and offline to refer to methods that estimate parameters over a fixed data set, distributed over a time window containing multiple nutrition and glucose measurements. We fit a model of ultradian glucose dynamics to the first half of each data set using offline (MCMC and nonlinear optimization) and online (unscented Kalman filter and an unfiltered model, i.e., a dynamical model driven by nutrition data that does not update states) data assimilation methods. Model parameters estimated over the first half of the data are used within online forecasting methods to issue forecasts over the second half of each data set. Offline data assimilation methods provided consistent advantages in predictive performance and practical usability in 4 of 6 patient data sets compared to online data assimilation methods alone; yet 2 of 6 patients were best predicted with a strictly online approach.
Interestingly, parameter estimates generated offline led to worse predictions when fed to a stochastic filter than when used in a simple, unfiltered model that incorporates new nutritional information, but does not update model states based on glucose measurements. The relative improvements seen from the unfiltered model, when carefully trained offline, expose challenges in model sensitivity and filtering applications, but also open possibilities for improved glucose forecasting and relaxed patient self-monitoring requirements.

Submitted 1 September, 2017; originally announced September 2017.

arXiv:1305.7271 [pdf, other]

A methodology for detecting and exploring non-convulsive seizures in patients with SAH

Authors: D J Albers, J Claassen, M J Schmidt, G Hripcsak

Abstract: A methodology for understanding and detecting nonconvulsive seizures in individuals with subarachnoid hemorrhage is introduced. Specifically, beginning with an EEG signal, the power spectrum is estimated, yielding a multivariate time series which is then analyzed using empirical orthogonal functional analysis. This methodology allows for easy identification and observation of seizures that are otherwise only identifiable through expert analysis of the raw EEG.

Submitted 30 May, 2013; originally announced May 2013.

Comments: Submitted to NOLTA 2013

arXiv:1110.4102 [pdf, other]

doi 10.1063/1.3675621

Using time-delayed mutual information to discover and interpret temporal correlation structure in complex populations

Authors: D. J. Albers, George Hripcsak

Abstract: This paper addresses how to calculate and interpret the time-delayed mutual information for a complex, diversely and sparsely measured, possibly non-stationary population of time-series of unknown composition and origin. The primary vehicle used for this analysis is a comparison between the time-delayed mutual information averaged over the population and the time-delayed mutual information of an aggregated population (here aggregation implies the population is conjoined before any statistical estimates are implemented). Through the use of information theoretic tools, a sequence of practically implementable calculations is detailed that allows for the average and aggregate time-delayed mutual information to be interpreted. Moreover, these calculations can also be used to understand the degree of homo- or heterogeneity present in the population. To demonstrate that the proposed methods can be used in nearly any situation, the methods are applied and demonstrated on the time series of glucose measurements from two different subpopulations of individuals from the Columbia University Medical Center electronic health record repository, revealing a picture of the composition of the population as well as physiological features.

Submitted 18 October, 2011; originally announced October 2011.

arXiv:1110.3317 [pdf, ps, other]

doi 10.1371/journal.pone.0048058

Population physiology: leveraging population scale (EHR) data to understand human endocrine dynamics

Authors: DJ Albers, George Hripcsak, Michael Schmidt

Abstract: Studying physiology over a broad population for long periods of time is difficult primarily because collecting human physiologic data is intrusive, dangerous, and expensive. Electronic health record (EHR) data promise to support the development and testing of mechanistic physiologic models on diverse populations, but limitations in the data have thus far thwarted such use. For instance, using uncontrolled population-scale EHR data to verify the outcome of time dependent behavior of mechanistic, constructive models can be difficult because: (i) aggregation of the population can obscure or generate a signal, (ii) there is often no control population, and (iii) diversity in how the population is measured can make the data difficult to fit into conventional analysis techniques. This paper shows that it is possible to use EHR data to test a physiological model for a population and over long time scales. Specifically, a methodology is developed and demonstrated for testing a mechanistic, time-dependent, physiological model of serum glucose dynamics with uncontrolled, population-scale, physiological patient data extracted from an EHR repository. It is shown that there is no observable daily variation in the normalized mean glucose for any EHR subpopulation. In contrast, a derived value, daily variation in nonlinear correlation quantified by the time-delayed mutual information (TDMI), did reveal the intuitively expected diurnal variation in glucose levels amongst a wild population of humans.
Moreover, in a population of intravenously fed patients, there was no observable TDMI-based diurnal signal. These TDMI-based signals, via a glucose-insulin model, were then connected with human feeding patterns. In particular, a constructive physiological model was shown to correctly predict the difference between the general uncontrolled population and a subpopulation whose feeding was controlled.

Submitted 14 October, 2011; originally announced October 2011.

arXiv:1110.1615 [pdf, ps, other]

doi 10.1016/j.chaos.2012.03.003

Estimation of time-delayed mutual information and bias for irregularly and sparsely sampled time-series

Authors: DJ Albers, George Hripcsak

Abstract: A method to estimate the time-dependent correlation via an empirical bias estimate of the time-delayed mutual information for a time-series is proposed. In particular, the bias of the time-delayed mutual information is shown to often be equivalent to the mutual information between two distributions of points from the same system separated by infinite time. Thus intuitively, estimation of the bias is reduced to estimation of the mutual information between distributions of data points separated by large time intervals. The proposed bias estimation techniques are shown to work for Lorenz equations data and glucose time series data of three patients from the Columbia University Medical Center database.

Submitted 7 October, 2011; originally announced October 2011.
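
Illustrative note (not from the paper): the abstract above reduces bias estimation to computing the mutual information between points separated by large time intervals. The Python sketch below shows one way to read that idea. It makes several assumptions of its own: the series is regularly sampled (the paper addresses irregular, sparse sampling), the estimator is a simple histogram plug-in estimator, and the function names, bin count, and AR(1) demo signal are all hypothetical choices for illustration.

```python
import numpy as np


def mutual_information(x, y, bins=16):
    """Histogram (plug-in) estimate of mutual information, in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()              # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)    # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)    # marginal of y, shape (1, bins)
    mask = pxy > 0                         # skip empty cells (0 * log 0 = 0)
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))


def tdmi(series, lag, bins=16):
    """Time-delayed mutual information between x(t) and x(t + lag), lag >= 1."""
    return mutual_information(series[:-lag], series[lag:], bins=bins)


def tdmi_bias(series, large_lags, bins=16):
    """Empirical bias: mean TDMI over lags assumed long enough to decorrelate."""
    return float(np.mean([tdmi(series, lag, bins) for lag in large_lags]))


def demo_series(n=5000, phi=0.9, seed=0):
    """Hypothetical test signal: an AR(1) process whose correlation decays with lag."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=n)
    x = np.empty(n)
    x[0] = noise[0]
    for t in range(1, n):
        x[t] = phi * x[t - 1] + noise[t]
    return x


series = demo_series()
short_lag_tdmi = tdmi(series, lag=1)
bias = tdmi_bias(series, large_lags=range(500, 520))
corrected = short_lag_tdmi - bias  # bias-corrected estimate at lag 1
```

For a signal whose correlations decay, the large-lag estimate should be small but nonzero, reflecting the finite-sample bias of the estimator rather than real dependence. Subtracting it from a short-lag estimate gives a rough bias-corrected value under the assumptions stated above.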

Showing 1–22 of 22 results for author: Hripcsak, G