Abstract
Tree-based methods have become one of the most flexible, intuitive, and powerful analytic tools for exploring complex data structures. The best-documented, and arguably most popular, uses of tree-based methods are in biomedical research, where multivariate outcomes occur commonly (e.g., diastolic and systolic blood pressure, and nerve conduction measures in studies of neuropathy). Existing tree-based methods for multivariate outcomes do not appropriately account for the correlation that exists in such data. In this paper, we develop goodness-of-split measures for building multivariate regression trees for continuous multivariate outcomes. We propose two general approaches: maximizing within-node homogeneity and maximizing between-node separation. Within-node homogeneity is measured using the average Mahalanobis distance and the determinant of the variance-covariance matrix. Between-node separation is measured using the Mahalanobis distance, Euclidean distance, and standardized Euclidean distance. To enhance prediction accuracy, we extend the single multivariate regression tree to an ensemble of multivariate trees. Extensive simulations are presented to examine the properties of our goodness-of-split measures. Finally, the proposed methods are illustrated using two clinical datasets, one on neuropathy and one on pediatric cardiac surgery.
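The five measures named in the abstract can be sketched for a single candidate binary split as follows. This is a minimal illustration, not the paper's exact definitions, which are not reproduced in this section. The function names, the use of squared distances, the parent-node covariance as the common scaling matrix, and the node-size weighting are all assumptions made for the example.

```python
import numpy as np

def avg_mahalanobis(Y, cov_inv):
    """Mean squared Mahalanobis distance of the rows of Y from their own mean."""
    d = Y - Y.mean(axis=0)
    return float(np.mean(np.einsum("ij,jk,ik->i", d, cov_inv, d)))

def split_measures(Y, left_mask):
    """Illustrative goodness-of-split measures for one candidate binary split.

    Y is an (n, p) array of continuous outcomes; left_mask is a boolean
    array of length n that sends each observation to the left or right node.
    """
    left, right = Y[left_mask], Y[~left_mask]
    n_l, n_r = len(left), len(right)
    n = n_l + n_r

    # Assumption: distances are scaled by the parent-node covariance.
    cov = np.cov(Y, rowvar=False)
    cov_inv = np.linalg.inv(cov)
    diff = left.mean(axis=0) - right.mean(axis=0)

    # Assumption: child-node quantities are combined weighted by node size.
    within_mahal = (n_l * avg_mahalanobis(left, cov_inv)
                    + n_r * avg_mahalanobis(right, cov_inv)) / n
    within_det = (n_l * np.linalg.det(np.cov(left, rowvar=False))
                  + n_r * np.linalg.det(np.cov(right, rowvar=False))) / n

    return {
        # Within-node measures: smaller values mean more homogeneous nodes.
        "within_mahalanobis": within_mahal,
        "within_det": float(within_det),
        # Between-node measures: larger values mean better-separated nodes.
        "between_mahalanobis": float(diff @ cov_inv @ diff),
        "between_euclidean": float(diff @ diff),
        "between_std_euclidean": float(np.sum(diff ** 2 / np.diag(cov))),
    }

# Simulated example: two correlated outcomes whose means shift when x > 0.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
noise = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=n)
Y = noise + 3.0 * (x > 0)[:, None]

true_split = split_measures(Y, x > 0)                 # split at the true change point
random_split = split_measures(Y, rng.random(n) < 0.5)  # split unrelated to x
```

In this simulated setting, the split at the true change point should score lower on both within-node measures and higher on the between-node measures than the unrelated split. A tree-growing procedure would evaluate such a measure over all candidate splits and keep the best one.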
Data Availability
Study datasets are available from the corresponding author on reasonable request.
Acknowledgments
None.
Funding
Dr. Reynolds is supported by NIH K99DK129785. Dr. Banerjee is supported by NIH R21CA152775.
Author information
Contributions
Dr. Reynolds developed the methodological approach, performed and interpreted results from the simulation study and illustrative examples, and wrote the manuscript. Dr. Callaghan was integrally involved in interpretation of the data and critical revisions of the manuscript. Dr. Gaies was integrally involved in interpretation of the data and critical revisions of the manuscript. Dr. Banerjee developed the methodological approach, interpreted results from the simulation study and illustrative examples, and wrote the manuscript.
About this article
Cite this article
Reynolds, E.L., Callaghan, B.C., Gaies, M. et al. Regression Trees and Ensemble for Multivariate Outcomes. Sankhya B 85, 77–109 (2023). https://doi.org/10.1007/s13571-023-00301-z