
Improving the accuracy and internal consistency of regression-based clustering of high-dimensional datasets. (English) Zbl 1538.62196

Summary: Component-wise Sparse Mixture Regression (CSMR) is a recently proposed regression-based clustering method that shows promise in detecting heterogeneous relationships between molecular markers and a continuous phenotype of interest. However, CSMR can yield inconsistent results when applied to high-dimensional molecular data, which we hypothesize is in part due to inherent limitations of the feature selection method used in the CSMR algorithm. To assess this hypothesis, we explored whether substituting different regularized regression methods (i.e. Lasso, Elastic Net, Smoothly Clipped Absolute Deviation (SCAD), Minimax Concave Penalty (MCP), and Adaptive-Lasso) within the CSMR framework can improve the clustering accuracy and internal consistency (IC) of CSMR in high-dimensional settings. We calculated the true positive rate (TPR), true negative rate (TNR), IC and clustering accuracy of our proposed modifications, benchmarked against the existing CSMR algorithm, using an extensive set of simulation studies and real biological datasets. Our results demonstrated that substituting Adaptive-Lasso for the existing feature selection method used in CSMR led to significantly improved IC and clustering accuracy, with strong performance even in high-dimensional scenarios. In conclusion, our modifications of the CSMR method resulted in improved clustering performance and may thus serve as viable alternatives for the regression-based clustering of high-dimensional datasets.
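
The TPR and TNR named in the summary can be illustrated with a minimal sketch. It assumes, which the summary does not state explicitly, that both rates score feature selection in a simulation where the truly nonzero coefficients are known. The function name `selection_rates` and the toy indices are hypothetical and are not taken from the paper or from RobMixReg.

```python
def selection_rates(true_support, selected, n_features):
    """Return (TPR, TNR) for one feature-selection result.

    Assumption: features are indexed 0..n_features-1, `true_support` holds
    the indices with truly nonzero coefficients, and `selected` holds the
    indices kept by the regularized regression step.
    """
    true_support = set(true_support)
    selected = set(selected)
    true_null = set(range(n_features)) - true_support

    true_positives = len(selected & true_support)   # relevant and selected
    true_negatives = len(true_null - selected)      # irrelevant and dropped

    # Rates are undefined when the corresponding class is empty.
    tpr = true_positives / len(true_support) if true_support else float("nan")
    tnr = true_negatives / len(true_null) if true_null else float("nan")
    return tpr, tnr


# Toy example (hypothetical numbers): 10 features, features 0-2 are truly
# relevant, and the selection step kept features 0, 1 and 7.
tpr, tnr = selection_rates([0, 1, 2], [0, 1, 7], 10)
print(f"TPR = {tpr:.3f}, TNR = {tnr:.3f}")
```

In a simulation study of this kind, such rates would typically be averaged over replicates and, for a mixture model, computed separately for each component.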

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62J07 Ridge regression; shrinkage estimators (Lasso)
62P10 Applications of statistics to biology and medical sciences; meta analysis

Software:

RobMixReg; flexmix
