Nonparametric Bayesian two-level clustering for subject-level single-cell expression data. (English) Zbl 07601219

Summary: The advent of single-cell sequencing opens new avenues for personalized treatment. In this study, we address a two-level clustering problem of simultaneous subject subgroup discovery (subject level) and cell type detection (cell level) for single-cell expression data from multiple subjects. Current statistical approaches either cluster cells without considering the subject heterogeneity, or group subjects without using the single-cell information. To bridge the gap between cell clustering and subject grouping, we develop a nonparametric Bayesian model, Subject and Cell clustering for Single-Cell expression data (SCSC) model, to achieve subject and cell grouping simultaneously. The SCSC model does not need to prespecify the subject subgroup number or the cell type number. It automatically induces subject subgroup structures and matches cell types across subjects. Moreover, it directly models the single-cell raw count data by deliberately considering the data’s dropouts, library sizes, and over-dispersion. A blocked Gibbs sampler is proposed for the posterior inference. Simulation studies and an application to a multi-subject induced pluripotent stem cell single-cell RNA sequencing data set validate the ability of the SCSC model to simultaneously cluster subjects and cells.

MSC:

62-XX Statistics

References:

[1] Beraha, M., Guglielmi, A. and Quintana, F. A. (2021). The semi-hierarchical Dirichlet Process and its application to clustering homogeneous distributions. Bayesian Analysis 16, 1187-1219. · Zbl 07808151
[2] Buettner, F., Natarajan, K. N., Casale, F. P., Proserpio, V., Scialdone, A., Theis, F. J. et al. (2015). Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nature Biotechnology 33, 155-160.
[3] Busch, S. E., Hanke, M. L., Kargl, J., Metz, H. E., Macpherson, D. and Houghton, A. M. (2016). Lung cancer subtypes generate unique immune responses. Journal of Immunology 197, 4493-4503.
[4] Butler, A., Hoffman, P., Smibert, P., Papalexi, E. and Satija, R. (2018). Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology 36, 411-420.
[5] Camerlenghi, F., Dumitrascu, B., Ferrari, F., Engelhardt, B. E. and Favaro, S. (2020). Nonparametric Bayesian multi-armed bandits for single cell experiment design. The Annals of Applied Statistics 14, 2003-2019. · Zbl 1498.62195
[6] Camerlenghi, F., Dunson, D. B., Lijoi, A., Prünster, I. and Rodríguez, A. (2019). Latent nested nonparametric priors (with discussion). Bayesian Analysis 14, 1303-1356. · Zbl 1436.62108
[7] Cheng, Y. and Church, G. M. (2000). Biclustering of expression data. In ISMB 8, 93-103.
[8] Denti, F., Camerlenghi, F., Guindani, M. and Mira, A. (2021). A common atom model for the Bayesian nonparametric analysis of nested data. Journal of the American Statistical Association, 1-12.
[9] Edgar, R., Domrachev, M. and Lash, A. E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research 30, 207-210.
[10] Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics 1, 209-230. · Zbl 0255.62037
[11] Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification 2, 193-218.
[12] Huo, Z., Ding, Y., Liu, S., Oesterreich, S. and Tseng, G. (2016). Meta-analytic framework for sparse K-means to identify disease subtypes in multiple transcriptomic studies. Journal of the American Statistical Association 111, 27-42.
[13] Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association 96, 161-173. · Zbl 1014.62006
[14] James, L. (2008). Discussion of nested Dirichlet process paper by Rodriguez, Dunson and Gelfand. Journal of the American Statistical Association 483, 1131-1154. · Zbl 1205.62062
[15] Kim, K., Zhao, R., Doi, A., Ng, K., Unternaehrer, J., Cahan, P. et al. (2011). Donor cell type can influence the epigenome and differentiation potential of human induced pluripotent stem cells. Nature Biotechnology 29, 1117-1119.
[16] Kiselev, V. Y., Kirschner, K., Schaub, M. T., Andrews, T., Yiu, A., Chandra, T. et al. (2017). SC3: consensus clustering of single-cell RNA-seq data. Nature Methods 14, 483-486.
[17] Konopka, T. (2019). UMAP: uniform manifold approximation and projection. R Package (Version 0.2.1.0).
[18] Liu, Y., Warren, J. L. and Zhao, H. (2019). A hierarchical Bayesian model for single-cell clustering using RNA-sequencing data. The Annals of Applied Statistics 13, 1733-1752. · Zbl 1433.62311
[19] Lo, A. Y. (1984). On a class of Bayesian nonparametric estimates: I. density estimates. The Annals of Statistics 12, 351-357. · Zbl 0557.62036
[20] Luo, X. and Wei, Y. (2019). Batch effects correction with unknown subtypes. Journal of the American Statistical Association 114, 581-594. · Zbl 1420.62458
[21] MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Edited by L. M. Le Cam and J. Neyman), 281-297. University of California Press, Oakland. · Zbl 0214.46201
[22] Makki, J. (2015). Diversity of breast carcinoma: Histological subtypes and clinical relevance. Clinical Medicine Insights Pathology 8, 23-31.
[23] Olaitan, P. B., Odesina, V., Ademola, S. A., Fadiora, S. O., Oluwatosin, O. M. and Reichenberger, E. J. (2014). Recruitment of Yoruba families from Nigeria for genetic research: Experience from a multisite keloid study. BMC Medical Ethics 15, 65.
[24] Paisley, J., Wang, C., Blei, D. M. and Jordan, M. I. (2015). Nested hierarchical Dirichlet processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 256-270.
[25] Pan, W. and Shen, X. (2007). Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research 8, 1145-1164. · Zbl 1222.68279
[26] Pitman, J. (1996). Some developments of the Blackwell-MacQueen urn scheme. Lecture Notes-Monograph Series 30, 245-267.
[27] Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability 25, 855-900. · Zbl 0880.60076
[28] Prabhakaran, S., Azizi, E., Carr, A. and Pe’er, D. (2016). Dirichlet process mixture model for correcting technical variation in single-cell gene expression data. In International Conference on Machine Learning (Edited by M. F. Balcan and K. Q. Weinberger) 48, 1070-1079.
[29] Rodriguez, A., Dunson, D. B. and Gelfand, A. E. (2008). The nested Dirichlet process. Journal of the American Statistical Association 103, 1131-1154. · Zbl 1205.62062
[30] Sarkar, A. K., Tung, P.-Y., Blischak, J. D., Burnett, J. E., Li, Y. I., Stephens, M. et al. (2019). Discovery and characterization of variance QTLs in human induced pluripotent stem cells. PLoS Genetics 15, e1008045.
[31] Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica 4, 639-650. · Zbl 0823.62007
[32] Song, F., Chan, G. M. A. and Wei, Y. (2020). Flexible experimental designs for valid single-cell RNA-sequencing experiments allowing batch effects correction. Nature Communications 11, 3274.
[33] Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck, W. M. et al. (2019). Comprehensive integration of single cell data. Cell 177, 1888-1902.
[34] Sun, Z., Wang, T., Deng, K., Wang, X.-F., Lafyatis, R., Ding, Y. et al. (2017). DIMM-SC: a Dirichlet mixture model for clustering droplet-based single cell transcriptomic data. Bioinformatics 34, 139-146.
[35] Tadesse, M. G., Sha, N. and Vannucci, M. (2005). Bayesian variable selection in clustering high-dimensional data. Journal of the American Statistical Association 100, 602-617. · Zbl 1117.62433
[36] Tanner, M. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association 82, 528-540. · Zbl 0619.62029
[37] Teh, Y. W., Jordan, M. I., Beal, M. J. and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association 101, 1566-1581. · Zbl 1171.62349
[38] Tekumalla, L. S., Agrawal, P. and Bhattacharya, I. (2015). Nested hierarchical Dirichlet processes for multi-level non-parametric admixture modeling. arXiv preprint arXiv:1508.06446.
[39] Turner, H., Bailey, T. C. and Krzanowski, W. J. (2005). Improved biclustering of microarray data demonstrated through systematic performance tests. Computational Statistics & Data Analysis 48, 235-254. · Zbl 1429.62267
[40] Wang, S. and Zhu, J. (2008). Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics 64, 440-448. · Zbl 1137.62041
[41] Witten, D. M. and Tibshirani, R. (2010). A framework for feature selection in clustering. Journal of the American Statistical Association 105, 713-726. · Zbl 1392.62194
[42] Žurauskienė, J. and Yau, C. (2016). pcaReduce: Hierarchical clustering of single cell transcriptional profiles. BMC Bioinformatics 17, 140.
E-mail: xiangyuluo@ruc.edu.cn (Received August 2020; accepted February 2021)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or perfect matching.