A Bayesian mixture model for clustering and selection of feature occurrence rates under mean constraints. (English) Zbl 07260723

Summary: In this paper, we consider the problem of modeling a matrix of count data, where multiple features are observed as counts over a number of samples. Due to the nature of the data-generating mechanism, such data are often characterized by a high number of zeros and overdispersion. To account for the skewness and heterogeneity of the data, some type of normalization and regularization is necessary for conducting inference on the occurrences of features across samples. We propose a zero-inflated Poisson mixture modeling framework that incorporates a model-based normalization through prior distributions with mean constraints, as well as a feature selection mechanism, which allows us to identify a parsimonious set of discriminatory features and simultaneously cluster the samples into homogeneous groups. By means of a simulation study and an application to a bag-of-words benchmark data set, where the features are the frequencies of occurrence of each word, we show how our approach improves clustering accuracy relative to more standard approaches for the analysis of count data.
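
As a rough illustration of why a zero-inflated Poisson handles both excess zeros and overdispersion, the following minimal sketch implements the textbook zero-inflated Poisson distribution. The function names `zip_pmf` and `zip_mean_var` and the parameters `lam` (Poisson rate) and `pi` (probability of a structural zero) are assumptions made for this example; the summary does not state the paper's own parametrization, mixture structure, or priors, so this is not the authors' model.

```python
import math


def zip_pmf(k, lam, pi):
    """P(Y = k) for a textbook zero-inflated Poisson.

    lam: Poisson rate (> 0); pi: probability of a structural zero in [0, 1].
    Both names are illustrative assumptions, not taken from the paper.
    """
    if k < 0:
        return 0.0
    # Poisson probability computed on the log scale for numerical stability.
    poisson = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    # Extra mass pi is added only at zero; the Poisson part is scaled by 1 - pi.
    return (pi if k == 0 else 0.0) + (1.0 - pi) * poisson


def zip_mean_var(lam, pi):
    """Mean and variance of the zero-inflated Poisson."""
    mean = (1.0 - pi) * lam
    var = (1.0 - pi) * lam * (1.0 + pi * lam)
    return mean, var


if __name__ == "__main__":
    mean, var = zip_mean_var(lam=2.0, pi=0.3)
    # For pi > 0 the variance exceeds the mean, i.e. the counts are overdispersed.
    print(mean, var, zip_pmf(0, 2.0, 0.3))
```

Whenever `pi` is positive, the variance exceeds the mean and the probability of a zero exceeds that of a plain Poisson with the same rate, which are the two data features the summary highlights.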

MSC:

62-XX Statistics
68-XX Computer science
