
Quantile-based clustering. (English) Zbl 1434.62121

Summary: A new cluster analysis method, \(K\)-quantiles clustering, is introduced. \(K\)-quantiles clustering can be computed by a simple greedy algorithm in the style of the classical Lloyd’s algorithm for \(K\)-means. It can be applied to large and high-dimensional datasets. It allows for within-cluster skewness and internal variable scaling based on within-cluster variation. Different versions allow for different levels of parsimony and computational efficiency. Although \(K\)-quantiles clustering is conceived as nonparametric, it can be connected to a fixed partition model of generalized asymmetric Laplace distributions. The consistency of \(K\)-quantiles clustering is proved, and it is shown that \(K\)-quantiles clusters correspond to well-separated mixture components in a nonparametric mixture. In a simulation study, \(K\)-quantiles clustering compares favourably with a number of popular clustering methods. A high-dimensional microarray dataset is clustered by \(K\)-quantiles clustering.
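The greedy computation mentioned above can be sketched in a few lines. The following Python fragment is a minimal illustration, not the authors' implementation: it assumes a single fixed quantile level theta shared by all variables and clusters and uses the variable-wise pinball (check) loss as the within-cluster distance, whereas the paper's versions additionally estimate theta and within-cluster scaling; the names pinball and k_quantiles are purely illustrative.

import numpy as np

def pinball(u, theta):
    # Asymmetric absolute ("check") loss: theta*u for u >= 0, (theta-1)*u for u < 0.
    return np.where(u >= 0, theta * u, (theta - 1.0) * u)

def k_quantiles(X, K, theta=0.5, n_iter=50, seed=0):
    # Lloyd-style alternation: assign points to clusters, then refit the
    # variable-wise theta-quantiles of each cluster.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    labels = rng.integers(K, size=n)  # random initial partition
    Q = np.vstack([np.quantile(X[labels == k], theta, axis=0) for k in range(K)])
    for _ in range(n_iter):
        # Assignment step: sum over variables of pinball distances to each
        # cluster's quantile "centre".
        D = np.stack([pinball(X - Q[k], theta).sum(axis=1) for k in range(K)], axis=1)
        new_labels = D.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: recompute the variable-wise quantiles of non-empty clusters.
        for k in range(K):
            if np.any(labels == k):
                Q[k] = np.quantile(X[labels == k], theta, axis=0)
    return labels, Q

With theta = 0.5 the objective reduces to coordinate-wise \(K\)-medians, which serves as a quick sanity check for the sketch.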

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62G08 Nonparametric regression and quantile regression

References:

[1] Banerjee, A., I. S. Dhillon, J. Ghosh, and S. Sra (2005). Clustering on the unit hypersphere using von Mises-Fisher distributions., Journal of Machine Learning Research 6(Sep), 1345-1382. · Zbl 1190.62116
[2] Ben-Israel, A. and C. Iyigun (2008). Probabilistic d-clustering., Journal of Classification 25(1), 5-26. · Zbl 1260.62039 · doi:10.1007/s00357-008-9002-z
[3] Bryant, P. and J. A. Williamson (1978). Asymptotic behaviour of classification maximum likelihood estimates., Biometrika 65(2), 273-281. · Zbl 0393.62011 · doi:10.1093/biomet/65.2.273
[4] Bubeck, S. and U. von Luxburg (2009). Nearest neighbor clustering: A baseline method for consistent clustering with arbitrary objective functions., Journal of Machine Learning Research 10, 657-698. · Zbl 1235.68134
[5] Clémençon, S. (2014). A statistical view of clustering performance through the theory of U-processes., Journal of Multivariate Analysis 124, 42-56. · Zbl 1278.62095 · doi:10.1016/j.jmva.2013.10.001
[6] Diaconis, P. (1988)., Group Representations in Probability and Statistics, Volume 11 of Institute of Mathematical Statistics Lecture Notes - Monograph Series. Hayward, CA: Institute of Mathematical Statistics. · Zbl 0695.60012
[7] Fligner, M. A. and J. S. Verducci (1986). Distance based ranking models., Journal of the Royal Statistical Society. Series B (Methodological) 48(3), 359-369. · Zbl 0658.62031 · doi:10.1111/j.2517-6161.1986.tb01420.x
[8] Frey, B. J. and D. Dueck (2007). Clustering by passing messages between data points., Science 315(5814), 972-976. · Zbl 1226.94027 · doi:10.1126/science.1136800
[9] Friedman, J. H. and J. J. Meulman (2004). Clustering objects on subsets of attributes (with discussion)., Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66(4), 815-849. · Zbl 1060.62064 · doi:10.1111/j.1467-9868.2004.02059.x
[10] Gnanadesikan, R., J. R. Kettenring, and S. L. Tsao (1995). Weighting and selection of variables., Journal of Classification 12, 113-136. · Zbl 0825.62540 · doi:10.1007/BF01202271
[11] Golub, T., D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, and E. Lander (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring., Science 286, 531-537.
[12] Halkidi, M., M. Vazirgiannis, and C. Hennig (2016). Method-independent indices for cluster validation and estimating the number of clusters. In C. Hennig, M. Meila, F. Murtagh, and R. Rocci (Eds.), Handbook of Cluster Analysis, Chapter 26, pp. 595-618. Chapman & Hall/CRC, Boca Raton, FL. · Zbl 1396.62136
[13] Hennig, C. (2015). What are the true clusters?, Pattern Recognition Letters 64 (Philosophical Aspects of Pattern Recognition), 53-62.
[14] Hennig, C. (2016). Clustering strategy and method selection. In C. Hennig, M. Meila, F. Murtagh, and R. Rocci (Eds.), Handbook of Cluster Analysis, Chapter 31, pp. 703-730. Chapman & Hall/CRC, Boca Raton FL. · Zbl 1396.62138
[15] Hennig, C. and C. Viroli (2016). Quantile-based classifiers., Biometrika 103(2), 435-446. · Zbl 1499.62212 · doi:10.1093/biomet/asw015
[16] Hennig, C., C. Viroli, and L. Anderlucci (2019). Quantile-based clustering, arXiv:1806.10403 (supplement included from v2). · Zbl 1434.62121
[17] Hubert, L. and P. Arabie (1985). Comparing partitions., Journal of Classification 2, 193-218. · Zbl 0587.62128
[18] Iyigun, C. (2010). Probabilistic distance clustering. In J. J. Cochran, L. A. Cox, P. Keskinocak, J. P. Kharoufeh, and J. C. Smith (Eds.), Wiley Encyclopedia of Operations Research and Management Science. John Wiley & Sons.
[19] Jain, A. K. (2010). Data clustering: 50 years beyond k-means., Pattern Recognition Letters 31(8), 651-666.
[20] Joe, H. (2006). Generating random correlation matrices based on partial correlations., Journal of Multivariate Analysis 97, 2177-2189. · Zbl 1112.62055 · doi:10.1016/j.jmva.2005.05.010
[21] Kaufman, L. and P. Rousseeuw (2005)., Finding Groups in Data: An Introduction to Cluster Analysis. Wiley. · Zbl 1345.62009
[22] Kozubowski, T. J. and K. Podgorski (2000). A multivariate and asymmetric generalization of Laplace distribution., Computational Statistics 15(4), 531-540. · Zbl 1035.60010 · doi:10.1007/PL00022717
[23] Leisch, F. (2016). Resampling methods for exploring clustering stability. In C. Hennig, M. Meila, F. Murtagh, and R. Rocci (Eds.), Handbook of Cluster Analysis, Chapter 28, pp. 637-652. Chapman & Hall/CRC, Boca Raton, FL. · Zbl 1396.62143
[24] Linder, T., G. Lugosi, and K. Zeger (1994). Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding., IEEE Transactions on Information Theory 40(6), 1728-1740. · Zbl 0826.94006 · doi:10.1109/18.340451
[25] Lloyd, S. (1982). Least squares quantization in PCM., IEEE Transactions on Information Theory 28(2), 129-137. · Zbl 0504.94015 · doi:10.1109/TIT.1982.1056489
[26] Mallows, C. L. (1957). Non-null ranking models. I., Biometrika 44(1/2), 114-130. · Zbl 0087.34001 · doi:10.1093/biomet/44.1-2.114
[27] McLachlan, G. J. and D. Peel (2000)., Finite Mixture Models. Wiley. · Zbl 0963.62061
[28] Murphy, T. B. and D. Martin (2003). Mixtures of distance-based models for ranking data., Computational Statistics & Data Analysis 41, 645-655. · Zbl 1429.62258 · doi:10.1016/S0167-9473(02)00165-2
[29] Ng, A. Y., M. I. Jordan, and Y. Weiss (2002). On spectral clustering: Analysis and an algorithm. In, Advances in Neural Information Processing Systems, pp. 849-856.
[30] Ostrovsky, R., Y. Rabani, L. J. Schulman, and C. Swamy (2006). The effectiveness of Lloyd-type methods for the \(k\)-means problem. In, Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, Berkeley. IEEE. · Zbl 1281.68229 · doi:10.1145/2395116.2395117
[31] Pollard, D. (1981). Strong consistency of \(k\)-means clustering., Ann. Statist. 9(1), 135-140. · Zbl 0451.62048 · doi:10.1214/aos/1176345339
[32] van der Vaart, A. W. (1998)., Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press. · Zbl 0910.62001
[33] von Luxburg, U., M. Belkin, and O. Bousquet (2008). Consistency of spectral clustering., The Annals of Statistics 36(2), 555-586. · Zbl 1133.62045 · doi:10.1214/009053607000000640