×

A review on modal clustering. (English) Zbl 07763532

Summary: In spite of the current availability of numerous methods of cluster analysis, evaluating a clustering configuration is questionable without the definition of a true population structure, representing the ideal partition that clustering methods should try to approximate. A precise statistical notion of cluster, unshared by most of the mainstream methods, is provided by the density-based approach, which assumes that clusters are associated to some specific features of the probability distribution underlying the data. The non-parametric formulation of this approach, known as modal clustering, draws a correspondence between the groups and the modes of the density function. An appealing implication is that the ill-specified task of cluster detection can be regarded to as a more circumscribed problem of estimation, and the number of clusters is also conceptually well defined. In this work, modal clustering is critically reviewed from both conceptual and operational standpoints. The main directions of current research are outlined as well as some challenges and directions of further research.
{© 2015 The Authors. International Statistical Review © 2015 International Statistical Institute}

MSC:

62-XX Statistics
Full Text: DOI

References:

[1] AnkerstM., BreunigM. M., KriegelH. & SanderJ. (1999). Optics: Ordering points to identify the clustering structure. In ACM Sigmod Record, Vol. 28: ACM; 49-60.
[2] AzzaliniA. & MenardiG. (2014). Clustering via nonparametric density estimation: The R package pdfCluster. J. Stat. Softw., 57(11), 1-26.
[3] AzzaliniA. & TorelliN. (2007). Clustering via nonparametric density estimation. Stat. Comput., 17, 71-80.
[4] BaílloA. (2003). Total error in a plug‐in estimator of level sets. Statist. Probab. Lett., 65(4), 411-417. · Zbl 1116.62338
[5] BaílloA., CuevasA. & JustelA. (2000). Set estimation and nonparametric detection. Canad. J. Statist., 28(4), 765-782. · Zbl 1057.62026
[6] BaílloA., Cuesta‐AlbertosJ. A. & CuevasA. (2001). Convergence rates in nonparametric estimation of level sets. Statist. Probab. Lett., 53(1), 27-35. · Zbl 0980.62022
[7] Ben‐HurA., HornD., SiegelmannH. T. & VapnikV. (2001). Support vector clustering. J. Mach. Learn. Res., 2, 125-137. · Zbl 1002.68598
[8] BowmanA. W. (1984). An alternative method of cross‐validation for the smoothing of density estimates. Biometrika, 71(2), 353-360.
[9] BurmanP. & PolonikW. (2009). Multivariate mode hunting: Data analytic tools with measures of significance. J. Multivar. Anal., 100(6), 1198-1218. · Zbl 1159.62032
[10] CarmichaelJ. W., GeorgeJ. A. & JuliusR. S. (1968). Finding natural clusters. Syst. Zool., 17(2), 144-150.
[11] Carreira‐PerpinanM. A. (2007). Gaussian mean‐shift is an EM algorithm. IEEE Trans. Pattern Anal. Mach. Intell., 29(5), 767-776.
[12] Carreira‐PerpinanM. A. (2008). Generalised blurring mean‐shift algorithms for nonparametric clustering. In CVPR: 2013 IEEE Conference on Computer Vision and Pattern Recognition; 1-8.
[13] ChacónJ. & DuongT. (2013). Data‐driven density derivative estimation, with applications to nonparametric clustering and bump hunting. Electron. J. Stat., 7, 499-532. · Zbl 1337.62067
[14] ChacónJ. E. (2014). A population background for nonparametric density‐based clustering. ArXiv e‐prints: 1408.1381.
[15] ChacónJ. E. & DuongT. (2010). Multivariate plug‐in bandwidth selection with unconstrained pilot bandwidth matrices. Test, 19(2), 375-398. · Zbl 1203.62054
[16] ChacónJ. E & MonfortP. (2013). A comparison of bandwidth selectors for mean shift clustering. arXiv preprint:1310.7855.
[17] ChakravarthyS. V. & GhoshJ. (1996). Scale‐based clustering using the radial basis function network.. IEEE Trans. Neural Netw., 7(5), 1250-1261.
[18] ChaudhuriK. & DasguptaS. (2010). Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, pp. 343-351. USA: Curran Associates.
[19] ChaudhuriK., DasguptaS., KpotufeS. & vonLuxburgU. (2014). Consistent procedures for cluster tree estimation and pruning. arXiv preprint math.PR/0000000.
[20] ChengY. (1995). Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell., 17(8), 790-799.
[21] ChengY. & RayS. (2014a). Multivariate modality inference using Gaussian kernel. Open J. Statist., 4(05), 419-434.
[22] ChengY. & RayS. (2014b). Parallel and hierarchical mode association clustering with an r package modalclust. Open J. Statist., 4(10), 826-836.
[23] ComaniciuD. (2003). An algorithm for data‐driven bandwidth selection. IEEE Trans. Pattern Anal. Mach. Intell., 25(2), 281-288.
[24] ComaniciuD. & MeerP. (2002). Mean shift: a robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell., 24(5), 603-619.
[25] Contreras‐ReyesJ. E. & AzzaliniA. (2012). On the spatial correlation between areas of high coseismic slip and aftershock clusters of the Maule earthquake Mw=8.8. ArXiv e‐prints: 1208.1517.
[26] CormenT. H., LeisersonC. E., RivestR. L. & SteinC. (2001). Introduction to Algorithms, Vol. 2 Cambridge: MIT Press. · Zbl 1047.68161
[27] CuevasA., FebreroM. & FraimanR. (2000). Estimating the number of clusters. Canad. J. Statist., 28, 367-382. · Zbl 0981.62054
[28] CuevasA., FebreroM. & FraimanR. (2001). Cluster analysis: a further approach based on density estimation. Comput. Stat. Data Anal., 36, 441-459. · Zbl 1053.62537
[29] De BinR. & RissoD. (2011). A novel approach to the clustering of microarray data via nonparametric density estimation. BMC Bioinformatics, 12(49), 1-8.
[30] DuongT. & HazeltonM. (2003). Plug‐in bandwidth matrices for bivariate kernel density estimation. J. Nonparametr. Stat., 15(1), 17-30. · Zbl 1019.62032
[31] DuongT. & HazeltonM. L. (2005). Cross‐validation bandwidth matrices for multivariate kernel density estimation. Scand. J. Stat., 32(3), 485-506. · Zbl 1089.62035
[32] EinbeckJ. (2011). Bandwidth selection for mean‐shift based unsupervised learning techniques: a unified approach via self‐coverage. J. Pattern. Recogn. Res., 6(2), 175-192.
[33] EinbeckJ. & EversL. (2013). Lpcm: Local principal curve methods. Available at http://CRAN.R-project.org/package=LPCM. R package version 0.44‐8.
[34] ErtozL., SteinbachM. & KumarV. (2002). A new shared nearest neighbor clustering algorithm and its applications. In Workshop on Clustering High Dimensional Data and its Applications at 2nd Siam International Conference on Data Mining, pp. 105-115. Arlington, USA.
[35] EsterM., KriegelH. P., SanderJ. & XuX. (1996). A density based algorithm for discovering clusters in large spatial databases withe noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD‐96), pp. 226-231. Portland: AAAI Press.
[36] FraleyC. & RafteryA. (1998). How many cluster? Which clustering method? Answers via model‐based cluster analysis. Comput. J., 41, 578-588. · Zbl 0920.68038
[37] FraleyC. & RafteryA. E. (2002). Model‐based clustering, discriminant analysis and density estimation. J. Am. Stat. Assoc., 97, 611-631. · Zbl 1073.62545
[38] FukunagaK. & HostetlerL. D. (1975). The estimation of the gradient of a density function, with application in pattern recognition. IEEE Trans. Inform. Theory, 21(1), 32-40. · Zbl 0297.62025
[39] GenoveseC., Perone‐PacificoM., VerdinelliI. & WassermanL. (2013). Nonparametric inference for density modes. ArXiv e‐prints.
[40] GiordanM. & DianaG. (2011). A clustering method for categorical ordinal data. Commun. Stat.‐Theor. Methods, 40(7), 1315-1334. · Zbl 1220.62082
[41] GuhaS., RastogiR. & ShimK. (1998). Cure: an efficient clustering algorithm for large databases. In ACM SIGMOD Record, Vol. 27, pp. 73-84: Seattle: ACM.
[42] HallP., MinnotteM. C. & ZhangC. (2004). Bump hunting with non‐Gaussian kernels. Ann. Stat., 32(5), 2124-2141. · Zbl 1056.62049
[43] HartiganJ. A. (1975). Clustering Algorithms. New York: J. Wiley & Sons. · Zbl 0372.62040
[44] HartiganJ. A. (1981). Consistency of single linkage for high‐density clusters. J. Am. Stat. Assoc., 76(374), 388-394. · Zbl 0468.62053
[45] HennigC. (2010). Methods for merging Gaussian mixture components. Adv. Data Anal. Classif., 4(1), 3-34. · Zbl 1306.62141
[46] HennigC. (2014). fpc: Flexible procedures for clustering. Available at http://CRAN.R-project.org/package=fpc. R package version 2.1‐7.
[47] HinneburgA. & KeimD. A. (1998). An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the 4th International Conference of Knowledge Discovery and Data Mining (KDD’,98), Vol. 98,. pp. 58-65. New York.
[48] HubertL. & ArabieP. (1985). Comparing partitions. J. Classif., 2, 193-218.
[49] HyndmanR. J. (2013). hdrcde. Highest Density Regions and Conditional Density Estimation. R Package. Available at http://cran.r-project.org/package=pdfCluster.
[50] JangW. (2006). Nonparametric density estimation and clustering in astronomical sky surveys. Comput. Stat. Data Anal., 50(3), 760-774. · Zbl 1432.62377
[51] KlemeläJ. (2004). Visualization of multivariate density estimates with level set trees. J. Comput. Graph. Stat., 13(3), 599-620.
[52] KlemeläJ. (2009). Smoothing of Multivariate Data: Density Estimation and Visualization. Hoboken: Wiley. · Zbl 1218.62027
[53] KoontzW. & FukunagaK. (1972). A nonparametric valley‐seeking technique for cluster analysis. IEEE Trans. Comput., 100(2), 171-178. · Zbl 0229.68036
[54] KriegelH., KrögerP., SanderJ. & ZimekA. (2011). Density‐based clustering. Wiley Interdiscip. Rev. Data Min. Knowl. Discov., 1(3), 231-240.
[55] LebourgeoisF., DriraF., GacebD. & DuongJ. (2013). Fast integral meanshift: Application to color seg‐ mentation of document images. In 12th International Conference on Document Analysis and Recognition (ICDAR), pp. 52-56. Washinton, DC: IEEE.
[56] LeeH. & LiJ. (2012). Variable selection for clustering by separability based on ridgelines. J. Comput. Graph. Stat., 21(2), 315-337.
[57] LeungY., ZhangJ. S. & XuZ. (2000). Clustering by scale‐space filtering. IEEE Trans. Pattern Anal. Mach. Intell., 22(12), 1396-1410.
[58] LiJ., RayS. & LindsayB. G. (2007). A nonparametric statistical approach to clustering via mode identification. J. Mach. Learn. Res., 8, 1687-1723. · Zbl 1222.62076
[59] MaierM., HeinbM. & vonLuxburgU. (2009). Optimal construction of ‐nearest‐neighbor graphs for identifying noisy clusters. Theor. Comput. Sci., 410(19), 1749-1764. · Zbl 1167.68045
[60] MasonD. M. & PolonikW. (2009). Asymptotic normality of plug‐in level set estimates. Ann. Appl. Probab., 19(3), 1108-1142. · Zbl 1180.62048
[61] MatsumotoY. (2002). An Introduction to Morse Theory. Providence: American Mathematical Society. · Zbl 0990.57001
[62] McLachlanG. J. & BasfordK. E. (1988). Mixture Models, Vol. 74. New York: Marcel Dekker.
[63] MenardiG. & AzzaliniA. (2014). An advancement in clustering via nonparametric density estimation. Stat. Comput., 24(5), 753-767. · Zbl 1322.62175
[64] MenardiG. & TorelliN. (2013). Reducing data dimension for cluster detection. J. Stat. Comput. Sim., 83(11), 2047-2063. · Zbl 1453.62547
[65] NugentR. & StuetzleW. (2010). Clustering with confidence: A low‐dimensional binning approach. In Classification as a Tool for Research, Eds. H.Jocarek‐Junge (ed.) & C.Weihs (ed.), pp. 117-125. Berlin: Springer.
[66] OoiH. ,. Density visualization and mode hunting using trees. J. Comput. Graph. Stat., 11(2), 328-347.
[67] ParisS. & DurandF. (2007). A topological approach to hierarchical segmentation using meanshift. In Proceedings of the IEEE Conference Computer Vision and Pattern Recognition CVPR ’07, pp. 1-8. Minneapolis.
[68] RayS. & LindsayB. G. (2005). The topography of multivariate normal mixtures. Ann. Statist., 33, 2042-2065. · Zbl 1086.62066
[69] RigolletP. & VertR. (2009). Optimal rates for plug‐in estimators of density level sets. Bernoulli, 15(4), 1154-1178. · Zbl 1200.62034
[70] RinaldoA. & WassermanL. (2010). Generalized density clustering. Ann. Statist., 38(5), 2678-2722. · Zbl 1200.62066
[71] SainS. R., BaggerlyK. A. & ScottD. W. (1994). Cross‐validation of multivariate densities. J. Am. Stat. Assoc., 89(427), 807-817. · Zbl 0805.62059
[72] SamworthR. J. & WandM. P. (2010). Asymptotics and optimal bandwidth selection for highest density region estimation. Ann. Statist., 38(3), 1767-1792. · Zbl 1189.62061
[73] ScottD. & SainS. (2005). Multidimensional density estimation. Handbook Statist., 24, 229-261.
[74] ScottD. W. (1992). Multivariate Density Estimation: Theory, Practice and VisualizationNew York: Wiley. · Zbl 0850.62006
[75] StuetzleW. (2003). Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J. Classif., 20, 25-47. · Zbl 1055.62075
[76] StuetzleW. & NugentR. (2010). A generalized single linkage method for estimating the cluster tree of a density. J. Comput. Graph. Stat., 19(2), 397-418.
[77] TaoW., JinH. & ZhangY. (2007). Color image segmentation based on mean shift and normalized cuts. IEEE Trans Syst Man Cybern, Part B: Cybern., 37(5), 1382-1389.
[78] TsybakovA. B. (1997). On nonparametric estimation of density level sets. Ann. Statist., 25(3), 948-969. · Zbl 0881.62039
[79] Von LuxburgU. (2007). A tutorial on spectral clustering. Stat. Comp., 17(4), 395-416.
[80] WaltherG. (1997). Granulometric smoothing. Ann. Statist., 25(6), 2273-2299, 12. · Zbl 0919.62026
[81] WandM. P. & JonesM. C. (1994). Multivariate plug‐in bandwidth selection. Comput. Stat., 9(2), 97-116. · Zbl 0937.62055
[82] WishartD. (1969). Mode analysis: A generalization of nearest neighbor which reduces chaining effects. In Numerical taxonomy, Ed. A. J.Cole (ed.), pp. 282-308. London: Academic Press.
[83] WolfeJ. H. (1970). Pattern clustering by multivariate mixture analysis. Multivar. Behav. Res., 5(3), 329-350.
[84] WongA. M. & LaneT. (1983). The k‐th nearest neighbour clustering procedure. J. R. Stat. Soc. Series B, 45, 362-368. · Zbl 0535.62055
[85] YangC., DuraiswamiR., DeMenthonD. & DavisL. (2003a). Mean‐shift analysis using quasinewton methods. In Proceedings of the International Conference on Image Processing, (ICIP), Vol. 2. pp. II‐447. Nice: IEEE.
[86] YangC., DuraiswamiR., GumerovN. A. & DavisL. (2003b). Improved fast gauss transform and efficient kernel density estimation. In Proceedings of the ninth IEEE International Conference on Computer Vision: IEEE, pp. 664-671.
[87] YaoW. & LindsayB. G. (2009). Bayesian mixture labeling by highest posterior density. J. Am. Statist. Assoc., 104(486), 758-767. · Zbl 1388.62007
[88] YaoW., LindsayB. G. & LiR. (2012). Local modal regression. J. Nonparametr. Stat., 24(3), 647-663. · Zbl 1254.62059
[89] YuanX., HuB. & HeR. (2012). Agglomerative mean‐shift clustering. IEEE Trans. Knowl. Data Engin., 24(2), 209-219.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.