×

Fully adaptive density-based clustering. (English) Zbl 1327.62382

Summary: The clusters of a distribution are often defined by the connected components of a density level set. However, this definition depends on the user-specified level. We address this issue by proposing a simple, generic algorithm, which uses an almost arbitrary level set estimator to estimate the smallest level at which there are more than one connected components. In the case where this algorithm is fed with histogram-based level set estimates, we provide a finite sample analysis, which is then used to show that the algorithm consistently estimates both the smallest level and the corresponding connected components. We further establish rates of convergence for the two estimation problems, and last but not least, we present a simple, yet adaptive strategy for determining the width-parameter of the involved density estimator in a data-depending way.

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
91C20 Clustering in the social and behavioral sciences
62G07 Density estimation

References:

[1] Baíllo, A., Cuesta-Albertos, J. A. and Cuevas, A. (2001). Convergence rates in nonparametric estimation of level sets. Statist. Probab. Lett. 53 27-35. · Zbl 0980.62022 · doi:10.1016/S0167-7152(01)00006-2
[2] Baíllo, A., Cuevas, A. and Justel, A. (2000). Set estimation and nonparametric detection. Canad. J. Statist. 28 765-782. · Zbl 1057.62026 · doi:10.2307/3315915
[3] Ben-David, S. and Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. J. Comput. System Sci. 55 171-182. · Zbl 0880.68106 · doi:10.1006/jcss.1997.1507
[4] Chaón, J. C. (2014). A population background for nonparametric density-based clustering. Technical report. Available at . arXiv:1408.1381
[5] Chaudhuri, K. and Dasgupta, S. (2010). Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems 23 (J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel and A. Culotta, eds.) 343-351.
[6] Cuevas, A. and Fraiman, R. (1997). A plug-in approach to support estimation. Ann. Statist. 25 2300-2312. · Zbl 0897.62034 · doi:10.1214/aos/1030741073
[7] Devroye, L. and Wise, G. L. (1980). Detection of abnormal behavior via nonparametric estimation of the support. SIAM J. Appl. Math. 38 480-488. · Zbl 0479.62028 · doi:10.1137/0138038
[8] Donoho, D. L. (1988). One-sided inference about functionals of a density. Ann. Statist. 16 1390-1420. · Zbl 0665.62040 · doi:10.1214/aos/1176351045
[9] Giné, E. and Guillou, A. (2002). Rates of strong uniform consistency for multivariate kernel density estimators. Ann. Inst. Henri Poincaré Probab. Stat. 38 907-921. · Zbl 1011.62034 · doi:10.1016/S0246-0203(02)01128-7
[10] Hartigan, J. A. (1975). Clustering Algorithms . Wiley, New York. · Zbl 0372.62040
[11] Hartigan, J. A. (1981). Consistency of single linkage for high-density clusters. J. Amer. Statist. Assoc. 76 388-394. · Zbl 0468.62053 · doi:10.2307/2287840
[12] Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. J. Amer. Statist. Assoc. 82 267-270. · Zbl 0607.62045 · doi:10.2307/2289162
[13] Kpotufe, S. and von Luxburg, U. (2011). Pruning nearest neighbor cluster trees. In Proceedings of the 28 th International Conference on Machine Learning (L. Getoor and T. Scheffer, eds.) 225-232. ACM, New York.
[14] Maier, M., Hein, M. and von Luxburg, U. (2009). Optimal construction of \(k\)-nearest-neighbor graphs for identifying noisy clusters. Theoret. Comput. Sci. 410 1749-1764. · Zbl 1167.68045 · doi:10.1016/j.tcs.2009.01.009
[15] Müller, D. W. and Sawitzki, G. (1991). Excess mass estimates and tests for multimodality. J. Amer. Statist. Assoc. 86 738-746. · Zbl 0733.62040 · doi:10.2307/2290406
[16] Polonik, W. (1995). Measuring mass concentrations and estimating density contour clusters-An excess mass approach. Ann. Statist. 23 855-881. · Zbl 0841.62045 · doi:10.1214/aos/1176324626
[17] Rigollet, P. (2007). Generalized error bounds in semi-supervised classification under the cluster assumption. J. Mach. Learn. Res. 8 1369-1392. · Zbl 1222.68288
[18] Rigollet, P. and Vert, R. (2009). Optimal rates for plug-in estimators of density level sets. Bernoulli 15 1154-1178. · Zbl 1200.62034 · doi:10.3150/09-BEJ184
[19] Rinaldo, A., Singh, A., Nugent, R. and Wasserman, L. (2012). Stability of density-based clustering. J. Mach. Learn. Res. 13 905-948. · Zbl 1283.62130
[20] Rinaldo, A. and Wasserman, L. (2010). Generalized density clustering. Ann. Statist. 38 2678-2722. · Zbl 1200.62066 · doi:10.1214/10-AOS797
[21] Scovel, C., Hush, D. and Steinwart, I. (2005). Learning rates for density level detection. Anal. Appl. ( Singap. ) 3 357-371. · Zbl 1101.68622 · doi:10.1142/S0219530505000625
[22] Singh, A., Scott, C. and Nowak, R. (2009). Adaptive Hausdorff estimation of density level sets. Ann. Statist. 37 2760-2782. · Zbl 1173.62019 · doi:10.1214/08-AOS661
[23] Sriperumbudur, B. K. and Steinwart, I. (2012). Consistency and rates for clustering with DBSCAN. In Proceedings of the 15 th International Conference on Artificial Intelligence and Statistics 2012 (N. Lawrence and M. Girolami, eds.). JMLR Workshop and Conference Proceedings 22 1090-1098.
[24] Steinwart, I. (2011). Adaptive density level set clustering. In Proceedings of the 24 th Conference on Learning Theory 2011 (S. Kakade and U. von Luxburg, eds.). JMLR Workshop and Conference Proceedings 19 703-738. JMRL.
[25] Steinwart, I. (2015). Supplement to “Fully adaptive density-based clustering.” . · Zbl 1327.62382 · doi:10.1214/15-AOS1331
[26] Steinwart, I., Hush, D. and Scovel, C. (2005). A classification framework for anomaly detection. J. Mach. Learn. Res. 6 211-232. · Zbl 1222.68309
[27] Stuetzle, W. (2003). Estimating the cluster type of a density by analyzing the minimal spanning tree of a sample. J. Classification 20 25-47. · Zbl 1055.62075 · doi:10.1007/s00357-003-0004-6
[28] Stuetzle, W. and Nugent, R. (2010). A generalized single linkage method for estimating the cluster tree of a density. J. Comput. Graph. Statist. 19 397-418. · doi:10.1198/jcgs.2009.07049
[29] Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Ann. Statist. 25 948-969. · Zbl 0881.62039 · doi:10.1214/aos/1069362732
[30] Walther, G. (1997). Granulometric smoothing. Ann. Statist. 25 2273-2299. · Zbl 0919.62026 · doi:10.1214/aos/1030741072
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.