×

Automatic topography of high-dimensional data sets by non-parametric density peak clustering. (English) Zbl 1484.62075

Summary: Data analysis in high-dimensional spaces aims at obtaining a synthetic description of a data set, revealing its main structure and its salient features. We here introduce an approach providing this description in the form of a topography of the data, namely a human-readable chart of the probability density from which the data are harvested. The approach is based on an unsupervised extension of Density Peak clustering and on a non-parametric density estimator that measures the probability density in the manifold containing the data. This allows finding automatically the number and the height of the peaks of the probability density, and the depth of the “valleys” separating them. Importantly, the density estimator provides a measure of the error, which allows distinguishing genuine density peaks from density fluctuations due to finite sampling. The approach thus provides robust and visual information about the density peaks height, their statistical reliability and their hierarchical organization, offering a conceptually powerful extension of the standard clustering partitions. We show that this framework is particularly useful in the analysis of complex data sets.

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62G07 Density estimation

References:

[1] Bentley, J. L., Multidimensional binary search trees used for associative searching, Commun. ACM, 18, 509-517 (1975) · Zbl 0306.68061
[2] Blei, D. M.; Jordan, M. I., Variational inference for dirichlet process mixtures, Bayesian Anal., 1, 121-143 (2006) · Zbl 1331.62259
[3] Bunte, K.; Biehl, M.; Hammer, B., A general framework for dimensionality-reducing data visualization mapping, Neural Comput., 24, 771-804 (2012) · Zbl 1238.68117
[4] Camastra, F.; Staiano, A., Intrinsic dimension estimation: advances and open problems, Inf. Sci., 328, 26-41 (2016) · Zbl 1392.62171
[5] R.J. Campello, D. Moulavi, J. Sander, Density-based clustering based on hierarchical density estimates, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2013, Springer, pp. 160-172.
[6] Ceriotti, M.; Tribello, G. A.; Parrinello, M., Simplifying the representation of complex free-energy landscapes using sketch-map, Proc. Nat. Acad. Sci., 108, 13023-13028 (2011)
[7] Chang, H.; Yeung, D. Y., Robust path-based spectral clustering, Pattern Recogn., 41, 191-203 (2008) · Zbl 1122.68525
[8] Chaudhuri, K.; Dasgupta, S.; Kpotufe, S.; von Luxburg, U., Consistent procedures for cluster tree estimation and pruning, IEEE Trans. Inf. Theory, 60, 7900-7912 (2014) · Zbl 1359.62234
[9] Coifman, R. R.; Lafon, S.; Lee, A. B.; Maggioni, M.; Nadler, B.; Warner, F.; Zucker, S. W., Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps, Proce. Nat. Acad. Sci. USA, 102, 7426-7431 (2005) · Zbl 1405.42043
[10] Comaniciu, D.; Meer, P., Mean shift: a robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell., 24, 603-619 (2002)
[11] Ester, M.; Kriegel, H. P.; Sander, J.; Xu, X., A density-based algorithm for discovering clusters in large spatial databases with noise, (Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (1996), AAAI Press), 226-231
[12] Facco, E.; d’Errico, M.; Rodriguez, A.; Laio, A., Estimating the intrinsic dimension of datasets by a minimal neighborhood information, Sci. Rep., 7, 12140 (2017)
[13] Facco, E.; Pagnani, A.; Russo, E. T.; Laio, A., The intrinsic dimension of protein sequence evolution, PLOS Comput. Biol., 15, 1-16 (2019)
[14] Finn, R. D.; Bateman, A.; Clements, J.; Coggill, P.; Eberhardt, R. Y.; Eddy, S. R.; Heger, A.; Hetherington, K.; Holm, L.; Mistry, J.; Sonnhammer, E. L.L.; Tate, J.; Punta, M., Pfam: the protein families database, Nucleic Acids Res., 42, D222-D230 (2014)
[15] Finn, R. D.; Coggill, P.; Eberhardt, R. Y.; Eddy, S. R.; Mistry, J.; Mitchell, A. L.; Potter, S. C.; Punta, M.; Qureshi, M.; Sangrador-Vegas, A.; Salazar, G. A.; Tate, J.; Bateman, A., The Pfam protein families database: towards a more sustainable future, Nucleic Acids Res., 44, D279-D285 (2016)
[16] Gionis, A.; Mannila, H.; Tsaparas, P., Clustering aggregation, ACM Trans. Knowl. Discovery Data (TKDD), 1, 4 (2007)
[17] Gisbrecht, A.; Hammer, B., Data visualization by nonlinear dimensionality reduction, Wiley Interdisc. Rev. Data Min. Knowl. Discovery, 5, 51-73 (2015)
[18] Granata, D.; Carnevale, V., Accurate estimation of the intrinsic dimension using graph distances: Unraveling the geometric complexity of datasets, Sci. Rep., 6, 31377 (2016)
[19] Hartigan, J. A., Consistency of single linkage for high-density clusters, J. Am. Stat. Assoc., 76, 388-394 (1981) · Zbl 0468.62053
[20] Hartigan, J. A., Consistency of single linkage for high-density clusters, J. Am. Stat. Assoc., 76, 388-394 (1981) · Zbl 0468.62053
[21] Hess, S.; Duivesteijn, W.; Honysz, P.; Morik, K., The spectacl of nonconvex clustering: a spectral approach to density-based clustering, in, (Proceedings of the AAAI Conference on Artificial Intelligence (2019)), 3788-3795
[22] A.K. Jain, M.H. Law, Data clustering: a user’s dilemma, in: International Conference on Pattern Recognition and Machine Intelligence, 2005, Springer, pp. 1-10
[23] Jiang, J.; Chen, Y.; Meng, X.; Wang, L.; Li, K., A novel density peaks clustering algorithm based on k nearest neighbors for improving assignment process, Physica A, 523, 702-713 (2019) · Zbl 07563409
[24] LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P., Gradient-based learning applied to document recognition, Proc. IEEE, 86, 2278-2324 (1998)
[25] Levina, E.; Bickel, P. J., Maximum likelihood estimation of intrinsic dimension, (Saul, L. K.; Weiss, Y.; Bottou, L., Advances in Neural Information Processing Systems 17 (2005), MIT Press), 777-784
[26] Liang, Z.; Chen, P., Delta-density based clustering with a divide-and-conquer strategy: 3dc clustering, Pattern Recogn. Lett., 73, 52-59 (2016)
[27] Maaten, L.v.d.; Hinton, G., Visualizing data using t-sne, J. Mach. Learn. Res., 9, 2579-2605 (2008) · Zbl 1225.68219
[28] McInnes, L.; Healy, J.; Astels, S., hdbscan: hierarchical density based clustering, J. Open Source Software, 2 (2017)
[29] R. Mehmood, G. Zhang, R. Bie, H. Dawood, H. Ahmad, Clustering by fast search and find of density peaks via heat diffusion. Neurocomputing 208 (2016) 210-217. SI: BridgingSemantic.
[30] Minnotte, M. C., Nonparametric testing of the existence of modes, Ann. Stat., 1646-1660 (1997) · Zbl 0936.62056
[31] Neyman, J.; Pearson, E. S., On the problem of the most efficient tests of statistical hypotheses, Philos. Trans. Roy. Soc. Lond. Ser. A Contain. Papers Math. Phys. Charact., 231, 289-337 (1933) · JFM 59.1163.02
[32] Ng, A. Y.; Jordan, M. I.; Weiss, Y., On spectral clustering: analysis and an algorithm, Adv. Neural Inf. Process. Syst., 849-856 (2002)
[33] Omohundro, S. M., Five Balltree Construction Algorithms (1989), International Computer Science Institute Berkeley
[34] V.S. Pande, K. Beauchamp, G.R. Bowman, Everything you wanted to know about markov state models but were afraid to ask. Methods 52 (2010) 99-105. Protein Folding.
[35] Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, E., Scikit-learn: machine learning in Python, J. Mach. Learn. Res., 12, 2825-2830 (2011) · Zbl 1280.68189
[36] Ringnér, M., What is principal component analysis?, Nat. Biotechnol., 26, 303-304 (2008)
[37] Rodriguez, A.; d’Errico, M.; Facco, E.; Laio, A., Computing the free energy without collective variables, J. Chem. Theory Comput., 14, 1206-1215 (2018)
[38] Rodriguez, A.; Laio, A., Clustering by fast search and find of density peaks, Science, 344, 1492-1496 (2014)
[39] Roweis, S. T.; Saul, L. K., Nonlinear dimensionality reduction by locally linear embedding, Science, 290, 2323-2326 (2000)
[40] E.T. Russo, A. Laio, M. Punta, Dpcfam: a new method for unsupervised protein family classification, 2020. bioRxiv.
[41] Shieh, A. D.; Hashimoto, T. B.; Airoldi, E. M., Tree preserving embedding, Proc. Nat. Acad. Sci., 108, 16916-16921 (2011)
[42] Silverman, B. W., Using kernel density estimates to investigate multimodality, J. Roy. Stat. Soc. Ser. B (Methodol.), 97-99 (1981)
[43] Simard, P.; LeCun, Y.; Denker, J. S., Efficient pattern recognition using a new transformation distance, Adv. Neural Inf. Process. Syst., 50-58 (1993)
[44] Sittel, F.; Stock, G., Robust density-based clustering to identify metastable conformational states of proteins, J. Chem. Theory Comput., 12, 2426-2435 (2016)
[45] Sormani, G.; Rodriguez, A.; Laio, A., Explicit characterization of the free-energy landscape of a protein in the space of all its cα)carbons, J. Chem. Theory Comput., 16, 80-87 (2019)
[46] Tenenbaum, J. B.; De Silva, V.; Langford, J. C., A global geometric framework for nonlinear dimensionality reduction, Science, 290, 2319-2323 (2000)
[47] Torgerson, W. S., Multidimensional scaling: I. Theory and method, Psychometrika, 17, 401-419 (1952) · Zbl 0049.37603
[48] Vinh, N. X.; Epps, J.; Bailey, J., Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance, J. Mach. Learn. Res., 11, 2837-2854 (2010) · Zbl 1242.62062
[49] Wang, Y.; Pang, W.; Zhou, Y., Density propagation based adaptive multi-density clustering algorithm, Plos One, 13, Article e0198948 pp. (2018)
[50] Xu, D.; Tian, Y., A comprehensive survey of clustering algorithms, Ann. Data Sci., 2, 165-193 (2015)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.