×

SingleCross-clustering: an algorithm for finding elongated clusters with automatic estimation of outliers and number of clusters. (English) Zbl 07545871

Summary: Many clustering methods perform well with spherical clusters but poorly with elongated clusters. The Single-linkage method is suitable for finding such type of long clusters, but it can be sensitive to outliers and noise in the data, causing the so-called chain-effect. This work proposes a modification of Cross-Clustering algorithm, the SingleCross-Clustering (SCC), a partial clustering algorithm that estimates the number of clusters, recognizes outliers and that is useful for the identification of elongated clusters. SCC has been validated by comparing it with a number of existing clustering methods, showing on both simulated and real datasets that SCC is a reliable solution for the identification of the correct number of clusters and of the number of clusters memberships. The algorithm has been implemented in the R package CrossClustering, which can be downloaded for free from the CRAN contributed package repository.

MSC:

62-XX Statistics
Full Text: DOI

References:

[1] Ailon, N.; Charikar, M.; Newman, A., Aggregating inconsistent information: Ranking and clustering, JACM, 55, 5, 23 (2008) · Zbl 1325.68102
[2] Amadasun, M.; King, R. A., Low-level segmentation of multispectral images via agglomerative clustering of uniform neighbourhoods, Pattern Recognition, 21, 3, 261-8 (1988) · doi:10.1016/0031-3203(88)90060-X
[3] Azzalini, A.; Bowman, A. W., A look at some data on the old faithful geyser, Applied Statistics, 39, 3, 357-65 (1990) · Zbl 0707.62186
[4] Banfield, J.; Raftery, A., Model-Based Gaussian and Non-Gaussian Clustering, Biometrics, 49, 3, 803-21 (1993) · Zbl 0794.62034 · doi:10.2307/2532201
[5] Bien, J.; Tibshirani, R., Hierarchical clustering with prototypes via minimax linkage, Journal of the American Statistical Association, 106, 495, 1075 (2011) · Zbl 1229.62083
[6] Birch, F.; Kennedy, G. C., Geophysical Monograph Series. vol. 16, Notes on geyser temperatures in Iceland and Yellowstone National Park, 329-36 (1972), Washington, DC: American Geophysical Union, Washington, DC
[7] Brusco, M. J.; Cradit, J. D., Bicriterion methods for partitioning dissimilarity matrices, British Journal of Mathematical and Statistical Psychology, 58, 2, 319-32 (2005) · doi:10.1348/000711005X63890
[8] Burney, S. M.; Tariq, H., K-means cluster analysis for image segmentation, International Journal of Computer Applications, 96, 4 (2014) · doi:10.5120/16779-6360
[9] Cassisi, C.; Montalto, P.; Aliotta, M.; Cannata, A.; Pulvirenti, A., Advances in Data Mining Knowledge Discovery and Applications, ed. A. Karahoca., Similarity measures and dimensionality reduction techniques for time series data mining (2012)
[10] Chawla, S.; Gionis, A., k-means: A unified approach to clustering and outlier detection, Proceedings of the 2013 SIAM International Conference on Data Mining, 189-197 (2013)
[11] Cook, R. D.; Weisberg, S., Residuals and influence in regression (1982), London: Chapman and Hall, London · Zbl 0564.62054
[12] Cornea, A.; Conn, P. M., Fluorescence microscopy: Super-resolution and other novel techniques (2014)
[13] Delattre, M.; Hansen, P., Bicriterion cluster analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-2, 4, 277-91 (1980) · Zbl 0458.62049 · doi:10.1109/TPAMI.1980.4767027
[14] Denby, L.; Pregibon, D., An example of the use of graphics in regression, The American Statistician, 41, 33-8 (1987) · doi:10.2307/2684315
[15] Dilts, D.; Khamalah, J.; Plotkin, A., Using cluster analysis for medical resource decision making, Medical Decision Making, 15, 4, 333-47 (1995) · doi:10.1177/0272989X9501500404
[16] Dubes, R. C., How many clusters are best? An experiment, Pattern Recognition, 20, 6, 645-63 (1987) · doi:10.1016/0031-3203(87)90034-3
[17] Duda, P. O.; Hart, P. E.; Stork, D. G., Pattern classification (2001), New York, NY: Wiley, New York, NY · Zbl 0968.68140
[18] Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X., A density-based algorithm for discovering clusters in large spatial databases with noise, 226-31 (1996)
[19] Evans, K.; Love, T.; Thurston, S. W., Outlier identification in model-based cluster analysis, Journal of Classification, 32, 1, 63-84 (2015) · Zbl 1331.62306 · doi:10.1007/s00357-015-9171-5
[20] Everitt, B.; Landau, S.; Leese, M., Cluster analysis (2001) · Zbl 1205.62076
[21] Fisher, D. H., Knowledge acquisition via incremental conceptual clustering, Machine Learning, 2, 2, 139-72 (1987) · doi:10.1007/BF00114265
[22] Florek, K.; Łukaszewicz, J.; Perkal, J.; Steinhaus, H.; Zubrzycki, S., Sur la liaison et la division des points d’un ensemble fini, Colloquium Mathematicum, 2, 3-4, 282-5 (1951) · Zbl 0045.26103 · doi:10.4064/cm-2-3-4-282-285
[23] Forgy, E. W., Cluster analysis of multivariate data: Efficiency versus interpretability of classifications, Biometrics, 21, 768-9 (1965)
[24] Fraley, C.; Raftery, A. E., Model-based clustering, discriminant analysis and density estimation, Journal of the American Statistical Association, 97, 458, 611-31 (2002) · Zbl 1073.62545 · doi:10.1198/016214502760047131
[25] Fukunaga, K., Introduction to statistical pattern recognition (1990), New York, NY: Academic Press, New York, NY · Zbl 0711.62052
[26] Gionis, A.; Mannila, H.; Tsaparas, P., Bicriterion cluster analysis, ICDE, 1, 1, 4 (2005)
[27] Guo, J.; Li, H.; Yang, H.; Wang, R.; Yang, Y.; Ma, M.; Liu, B., Information computing and applications. communications in computer and information science, 391, Study on economic development based on factor and cluster analysis (2013), Berlin: Springer, Berlin
[28] Hahsler, M.; Piekenbrock, M. (2017)
[29] Hartigan, J. A., Clustering algorithms (1975) · Zbl 0372.62040
[30] Hartigan, J. A., Consistency of single linkage for high-density clusters, Journal of the American Statistical Association, 76, 374, 388-94 (1981) · Zbl 0468.62053 · doi:10.1080/01621459.1981.10477658
[31] Hartigan, J. A.; Wong, M. A., Algorithm AS 136: A k-means clustering algorithm, Applied Statistics, 28, 1, 100-8 (1979) · Zbl 0447.62062 · doi:10.2307/2346830
[32] Hubert, L.; Arabie, P., Comparing partitions, Journal of Classification, 2, 1, 193-218 (1985) · doi:10.1007/BF01908075
[33] Kaufman, L.; Rousseeuw, P., Finding groups in data: An introduction to cluster analysis (1990) · Zbl 1345.62009
[34] Kumar, V.; Kumar, S.; Kumar Singh, A., Outlier detection: A clustering-based approach, International Journal of Science and Modern Engineering (IJISME), 1, 7 (2013)
[35] Lei, Y.; Zhili, W.; Luoming, M.; Xuesong, Q., Clustering and recommendation for semantic web service in time series, KSII Transactions on Internet and Information Systems, 8, 8, 2743-62 (2014)
[36] Liao, M.; Li, Y.; Kianifard, F.; Obi, E.; Arcona, S., Cluster analysis and its application to healthcare claims data: A study of end-stage renal disease patients who initiated hemodialysis, BMC Nephrology, 17, 1, 25 (2016) · doi:10.1186/s12882-016-0238-2
[37] Liu, Z.; George, R., Computer and Information Sciences ISCIS 2003, 2869, Lecture Notes on Computer Science, Fuzzy cluster analysis of spatio-temporal data (2003), Springer
[38] Lizunov, V. A.; Stenkula, K.; Troy, A.; Cushman, S. W.; Zimmerberg, J., Insulin regulates Glut4 confinement in plasma membrane clusters in adipose cells, PLoS One, 8, 3, e57559 (2013) · doi:10.1371/journal.pone.0057559
[39] MacQueen, J.
[40] Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., and Hornik, K.. 2017. cluster: Cluster analysis basics and extensions. R package version 2.0.6.
[41] Martin, D.; Fowlkes, C.; Tal, D.; Malik, J., A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics, Proceedings Eighth IEEE International Conference on Computer Vision, Vancouver, BC, Canada. ICCV 2001, July 7-14 (2001)
[42] McQuitty, L. L., Elementary linkage analysis for isolating orthogonal and oblique types and typal relevancies, Educational and Psychological Measurement, 17, 2, 207-22 (1957) · doi:10.1177/001316445701700204
[43] Melnykov, V.; Chen, W. C.; Maitra, R., MixSim: An R package for simulating data to study performance of clustering algorithms, Journal of Statistical Software, 51, 12, 1-25 (2012) · doi:10.18637/jss.v051.i12
[44] Milligan, G. W.; Cooper, M. C., An examination of procedures for determining the number of clusters in a data set, Psychometrika, 50, 2, 159-79 (1985) · doi:10.1007/BF02294245
[45] Milligan, G. W.; Cooper, M. C., A study of the comparability of external criteria for hierarchical cluster analysis, Multivariate Behavioral Research, 21, 4, 441-58 (1986) · doi:10.1207/s15327906mbr2104_5
[46] Milligan, G. W.; Cooper, M. C., Methodology review: Clustering methods, Applied Psychological Measurement, 11, 4, 329-54 (1987) · doi:10.1177/014662168701100401
[47] Moonesinghe, H. D. K.; Tan, P. N., OutRank: A graph-based outlier detection framework using random walk, International Journal on Artificial Intelligence Tools, 17, 1, 19-36 (2008) · doi:10.1142/S0218213008003753
[48] Nithya, N. S.; Duraiswamy, K.; Gomathy, P., A survey on clustering techniques in medical diagnosis, International Journal of Computer Science Trends and Technology, 1, 2, 17-22 (2013)
[49] Omer, I.; Werman, M., Color lines: Image specific color representation, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 2004, 27 June-2 July, CVPR 2004 (2004)
[50] Postman, M., Distribution of galaxies, clusters, and superclusters (2006), IOP Publishing Ltd
[51] R Core Team, R: A language and environment for statistical computing (2017), Vienna, Austria: R Foundation for Statistical Computing, Vienna, Austria
[52] Rand, W. M., Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, 66, 336, 846-50 (1971) · doi:10.1080/01621459.1971.10482356
[53] Raykov, Y. P.; Boukouvalas, A.; Baig, F.; Little, M. A., What to do when K-means clustering fails: A simple yet principled alternative algorithm, PLoS One, 11, 9, e0162259 (2016) · doi:10.1371/journal.pone.0162259
[54] Rinehart, J. S., Thermal and seismic indications of Old Faithful Geyser’s inner working, Journal of Geophysical Research, 74, 2, 566-73 (1969) · doi:10.1029/JB074i002p00566
[55] SAS Institute Inc, SAS/STAT[textregistered] 9.2 user’s guide (2008), Cary, NC: SAS Institute Inc, Cary, NC
[56] Scrucca, L.; Fop, M.; Murphy, T. B.; Raftery, A. E., mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models, The R Journal, 8, 1, 289-33 (2016) · doi:10.32614/RJ-2016-021
[57] Shahid, R.; Bertazzon, S.; Knudtson, M. L.; Ghali, W. A., Comparison of distance measures in spatial analytical modeling for health service planning, BMC Health Services Research, 9, 1, 200 (2009) · doi:10.1186/1472-6963-9-200
[58] Silverman, B. W., Some aspects of the spline smoothing approach to non-parametric regression curve fitting (with discussion), Journal of the Royal Statistical Society: Series B (Methodological), 47, 1-52 (1985) · Zbl 0606.62038 · doi:10.1111/j.2517-6161.1985.tb01327.x
[59] Sneath, P. H. A., The application of computers to taxonomy, Microbiology, 17, 201-26 (1957) · doi:10.1099/00221287-17-1-201
[60] Specht, S.; Heidbach, O.; Cotton, F.; Zang, A. (2017)
[61] Steinley, D., Properties of the Hubert-Arable adjusted rand index, Psychological Methods, 9, 3, 386 (2004) · doi:10.1037/1082-989X.9.3.386
[62] Steinley, D.; Brusco, M. J., A new variable weighting and selection procedure for K-means cluster analysis, Multivariate Behavioral Research, 43, 1, 77-108 (2008) · doi:10.1080/00273170701836695
[63] Steinley, D.; Brusco, M. J.; Hubert, L., The variance of the adjusted Rand index, Psychological Methods, 21, 2, 261-72 (2016) · doi:10.1037/met0000049
[64] Su, Z.; Yang, Q.; Zhang, H.; Xu, X.; Hu, Y.-H.; Ma, S., Correlation-based web document clustering for adaptive web interface design, Knowledge and Information Systems, 4, 2, 151-67 (2002) · doi:10.1007/s101150200002
[65] Tellaroli, P.; Bazzi, M.; Donato, M.; Brazzale, A. R.; Drăghici, S., Cross-clustering: A partial clustering algorithm with automatic estimation of the number of clusters, PLoS One, 11, 3, e0152333 (2016) · doi:10.1371/journal.pone.0152333
[66] Tellaroli, P.; Bazzi, M.; Donato, M.; Finos, L.; Courcoux, P. (2018)
[67] Tran, T. N.; Wehrens, R.; Buydens, L. M. C., KNN density-based clustering for high-dimensional multispectral images, Proc. 2nd GRSS/ISPRS Joint Workshop on Remote Sensing and Data Fusion over Urban Areas, URBAN 2003 (2003)
[68] Wang, Y.; Wang, X.; Wang, X. L., Machine Learning and Data Mining in Pattern Recognition, ed. P. Perner, Lecture Notes in Computer Science, vol., 9729, MDLM 2016, A spectral clustering based outlier detection technique (2016), Springer
[69] Ward, J. H. Jr., Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association, 58, 301, 236-44 (1963) · doi:10.1080/01621459.1963.10500845
[70] Weisberg, S., Applied linear regression, 207-11 (1980), New York: Wiley, New York · Zbl 0529.62054
[71] Xu, R.; Wunsch, D., Survey of clustering algorithms. Neural networks, IEEE Transactions on Neural Networks, 16, 3, 645-78 (2005) · doi:10.1109/TNN.2005.845141
[72] Yuen, D. A.; Dzwinel, W.; Ben-Zion, Y.; Kadlec, B. J., Encyclopedia of Complexity and Systems Science, Earthquake Clusters over Multi-dimensional Space, Visualization of, 2347-71 (2009), Springer-Verlag
[74] Zhang, C.; Zhang, X.; Zhang, M.; Li, Y., Neighbor number, valley seeking and clustering, Pattern Recognition Letters, 28, 2, 173-80 (2007) · doi:10.1016/j.patrec.2006.07.003
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.