
Combining attribute content and label information for categorical data ensemble clustering. (English) Zbl 1508.62163

Summary: Ensemble clustering has attracted increasing attention in recent years because it can combine multiple base clusterings (ensemble members) into a more robust clustering. It consists of two main parts: generating multiple ensemble members and finding a final partition. The construction of the information matrix plays an important role in finding the final partition. In the general categorical data ensemble clustering framework, most existing information matrices are constructed relying only on the label information of the ensemble members, without considering the original information of the data sets. To solve this problem, a new ensemble clustering framework for categorical data is proposed, in which the information matrix takes label information and original data information into account together; in this paper it is instantiated as the ALM matrix. The ALM matrix captures not only the distribution of attribute content in each ensemble member, but also the relationships among ensemble members based on that distribution. For simplicity, the \(k\)-means technique is used to cluster the ALM matrix, forming a new ensemble clustering algorithm. Experimental results show the benefits of the ALM matrix by comparing the proposed algorithm with other ensemble clustering algorithms.
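
The pipeline described above (generate ensemble members, build an information matrix combining label information with attribute content, then cluster that matrix with \(k\)-means) can be sketched in a few lines of Python. The sketch below is illustrative only: the member-generation scheme (k-means on one-hot codes) and the per-cluster attribute-content encoding are simplified stand-ins for the paper's ALM construction, whose exact definition is not given in this summary, and names such as one_hot and info_matrix are hypothetical.

    # A minimal sketch of the general framework, NOT the paper's ALM matrix:
    # (1) generate base clusterings of a categorical data set,
    # (2) build an information matrix mixing the members' labels with the
    #     attribute-content distributions of their clusters,
    # (3) run k-means on that matrix to obtain the final partition.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Toy categorical data: n objects, d attributes, categories coded 0..2.
    n, d, k = 200, 4, 3
    X = rng.integers(0, 3, size=(n, d))

    def one_hot(X):
        """One-hot encode a categorical matrix column by column."""
        blocks = [np.eye(X[:, j].max() + 1)[X[:, j]] for j in range(X.shape[1])]
        return np.hstack(blocks)

    # (1) Ensemble members: k-means on one-hot codes with different seeds
    #     (a stand-in; the paper's generation scheme may differ).
    H = one_hot(X)
    members = [
        KMeans(n_clusters=k, n_init=1, random_state=s).fit_predict(H)
        for s in range(10)
    ]

    # (2) Information matrix: for each member, describe every object by the
    #     attribute-content distribution of the cluster it was assigned to,
    #     so the matrix carries both label and original attribute information.
    cols = []
    for labels in members:
        profile = np.vstack([
            H[labels == c].mean(axis=0) for c in range(k)  # per-cluster content
        ])
        cols.append(profile[labels])   # each row: its own cluster's profile
    info_matrix = np.hstack(cols)

    # (3) Final partition: k-means on the information matrix.
    final = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(info_matrix)
    print(final[:20])

In this toy encoding, two objects end up close in info_matrix when the clusters they fall into, across many members, have similar attribute-value distributions, which is the intuition behind combining attribute content with label information.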

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T05 Learning and adaptive systems in artificial intelligence

Software:

COOLCAT; ROCK; SimRank; UCI-ml
Full Text: DOI

References:

[1] Han, J.; Kamber, M.; Pei, J., Data Mining: Concepts and Techniques (2011)
[2] MacQueen, J., Some methods for classification and analysis of multivariate observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 281-297 (1967) · Zbl 0214.46201
[3] Ester, M.; Kriegel, H. P.; Sander, J.; Xu, X., A density-based algorithm for discovering clusters in large spatial databases with noise, International Conference on Knowledge Discovery and Data Mining, 226-231 (1996)
[4] Jain, A. K., Data clustering: 50 years beyond k-means, Pattern Recognit. Lett., 31, 651-666 (2010)
[5] Jain, A. K.; Murty, M. N.; Flynn, P. J., Data clustering: a review, ACM Comput. Surv., 31, 3, 264-323 (1999)
[6] Xu, R.; Wunsch II, D., Survey of clustering algorithms, IEEE Trans. Neural Netw., 16, 3, 645-678 (2005)
[7] Huang, Z., Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Min. Knowl. Discov., 2, 3, 283-304 (1998)
[8] Huang, Z.; Ng, M. K., A fuzzy k-modes algorithm for clustering categorical data, IEEE Trans. Fuzzy Syst., 7, 4, 446-452 (1999)
[9] Ng, M. K.; Li, M. J.; Huang, Z.; He, Z., On the impact of dissimilarity measure in k-modes clustering algorithm, IEEE Trans. Pattern Anal. Mach. Intell., 29, 3, 503-507 (2007)
[10] Bai, L.; Liang, J.; Dang, C.; Cao, F., The impact of cluster representatives on the convergence of the k-modes type clustering, IEEE Trans. Pattern Anal. Mach. Intell., 35, 6, 1509-1522 (2013)
[11] Cao, F.; Liang, J.; Li, D.; Zhao, X., A weighting k-modes algorithm for subspace clustering of categorical data, Neurocomputing, 108, 5, 23-30 (2013)
[12] Chen, L.; Wang, S.; Wang, K.; Zhu, J., Soft subspace clustering of categorical data with probabilistic distance, Pattern Recognit., 51, 322-332 (2016)
[13] Cao, F.; Huang, Z.; Liang, J.; Zhao, X.; Meng, Y.; Feng, K.; Qian, Y., An algorithm for clustering categorical data with set-valued features, IEEE Trans. Neural Netw. Learn. Syst., 29, 10, 4593-4606 (2018)
[14] Guha, S.; Rastogi, R.; Shim, K., ROCK: A robust clustering algorithm for categorical attributes, Inf. Syst., 25, 5, 345-366 (2000)
[15] Barbara, D.; Couto, J.; Li, Y., COOLCAT: an entropy-based algorithm for categorical clustering, Proceedings of the 11th International Conference on Information and Knowledge Management, 582-589 (2002)
[16] Zhao, X.; Cao, F.; Liang, J., A sequential ensemble clusterings generation algorithm for mixed data, Appl. Math. Comput., 335, 264-277 (2018) · Zbl 1427.68280
[17] Ghosh, J.; Acharya, A., Cluster ensembles, Wiley Interdiscip. Rev. Data Min.Knowl. Discov., 1, 4, 305-315 (2011)
[18] Ayad, H. G.; Kamel, M. S., On voting-based consensus of cluster ensembles, Pattern Recognit., 43, 5, 1943-1953 (2010) · Zbl 1191.68552
[19] Huang, D.; Lai, J.; Wang, C. D., Ensemble clustering using factor graph, Pattern Recognit., 50, 131-142 (2016) · Zbl 1395.62157
[20] Iam-On, N.; Boongoen, T.; Garrett, S.; Price, C., A link-based cluster ensemble approach for categorical data clustering, IEEE Trans. Knowl. Data Eng., 24, 3, 413-425 (2012)
[21] Al-Razgan, M.; Domeniconi, C.; Barbara, D., Random subspace ensembles for clustering categorical data, Supervised and Unsupervised Ensemble Methods and their Applications, 31-48 (2008)
[22] Iam-On, N.; Boongoen, T.; Garrett, S., Refining pairwise similarity matrix for cluster ensemble problem with cluster relations, Proceedings of International Conference on Discovery Science, 222-233 (2008)
[23] Jeh, G.; Widom, J., SimRank: a measure of structural-context similarity, Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 538-543 (2002)
[24] Lu, Y.; Wan, Y., PHA: a fast potential-based hierarchical agglomerative clustering method, Pattern Recognit., 46, 5, 1227-1239 (2013)
[25] Cilibrasi, R. L.; Vitányi, P. M. B., A fast quartet tree heuristic for hierarchical clustering, Pattern Recognit., 44, 3, 662-677 (2011) · Zbl 1209.68448
[26] He, Z.; Xu, X.; Deng, S., A cluster ensemble method for clustering categorical data, Inf. Fusion, 6, 2, 143-151 (2005)
[27] Karypis, G.; Kumar, V., Multilevel k-way partitioning scheme for irregular graphs, J. Parallel Distrib. Comput., 48, 2, 96-129 (1998)
[28] Ng, A.; Jordan, M.; Weiss, Y., On spectral clustering: analysis and an algorithm, Advances in Neural Information Processing Systems, 14, 849-856 (2001)
[29] Jing, L.; Tian, K.; Huang, Z., Stratified feature sampling method for ensemble clustering of high dimensional data, Pattern Recognit., 48, 11, 3688-3702 (2015)
[30] Yu, Z.; Li, L.; Gao, Y.; You, J.; Liu, J.; Wong, H. S.; Han, G., Hybrid clustering solution selection strategy, Pattern Recognit., 47, 10, 3362-3375 (2014)
[31] Chen, H. L.; Chuang, K. T.; Chen, M. S., Labeling unclustered categorical data into clusters based on the important attribute values, IEEE International Conference on Data Mining, 8 (2006)
[32] Cao, F.; Yu, L.; Huang, J. Z.; Liang, J., k-mw-modes: an algorithm for clustering categorical matrix-object data, Appl. Soft Comput., 57, 605-614 (2017)
[33] Bache, K.; Lichman, M., UCI machine learning repository (2014), http://archive.ics.uci.edu/ml
[34] Liang, J.; Bai, L.; Dang, C.; Cao, F., The k-means-type algorithms versus imbalanced data distributions, IEEE Trans. Fuzzy Syst., 20, 4, 728-745 (2012)
[35] Strehl, A.; Ghosh, J., Cluster ensembles—a knowledge reuse framework for combining multiple partitions, J. Mach. Learn. Res., 3, 583-617 (2003) · Zbl 1084.68759
[36] Strehl, A.; Ghosh, J., Cluster ensembles: a knowledge reuse framework for combining multiple partitions, J. Mach. Learn. Res., 3, 583-617 (2002) · Zbl 1084.68759
[37] Liu, H.; Liu, T.; Wu, J.; Tao, D.; Fu, Y., Spectral ensemble clustering, Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 715-724 (2015)
[38] Yu, Z., Graph-based consensus clustering for class discovery from gene expression data, Bioinformatics, 23, 21, 2888-2896 (2007)
[39] Zhao, X.; Liang, J.; Dang, C., Clustering ensemble selection for categorical data based on internal validity indices, Pattern Recognit., 69, 150-168 (2017)
[40] Demšar, J., Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res., 7, 1, 1-30 (2006) · Zbl 1222.68184
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.