×

Determining the number of clusters using information entropy for mixed data. (English) Zbl 1234.68343

Summary: In cluster analysis, one of the most challenging and difficult problems is the determination of the number of clusters in a data set, which is a basic input parameter for most clustering algorithms. To solve this problem, many algorithms have been proposed for either numerical or categorical data sets. However, these algorithms are not very effective for a mixed data set containing both numerical attributes and categorical attributes. To overcome this deficiency, a generalized mechanism is presented in this paper by integrating Rényi entropy and complement entropy together. The mechanism is able to uniformly characterize within-cluster entropy and between-cluster entropy and to identify the worst cluster in a mixed data set. In order to evaluate the clustering results for mixed data, an effective cluster validity index is also defined in this paper. Furthermore, by introducing a new dissimilarity measure into the k-prototypes algorithm, we develop an algorithm to determine the number of clusters in a mixed data set. The performance of the algorithm has been studied on several synthetic and real world data sets. The comparisons with other clustering algorithms show that the proposed algorithm is more effective in detecting the optimal number of clusters and generates better clustering results.

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62H30 Classification and discrimination; cluster analysis (statistical aspects)
Full Text: DOI

References:

[1] Han, J. W.; Kamber, M., Data Mining Concepts and Techniques (2001), Morgan Kaufmann: Morgan Kaufmann San Francisco
[2] Jain, A. K.; Murty, M. N.; Flynn, P. J., Data clustering: a review, ACM Computing Surveys, 31, 3, 264-323 (1999)
[3] Xu, R.; Wunsch II, D., Survey of clustering algorithms, IEEE Transactions on Neural Networks, 16, 3, 645-678 (2005)
[4] Jain, A. K., Data clustering: 50 years beyond \(k\)-means, Pattern Recognition Letters, 31, 8, 651-666 (2010)
[5] Kriegel, H. P.; Kröger, P.; Zimek, A., Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering and correlation clustering, ACM Transactions on Knowledge Discovery from Data, 3, 1, 1-58 (2009)
[6] Bai, L.; Liang, J. Y.; Dang, C. Y.; Cao, F. Y., A novel attribute weighting algorithm for clustering high-dimensional categorical data, Pattern Recognition, 44, 12, 2843-2861 (2011) · Zbl 1218.68115
[7] Liao, T. W., Clustering of time series data survey, Pattern Recognition, 38, 11, 1857-1874 (2005) · Zbl 1077.68803
[8] C.C. Aggarwal, J.W. Han, J.Y. Wang, P.S. Yu, A framework for clustering evolving data streams, in: Proceedings of the 29th VLDB Conference, Berlin, Germany, 2003.; C.C. Aggarwal, J.W. Han, J.Y. Wang, P.S. Yu, A framework for clustering evolving data streams, in: Proceedings of the 29th VLDB Conference, Berlin, Germany, 2003.
[9] Cao, F. Y.; Liang, J. Y.; Bai, L.; Zhao, X. W.; Dang, C. Y., A framework for clustering categorical time-evolving data, IEEE Transactions on Fuzzy Systems, 18, 5, 872-882 (2010)
[10] Hunt, L.; Jorgensen, M., Clustering mixed data, WIREs Data Mining and Knowledge Discovery, 1, 4, 352-361 (2011)
[11] V. Estivill-Castro, J. Yang, A fast and robust general purpose clustering algorithm, in: Proceeding of 6th Pacific Rim International Conference Artificial Intelligence, Melbourne, Australia, 2000, pp. 208-218.; V. Estivill-Castro, J. Yang, A fast and robust general purpose clustering algorithm, in: Proceeding of 6th Pacific Rim International Conference Artificial Intelligence, Melbourne, Australia, 2000, pp. 208-218.
[12] Ahmad, A.; Dey, L., A \(k\)-mean clustering algorithm for mixed numeric and categorical data, Data & Knowledge Engineering, 63, 2, 503-527 (2007)
[13] Huang, Z. X., Extensions to the \(k\)-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery, 2, 3, 283-304 (1998)
[14] J.B. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceeding of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281-297.; J.B. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceeding of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281-297. · Zbl 0214.46201
[15] Höppner, F.; Klawonn, F.; Kruse, R., Fuzzy Cluster Analysis: Methods for Classification, Data Analysis, and Image Recognition (1999), Wiley: Wiley New York · Zbl 0944.65009
[16] Z.X. Huang, A fast clustering algorithm to cluster very large categorical data sets in data mining, in: Proceeding of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997, pp. 1-8.; Z.X. Huang, A fast clustering algorithm to cluster very large categorical data sets in data mining, in: Proceeding of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997, pp. 1-8.
[17] Kaufman, L.; Rousseeuw, P., Finding Groups in Data: An Introduction to Cluster Analysis (1990), Wiley · Zbl 1345.62009
[18] T.K. Xiong, S.R. Wang, A. Mayers, E. Monga, DHCC: Divisive hierarchical clustering of categorical data, Data Mining and Knowledge Discovery, doi:10.1007/s10618-011-0221-2; T.K. Xiong, S.R. Wang, A. Mayers, E. Monga, DHCC: Divisive hierarchical clustering of categorical data, Data Mining and Knowledge Discovery, doi:10.1007/s10618-011-0221-2 · Zbl 1235.68208
[19] Li, C.; Biswas, G., Unsupervised learning with mixed numeric and nominal data, IEEE Transactions on Knowledge and Data Engineering, 14, 4, 673-690 (2002)
[20] Hsu, C. C.; Chen, C. L.; Su, Y. W., Hierarchical clustering of mixed data based on distance hierarchy, Information Sciences, 177, 20, 4474-4492 (2007)
[21] S. Guha, R. Rastogi, K. Shim, CURE: An efficient clustering algorithm for large databases, in: Proceeding of ACM SIGMOD International Conference Management of Data, 1998, pp. 73-84.; S. Guha, R. Rastogi, K. Shim, CURE: An efficient clustering algorithm for large databases, in: Proceeding of ACM SIGMOD International Conference Management of Data, 1998, pp. 73-84. · Zbl 1006.68661
[22] Karypis, G.; Han, E.; Kumar, V., Chameleon: hierarchical clustering using dynamic modeling, IEEE Computer, 32, 8, 68-75 (1999)
[23] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, in: Proceeding of ACM SIGMOD International Conference Management of Data, 1996, pp. 103-114.; T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, in: Proceeding of ACM SIGMOD International Conference Management of Data, 1996, pp. 103-114.
[24] Guha, S.; Rastogi, R.; Shim, K., ROCK: a robust clustering algorithm for categorical attributes, Information Systems, 25, 5, 345-366 (2000)
[25] Mirkin, B., Choosing the number of clusters, WIREs Data Mining and Knowledge Discovery, 1, 3, 252-260 (2011)
[26] Sun, H. J.; Wang, S. R.; Jiang, Q. S., FCM-based model selection algorithms for determining the number of clusters, Pattern Recognition, 37, 10, 2027-2037 (2004) · Zbl 1056.68583
[27] Kothari, R.; Pitts, D., On finding the number of clusters, Pattern Recognition Letters, 20, 4, 405-416 (1999)
[28] Li, M. J.; Ng, M. K.; Cheung, Y.; Huang, Z. X., Agglomerative fuzzy \(k\)-means clustering algorithm with selection of number of clusters, IEEE Transactions on Knowledge and Data Engineering, 20, 11, 1519-1534 (2008)
[29] Leung, Y.; Zhang, J. S.; Xu, Z. B., Clustering by scale-space filtering, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 12, 1394-1410 (2000)
[30] Bandyopadhyay, S.; Maulik, U., Genetic clustering for automatic evolution of clusters and application to image classification, Pattern Recognition, 35, 6, 1197-1208 (2002) · Zbl 1001.68926
[31] Bandyopadhyay, S.; Saha, S., A point symmetry-based clustering technique for automatic evolution of clusters, IEEE Transactions on Knowledge and Data Engineering, 20, 11, 1441-1457 (2008)
[32] Bandyopadhyay, S., Genetic algorithms for clustering and fuzzy clustering, WIREs Data Mining and Knowledge Discovery, 1, 6, 524-531 (2011)
[33] Sugar, C. A.; James, G. M., Finding the number of clusters in a data set: an information theoretic approach, Journal of the American Statistical Association, 98, 463, 750-763 (2003) · Zbl 1046.62064
[34] M. Aghagolzadeh, H. Soltanian-Zadeh, B.N. Araabi, A. Aghagolzadeh, Finding the number of clusters in a dataset using an information theoretic hierarchical algorithm, in: Proceedings of the 13th IEEE International Conference on Electronics, Circuits and Systems, 2006, pp. 1336-1339.; M. Aghagolzadeh, H. Soltanian-Zadeh, B.N. Araabi, A. Aghagolzadeh, Finding the number of clusters in a dataset using an information theoretic hierarchical algorithm, in: Proceedings of the 13th IEEE International Conference on Electronics, Circuits and Systems, 2006, pp. 1336-1339. · Zbl 1229.94024
[35] Bai, L.; Liang, J. Y.; Dang, C. Y., An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data, Knowledge-Based Systems, 24, 6, 785-795 (2011)
[36] K.K. Chen, L. Liu, The best \(k\); K.K. Chen, L. Liu, The best \(k\)
[37] Yan, H.; Chen, K. K.; Liu, L.; Bae, J., Determining the best \(k\) for clustering transactional datasets: a coverage density-based approach, Data & Knowledge Engineering, 68, 1, 28-48 (2009)
[38] R. Jensen, Q. Shen, Fuzzy-rough sets for descriptive dimensionality reduction, in: Proceeding of the 2002 IEEE International Conference on Fuzzy Systems, 2002, pp. 29-34.; R. Jensen, Q. Shen, Fuzzy-rough sets for descriptive dimensionality reduction, in: Proceeding of the 2002 IEEE International Conference on Fuzzy Systems, 2002, pp. 29-34.
[39] Shannon, C. E., A mathematical theory of communication, Bell Systems Technical Journal, 27, 3-4, 379-423 (1948) · Zbl 1154.94303
[40] D. Barbara, Y. Li, J. Couto, Coolcat: an entropy-based algorithm for categorical clustering, in: Proceeding of the 2002 ACM CIKM International Conference on Information and Knowledge Management, 2002, pp. 582-589.; D. Barbara, Y. Li, J. Couto, Coolcat: an entropy-based algorithm for categorical clustering, in: Proceeding of the 2002 ACM CIKM International Conference on Information and Knowledge Management, 2002, pp. 582-589.
[41] He, Z. Y.; Deng, S. C.; Xu, X. F., An optimization model for outlier detection in categorical data, (Lecture Notes in Computer Science, vol. 3644 (2005)), 400-409
[42] Düntsch, I.; Gediga, G., Uncertainty measures of rough set prediction, Artificial Intelligence, 106, 1, 109-137 (1998) · Zbl 0909.68040
[43] A. Renyi, On measures of entropy and information, in: Proceeding of the 4th Berkeley Symposium on Mathematics of Statistics and Probability, 1961, pp. 547-561.; A. Renyi, On measures of entropy and information, in: Proceeding of the 4th Berkeley Symposium on Mathematics of Statistics and Probability, 1961, pp. 547-561. · Zbl 0106.33001
[44] Parzen, E., On the estimation of a probability density function and the mode, Annals of Mathematical Statistics, 33, 3, 1065-1076 (1962) · Zbl 0116.11302
[45] Jenssen, R.; Eltoft, T.; Erdogmus, D.; Principe, J. C., Some equivalences between kernel methods and information theoretic methods, Journal of VLSI Signal Processing Systems, 49, 1-2, 49-65 (2006)
[46] Gokcay, E.; Principe, J. C., Information theoretic clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 2, 158-171 (2002)
[47] Liang, J. Y.; Chin, K. S.; Dang, C. Y.; Yam Richard, C. M., A new method for measuring uncertainly and fuzziness in rough set theory, International Journal of General Systems, 31, 4, 331-342 (2002) · Zbl 1010.94004
[48] Qian, Y. H.; Liang, J. Y.; Pedrycz, W.; Dang, C. Y., Positive approximation: an accelerator for attribute reduction in rough set theory, Artificial Intelligence, 174, 9-10, 597-618 (2010) · Zbl 1205.68310
[49] Qian, Y. H.; Liang, J. Y.; Li, D. Y.; Zhang, H. Y.; Y Dang, C., Measures for evaluating the decision performance of a decision table in rough set theory, Information Sciences, 178, 1, 181-202 (2008) · Zbl 1128.68102
[50] Liang, J. Y.; Li, D. Y., Uncertainty and Knowledge Acquisition in Information Systems (2005), Science Press: Science Press Beijing, China
[51] Liang, J. Y.; Shi, Z. Z.; Li, D. Y.; Wierman, M. J., The information entropy, rough entropy and knowledge granulation in incomplete information system, International Journal of General Systems, 35, 6, 641-654 (2006) · Zbl 1115.68130
[52] Halkidi, M.; Vazirgiannis, M., A density-based cluster validity approach using multi-representatives, Pattern Recognition Letters, 29, 6, 773-786 (2008)
[53] Rezaee, M.; Lelieveldt, B.; Reiber, J., A new cluster validity index for the fuzzy c-mean, Pattern Recognition Letters, 19, 3-4, 237-346 (1998) · Zbl 0905.68127
[54] Wang, W. N.; Zhang, Y. J., On fuzzy cluster validity indices, Fuzzy Sets and Systems, 158, 19, 2095-2117 (2007) · Zbl 1123.62046
[55] Gluck, M. A.; Corter, J. E., Information, uncertainty, and the utility of categories, (Proceeding of the 7th Annual Conference of the Cognitive Science Society (1985), Lawrence Erlbaum Associates: Lawrence Erlbaum Associates Irvine, CA), 283-287
[56] Fisher, D. H., Knowledge acquisition via incremental conceptual clustering, Machine Learning, 2, 2, 139-172 (1987)
[57] K. McKusick, K. Thompson, COBWEB/3: A Portable Implementation, Technical Report FIA-90-6-18-2, NASA Ames Research Center, 1990.; K. McKusick, K. Thompson, COBWEB/3: A Portable Implementation, Technical Report FIA-90-6-18-2, NASA Ames Research Center, 1990.
[58] Mirkin, B., Reinterpreting the category utility function, Machine Learning, 45, 2, 219-228 (2001) · Zbl 1007.68155
[59] J. Al-Shaqsi, W.J. Wang, A clustering ensemble method for clustering mixed data, in: The 2010 International Joint Conference on Neural Networks, 2010.; J. Al-Shaqsi, W.J. Wang, A clustering ensemble method for clustering mixed data, in: The 2010 International Joint Conference on Neural Networks, 2010.
[60] W.D. Zhao, W.H. Dai, C.B. Tang, K-centers algorithm for clustering mixed type data, in: Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2007, pp. 1140-1147.; W.D. Zhao, W.H. Dai, C.B. Tang, K-centers algorithm for clustering mixed type data, in: Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2007, pp. 1140-1147.
[61] Bezdek, J. C., Pattern Recognition in Handbook of Fuzzy Computation (1998), IOP Publishing Limited: IOP Publishing Limited Boston, New York, (Chapter F6)
[62] Hubert, L.; Arabie, P., Comparing partitions, Journal of Classification, 2, 1, 193-218 (1985)
[63] M. Bohanec, V. Rajkovic, Knowledge acquisition and explanation for multi-attribute decision making, in: Proceeding of the 8th International Workshop on Expert Systems and Their Applications, Avignon, France, 1988, pp. 59-78.; M. Bohanec, V. Rajkovic, Knowledge acquisition and explanation for multi-attribute decision making, in: Proceeding of the 8th International Workshop on Expert Systems and Their Applications, Avignon, France, 1988, pp. 59-78.
[64] UCI Machine Learning Repository \(\langle\) http://www.ics.uci.edu/mlearn/MLRepository.html \(\rangle \); UCI Machine Learning Repository \(\langle\) http://www.ics.uci.edu/mlearn/MLRepository.html \(\rangle \)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.