A variable-selection heuristic for K-means clustering

1302 Accesses
97 Citations
Explore all metrics

Abstract

One of the most vexing problems in cluster analysis is the selection and/or weighting of variables in order to include those that truly define cluster structure, while eliminating those that might mask such structure. This paper presents a variable-selection heuristic for nonhierarchical (K-means) cluster analysis based on the adjusted Rand index for measuring cluster recovery. The heuristic was subjected to Monte Carlo testing across more than 2200 datasets with known cluster structure. The results indicate the heuristic is extremely effective at eliminating masking variables. A cluster analysis of real-world financial services data revealed that using the variable-selection heuristic prior to the K-means algorithm resulted in greater cluster stability.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

References

Anderberg, M.R. (1973).Cluster analysis for applications. New York, NY: Academic Press.
Google Scholar
Arabie, P., & Hubert, L.J. (1994). Cluster analysis in marketing research. In R.P. Bagozzi (Ed),Advanced methods in marketing research (pp. 160–189). Oxford, England: Blackwell.
Google Scholar
Arabie, P., & Hubert, L.J. (1996). An overview of combinatorial data analysis. In P. Arabie, L.J. Hubert, & G. De Soete (Eds),Clustering and classification (pp. 5–63). River Edge, NJ: World Scientific Publishing.
Google Scholar
Art, D., Gnanadesikan, R., & Kettenring, J.R. (1982). Data-based metrics for cluster analysis.Utilitas Mathematica, Series A, 21, 75–99.
Google Scholar
Balakrishnan, P.V., Cooper, M.C., Jacob, V.S., & Lewis, P.A. (1994). A study of the classification capabilities of neural networks using unsupervised learning: A comparison with K-means clustering.Psychometrika, 59, 509–525.
Google Scholar
Balasubramanian, S., Gupta, S., Kamakura, W., & Wedel, M. (1998). Modelling large data sets in marketing.Statistica Neerlandica, 52, 303–323.
Google Scholar
Berry, M.J.A., & Linoff, G. (1997).Data mining techniques: For marketing, sales, and customer support. New York, NY: John Wiley & Sons.
Google Scholar
Blattberg, R., Glazer, R., & Little, J. (1994).The marketing information revolution. Boston, MA: Harvard Business School Press.
Google Scholar
Box, G.E.P., & Muller, M.E. (1958). A note on the generation of random normal deviates.Annals of Mathematical Statistics, 29, 610–611.
Google Scholar
Breckenridge, J.N. (1989). Replicating cluster analysis: Method, consistency, and validity.Multivariate Behavioral Research, 24, 147–161.
Google Scholar
Carmone, F.J., Kara, A., & Maxwell, S. (1999). HINoV: A new model to improve market segmentation by identifying noisy variables.Journal of Marketing Research, 36, 501–509.
Google Scholar
Chaturvedi, A., Carroll, J.D., Green, P.E., & Rotondo, J.A. (1997). A feature-based approach to market segmentation via overlapping K-centroids clustering.Journal of Marketing Research, 34, 370–377.
Google Scholar
Cheng, R., & Milligan, G.W. (1996). K-means clustering methods with influence detection.Educational and Psychological Measurement, 56, 833–838.
Google Scholar
Cormack, R.M. (1971). A review of classification (with Discussion).Journal of the Royal Statistical Society, Series A, 134, 321–367.
Google Scholar
DeSarbo, W.S., Carroll, J.D., Clark, L.A., & Green, P.E. (1984). Synthesized clustering: A method for amalgamating alternative clustering bases with different weighting of variables.Psychometrika, 49, 57–78.
Google Scholar
DeSarbo, W.S., Manrai, A.K., & Manrai, L.A. (1993). Non-spatial tree models for the assessment of competitive market structure: An integrated review of the marketing and psychometric literature. In J. Eliashberg & G. Lilien (Eds),Handbook in operations research and management science: Marketing (pp. 193–257), New York, NY: Elsevier.
Google Scholar
De Soete, G. (1986). Optimal variable weighting for ultrametric and additive tree clustering.Quality and Quantity, 20, 169–180.
Google Scholar
De Soete, G. (1988). OVWTRE: A program for optimal variable weighting for ultrametric and additive tree fitting.Journal of Classification, 5, 101–104.
Google Scholar
De Soete, G., DeSarbo, W.S., & Carroll, J.D. (1985). Optimal variable weighting for hierarchical clustering: An alternating least-squares algorithm.Journal of Classification, 2, 173–192.
Google Scholar
Fowlkes, E.B., Gnanadesikan, R., & Kettenring, J.R. (1987). Variable selection in clustering and other contexts. In C.L. Mallows (Ed.),Design, data, and analysis (pp. 13–34). New York, NY: John Wiley & Sons.
Google Scholar
Fowlkes, E.B., Gnanadesikan, R., & Kettenring, J.R. (1988). Variable selection in clustering.Journal of Classification, 5, 205–228.
Google Scholar
Fowlkes, E.B., & Mallows, C.L. (1983). A method for comparing two hierarchical clusterings (with comments and rejoinder).Journal of the American Statistical Association, 78, 553–584.
Google Scholar
Friedman, H.P., & Rubin, J. (1967). On some invariant criteria for grouping data.Journal of the American Statistical Association, 62, 1159–1178.
Google Scholar
Gnanadesikan, R., Kettenring, J.R., & Tsao, S.L. (1995). Weighting and selection of variables for cluster analysis.Journal of Classification, 12, 113–136.
Google Scholar
Green, P.E., Carmone, F.J., & Kim, J. (1990). A preliminary study of optimal variable weighting in K-means clustering.Journal of Classification, 7, 271–285.
Google Scholar
Helsen, K., & Green, P.E. (1991). A computational study of replicated clustering with an application to market segmentation.Decision Sciences, 22, 1124–1141.
Google Scholar
Hubert, L., & Arabie, P. (1985). Comparing partitions.Journal of Classification, 2, 193–218.
Google Scholar
Knuth, D.E. (1997).The art of computing: Vol. 1. Fundamental algorithms. Reading, MA: Addison-Wesley.
Google Scholar
Krieger, A., & Green, P.E. (1999). A generalized rand-index method for consensus clustering of separate partitions of the same data base.Journal of Classification, 16, 63–89.
Google Scholar
Kruskal, J.B. (1972). Linear transformations of multivariate data to reveal clustering. In R.N. Shepard, A.K. Romney, & S.B. Nerlove (Eds),Multidimensional scaling: Theory and applications in the behavioral sciences (pp. 181–191). New York, NY: Seminar Press.
Google Scholar
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations.Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, I, 231–297.
Google Scholar
McIntyre, R.M., & Blashfield, R.K. (1980). A nearest-centroid technique for evaluating the minimum variance clustering procedure.Multivariate Behavioral Research, 15, 225–238.
Google Scholar
Milligan, G.W. (1980). An examination of six types of the effect of six types of error perturbation on fifteen clustering algorithms.Psychometrika, 45, 325–342.
Google Scholar
Milligan, G.W. (1985). An algorithm for generating artificial test clusters.Psychometrika, 50, 123–127.
Google Scholar
Milligan, G.W. (1989). A validation study of a variable-weighting algorithm for cluster analysis.Journal of Classification, 6, 53–71.
Google Scholar
Milligan, G.W. (1996). Clustering validation: Results and implications for applied analyses. In P. Arabie, L.J. Hubert, & G. De Soete (Eds.),Clustering and classification (pp. 341–375). River Edge, NJ, World Scientific Publishing.
Google Scholar
Milligan, G.W., & Cooper, M.C. (1986). A study of the comparability of external criteria for hierarchical cluster analysis.Multivariate Behavioral Research, 21, 441–458.
Google Scholar
Milligan, G.W., & Cooper, M.C. (1988). A study of the standardization of variables in cluster analysis.Journal of Classification, 5, 181–204.
Google Scholar
Milligan, G.W., Soon, S.C., & Sokol, L.M. (1983). The effect of cluster size, dimensionality, and the number of clusters on the recovery of true cluster structure.IEEE Transactions on Pattern Analysis and Machine Intelligence, 5, 40–47.
Google Scholar
Morey, L.C., Blashfield, R.K., & Skinner, H.A. (1983). A comparison of cluster analysis techniques within a sequential validation framework.Multivariate Behavioral Research, 18, 309–329.
Google Scholar
Rand, W.M. (1971). Objective criteria for evaluating clustering methods.Journal of the American Statistical Association, 66, 846–850.
Google Scholar
Rohlf, F.J. (1970). Adaptive hierarchical clustering schemes.Systematic Zoology, 19, 58–82.
Google Scholar
Salstone, R., & Stange, K. (1996). A computer program to calculate Hubert and Arabie's adjusted Rand index.Journal of Classification, 13, 169–172.
Google Scholar
Ward, J.H. (1963). Hierarchical grouping to optimize an objective function.Journal of the American Statistical Association, 58, 236–244.
Google Scholar
Waller, N.G., Kaiser, H.A., Illian, J.B., & Manry, M. (1998). A comparison of the classification capabilities of the 1-dimensional Kohonen neural network with two partitioning and three hierarchical cluster analysis algorithms.Psychometrika, 63, 5–22.
Google Scholar
Wedel, M., & Kamakura, W.A. (1997).Market segmentation: Conceptual and methodological foundations. Boston, MA: Kluwer Academic Publishers.
Google Scholar

Download references

Author information

Authors and Affiliations

Marketing Department, College of Business, Florida State University, 32306-1110, Tallahassee, FL
Michael J. Brusco & J. Dennis Cradit

Authors

Michael J. Brusco
View author publications
You can also search for this author in PubMed Google Scholar
J. Dennis Cradit
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Michael J. Brusco.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Brusco, M.J., Cradit, J.D. A variable-selection heuristic for K-means clustering. Psychometrika 66, 249–270 (2001). https://doi.org/10.1007/BF02294838

Download citation

Received: 18 May 1999
Revised: 18 April 2000
Issue Date: June 2001
DOI: https://doi.org/10.1007/BF02294838

A variable-selection heuristic for K-means clustering

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Hierarchical Means Clustering

Combinatorial Optimization Approaches for Data Clustering

Data clustering: application and trends

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Key words

Subscribe and save

Buy Now

Navigation

A variable-selection heuristic for K-means clustering

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Hierarchical Means Clustering

Combinatorial Optimization Approaches for Data Clustering

Data clustering: application and trends

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Key words

Subscribe and save

Buy Now

Search

Navigation