A new algorithm for hybrid hierarchical clustering with visualization and the bootstrap. (English) Zbl 1021.62048
Summary: We propose a hybrid clustering method, hierarchical ordered partitioning and collapsing hybrid (HOPACH), which is a hierarchical tree of clusters. The methodology combines the strengths of both partitioning and agglomerative clustering methods. At each node, a cluster is partitioned into two or more smaller clusters with an enforced ordering of the clusters. Collapsing steps uniting the two closest clusters into one cluster can be used to correct for errors made in the partitioning steps.
We implement a version of HOPACH which optimizes a measure of clustering strength, such as average silhouette, at each partitioning and collapsing step. An important benefit of a hierarchical tree is that one can look at clusters at increasing levels of detail. We propose to visualize the clusters at any level of the tree by plotting the distance matrix corresponding with an ordering of the clusters and an ordering of elements within the clusters. A final ordered list of elements is obtained by running down the tree completely. The bootstrap can be used to establish the reproducibility of the clusters and the overall variability of the followed procedure. The power of the methodology compared to current algorithms is illustrated with simulated and publicly available cancer gene expression data sets.
We implement a version of HOPACH which optimizes a measure of clustering strength, such as average silhouette, at each partitioning and collapsing step. An important benefit of a hierarchical tree is that one can look at clusters at increasing levels of detail. We propose to visualize the clusters at any level of the tree by plotting the distance matrix corresponding with an ordering of the clusters and an ordering of elements within the clusters. A final ordered list of elements is obtained by running down the tree completely. The bootstrap can be used to establish the reproducibility of the clusters and the overall variability of the followed procedure. The power of the methodology compared to current algorithms is illustrated with simulated and publicly available cancer gene expression data sets.
MSC:
62H30 | Classification and discrimination; cluster analysis (statistical aspects) |
62A09 | Graphical methods in statistics |
62G09 | Nonparametric statistical resampling methods |
Software:
clusfindReferences:
[1] | Chen, C. H., Statist. Sinica, 12, 7-29 (2002) · Zbl 1027.62047 |
[2] | DeRisi, J.; Penland, L.; Brown, P.; Bittner, M.; Meltzer, P.; Ray, M.; Chen, Y.; Su, Y.; Trent, J., Natur. Genet., 14, 457-460 (1996) |
[3] | Eisen, M.; Spellman, P.; Brown, P.; Botstein, D., Proc. Natl. Acad. Sci., 95, 14863-14868 (1998) |
[4] | Kaufman, L.; Rousseeuw, P., Finding Groups in Data: An Introduction to Cluster Analysis (1990), Wiley: Wiley New York · Zbl 1345.62009 |
[5] | Pollard, K.; van der Laan, M., Math. Biosci., 176, 1, 90-121 (2002) · Zbl 0997.62090 |
[6] | Pollard, K.S., van der Laan, M.J., 2002b. A method to identify significant clusters in gene expression data. Proceedings of SCI2002, Volume II: 318-325.; Pollard, K.S., van der Laan, M.J., 2002b. A method to identify significant clusters in gene expression data. Proceedings of SCI2002, Volume II: 318-325. |
[7] | Ross, D.; Scherf, U.; Eisen, M.; Perou, C.; Rees, C.; Spellman, P.; Iyer, V.; Jeffrey, S.; Van de Rijn, M.; Waltham, M.; Pergamenschikov, A.; Lee, J.; Lashkari, D.; Shalon, D.; Myers, T.; Weinstein, J.; Botstein, D.; Brown, P., Natur. Genet., 24, 227-235 (2000) |
[8] | Sokal, R.R., 1966. Sci. Amer. 106-116.; Sokal, R.R., 1966. Sci. Amer. 106-116. |
[9] | Törönen, P.; Kolehainen, M.; Wong, G.; Castern, E., FEBS Lett., 451, 142-146 (1999) |
[10] | van der Laan, M.; Bryan, J., Biostatistics, 2, 1-17 (2001) |
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.