Hierarchical variable clustering based on the predictive strength between random vectors. (English) Zbl 07885890

Summary: A rank-invariant clustering of variables is introduced that is based on the predictive strength between groups of variables, i.e., two groups are assigned a high similarity if the variables in the first group contain high predictive information about the behaviour of the variables in the other group and/or vice versa. The method presented here is model-free, dependence-based and does not require any distributional assumptions. Various general invariance and continuity properties are investigated, with special attention to those that are beneficial for the agglomerative hierarchical clustering procedure. A fully non-parametric estimator is considered whose excellent performance is demonstrated in several simulation studies and by means of real-data examples.
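The general idea in the summary (a rank-based similarity that is high when one side predicts the other and/or vice versa, fed into an agglomerative procedure) can be illustrated with a small sketch. This is not the estimator of the paper: as a simplifying assumption it uses Chatterjee's rank correlation xi between single variables (reference [23]) as a stand-in for predictive strength, symmetrises it with a maximum, and merges clusters by average linkage rather than by a measure between random vectors. The function names `xi_coefficient`, `similarity_matrix` and `agglomerate` are hypothetical, and the data are simulated.

```python
import numpy as np

def xi_coefficient(x, y):
    """Chatterjee's xi for 'how well x predicts y' (assumes no ties)."""
    n = len(x)
    order = np.argsort(x)                          # sort observations by x
    ranks = np.argsort(np.argsort(y[order])) + 1   # ranks of y in that order
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n**2 - 1)

def similarity_matrix(data):
    """Pairwise similarity: the larger of the two directed xi values."""
    d = data.shape[1]
    sim = np.eye(d)
    for i in range(d):
        for j in range(i + 1, d):
            s = max(xi_coefficient(data[:, i], data[:, j]),
                    xi_coefficient(data[:, j], data[:, i]))
            sim[i, j] = sim[j, i] = s
    return sim

def agglomerate(sim, n_clusters):
    """Repeatedly merge the two clusters with the highest average similarity."""
    clusters = [[i] for i in range(sim.shape[0])]
    while len(clusters) > n_clusters:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = sim[np.ix_(clusters[a], clusters[b])].mean()
                if link > best:
                    best, pair = link, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Simulated example: columns 0 and 1 are functionally related (non-monotonically),
# as are columns 2 and 3; the two pairs are independent of each other.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
z = rng.normal(size=n)
data = np.column_stack([
    x,
    x**2 + 0.05 * rng.normal(size=n),
    z,
    np.sin(3 * z) + 0.05 * rng.normal(size=n),
])
sim = similarity_matrix(data)
clusters = agglomerate(sim, n_clusters=2)
print(sorted(clusters))
```

Because xi depends on the data only through ranks, applying a strictly increasing transformation to any column leaves the similarity matrix, and hence the clustering, unchanged. This mirrors the rank-invariance mentioned in the summary, although only for this pairwise stand-in.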

MSC:

68T37 Reasoning under uncertainty in the context of artificial intelligence

References:

[1] Koch, I., Analysis of Multivariate and High-Dimensional Data, 2013, Cambridge University Press
[2] Bonanno, G.; Caldarelli, G.; Lillo, F.; Micciché, S.; Vandewalle, N.; Mantegna, R., Networks of equities in financial markets, Eur. Phys. J. B, 38, 363-371, 2004
[3] Everitt, B.; Landau, S.; Leese, M.; Stahl, D., Cluster Analysis, 5th ed., 2011, John Wiley & Sons · Zbl 1274.62003
[4] Fuchs, S.; Di Lascio, F. M.L.; Durante, F., Dissimilarity functions for rank-invariant hierarchical clustering of continuous variables, Comput. Stat. Data Anal., 159, Article 107201, 2021 · Zbl 1510.62251
[5] Son, Y. S.; Baek, J., A modified correlation coefficient based similarity measure for clustering time-course gene expression data, Pattern Recognit. Lett., 29, 3, 232-242, 2008
[6] Bonanomi, A.; Nai Ruscone, M.; Osmetti, S. A., Defining subjects distance in hierarchical cluster analysis by copula approach, Qual. Quant., 51, 2, 859-872, 2017
[7] De Luca, G.; Zuccolotto, P., A tail dependence-based dissimilarity measure for financial time series clustering, Adv. Data Anal. Classif., 5, 4, 323-340, 2011
[8] Di Lascio, F. M.L.; Durante, F.; Pappadà, R., Copula-based clustering methods, (Úbeda-Flores, M.; de Amo, E.; Durante, F.; Fernández-Sánchez, J., Copulas and Dependence Models with Applications, 2017, Springer), 49-67 · Zbl 1380.62021
[9] Durante, F.; Pappadà, R.; Torelli, N., Clustering of financial time series in risky scenarios, Adv. Data Anal. Classif., 8, 359-376, 2014 · Zbl 1414.62241
[10] Banerjee, A.; Merugu, S.; Dhillon, I. S.; Ghosh, J., Clustering with Bregman divergences, J. Mach. Learn. Res., 6, 1705-1749, 2005 · Zbl 1190.62117
[11] Emmert-Streib, F.; Dehmer, M., Information Theory and Statistical Learning, 2008, Springer
[12] Jiang, B.; Pei, J.; Tao, Y.; Lin, X., Clustering uncertain data based on probability distribution similarity, IEEE Trans. Knowl. Data Eng., 25, 4, 751-763, 2013
[13] Kojadinovic, I., Agglomerative hierarchical clustering of continuous variables based on mutual information, Comput. Stat. Data Anal., 46, 2, 269-294, 2004 · Zbl 1429.62251
[14] Kraskov, A.; Stögbauer, H.; Andrzejak, R. G.; Grassberger, P., Hierarchical clustering using mutual information, Europhys. Lett., 70, 2, 278, 2005
[15] Martínez Sotoca, J.; Pla, F., Supervised feature selection by clustering using conditional mutual information-based distances, Pattern Recognit., 43, 6, 2068-2081, 2010 · Zbl 1191.68514
[16] Yang, J.; Grunsky, E.; Cheng, Q., A novel hierarchical clustering analysis method based on Kullback-Leibler divergence and application on Dalaimiao geochemical exploration data, Comput. Geosci., 123, 10-19, 2019
[17] Dhaene, J.; Denuit, M.; Goovaerts, M. J.; Kaas, R.; Vyncke, D., The concept of comonotonicity in actuarial science and finance: theory, Insur. Math. Econ., 31, 1, 3-33, 2002 · Zbl 1051.62107
[18] Puccetti, G.; Wang, R., Extremal dependence concepts, Stat. Sci., 30, 4, 485-517, 2015 · Zbl 1426.62156
[19] De Keyser, S.; Gijbels, I., Hierarchical variable clustering via copula-based divergence measures between random vectors, Int. J. Approx. Reason., 165, Article 109090, 2023 · Zbl 07834336
[20] Huang, Z.; Deb, N.; Sen, B., Kernel partial correlation coefficient — a measure of conditional dependence, J. Mach. Learn. Res., 23, 216, 1-58, 2022
[21] Ansari, J.; Fuchs, S., A simple extension of Azadkia & Chatterjee’s rank correlation to a vector of endogenous variables, 2023, Available at
[22] Azadkia, M.; Chatterjee, S., A simple measure of conditional dependence, Ann. Stat., 49, 6, 3070-3102, 2021 · Zbl 1486.62175
[23] Chatterjee, S., A new coefficient of correlation, J. Am. Stat. Assoc., 116, 2009-2022, 2021 · Zbl 1506.62317
[24] Lancaster, H. O., Correlation and complete dependence of random variables, Ann. Math. Stat., 34, 1315-1321, 1963 · Zbl 0121.35905
[25] Ansari, J.; Langthaler, P. B.; Fuchs, S.; Trutschnig, W., Quantifying and estimating dependence via sensitivity of conditional distributions, 2023, Available at
[26] Grabisch, M.; Marichal, J.-L.; Mesiar, R.; Pap, E., Aggregation Functions, 2009, Cambridge University Press · Zbl 1196.00002
[27] Durante, F.; Sempi, C., Principles of Copula Theory, 2016, Cambridge University Press · Zbl 1380.62008
[28] Joe, H., Dependence Modeling with Copulas, 2015, CRC Press, Boca Raton, FL
[29] Nelsen, R. B., An Introduction to Copulas, 2006, Springer · Zbl 1152.62030
[30] Manning, C.; Raghavan, P.; Schütze, H., An Introduction to Information Retrieval, 2009, CRC Press, Boca Raton, FL
[31] Puccetti, G.; Scarsini, M., Multivariate comonotonicity, J. Multivar. Anal., 101, 4, 291-304, 2010 · Zbl 1184.62081
[32] Fuchs, S., Quantifying directed dependence via dimension reduction, J. Multivar. Anal., 201, Article 105266, 2023 · Zbl 07823263
[33] Sweeting, T. J., On conditional weak convergence, J. Theor. Probab., 2, 4, 461-474, 1989 · Zbl 0695.60036
[34] Cambanis, S.; Huang, S.; Simons, G., On the theory of elliptically contoured distributions, J. Multivar. Anal., 11, 368-385, 1981 · Zbl 0469.60019
[35] Fang, K.-T.; Kotz, S.; Ng, K.-W., Symmetric Multivariate and Related Distributions, 1990, Chapman and Hall, London · Zbl 0699.62048
[36] Karger, D. N.; Conrad, O.; Böhner, J.; Kawohl, T.; Kreft, H.; Soria-Auza, R. W.; Zimmermann, N. E.; Linder, H. P.; Kessler, M., Climatologies at high resolution for the Earth’s land surface areas, Sci. Data, 4, 1, Article 170122, 2017
[37] Pérez, A.; Prieto-Alaiz, M.; Chamizo, F.; Liebscher, E.; Úbeda Flores, M., Nonparametric estimation of the multivariate Spearman’s footrule: a further discussion, Fuzzy Sets Syst., 467, Article 108489, 2023 · Zbl 1543.62362
[38] Fuchs, S.; McCord, Y., On the lower bound of Spearman’s footrule, Depend. Model., 7, 121-129, 2019
[39] Kullback, S.; Leibler, R. A., On information and sufficiency, Ann. Math. Stat., 22, 1, 79-86, 1951 · Zbl 0042.38403
[40] Hansen, P.; Jaumard, B., Cluster analysis and mathematical programming, Math. Program., 79, 1, 191-215, 1997 · Zbl 0887.90182
[41] Kaufman, L., Finding Groups in Data, 1990, John Wiley & Sons · Zbl 1345.62009
[42] Fowlkes, E. B.; Mallows, C. L., A method for comparing two hierarchical clusterings, J. Am. Stat. Assoc., 78, 383, 553-569, 1983 · Zbl 0545.62042