A new Gini correlation between quantitative and qualitative variables. (English) Zbl 07607024

Summary: We propose a new Gini correlation to measure dependence between a categorical variable and a numerical variable. Analogous to Pearson's \(R^2\) in the ANOVA model, the Gini correlation is interpreted as the ratio of the between-group variation to the total variation, but, unlike \(R^2\), it characterizes independence: the Gini correlation is zero if and only if the two variables are independent. The Gini correlation is closely related to the distance correlation but has a simpler formulation because it exploits the nature of the categorical variable. As a result, the proposed Gini correlation is simpler to compute than the distance correlation, and inference based on it is more straightforward. Simulations and real data applications demonstrate these advantages.
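
The between-group to total variation ratio described in the summary can be illustrated with a small sketch. The summary does not state the formula, so the code below rests on an assumption: variation is measured by Gini's mean difference \(\Delta = E|X_1 - X_2|\), and the correlation is \((\Delta - \sum_k p_k \Delta_k)/\Delta\), where \(p_k\) is the proportion of class \(k\) and \(\Delta_k\) is the mean difference within class \(k\). The function names `gini_mean_difference` and `gini_correlation` are hypothetical and do not come from the paper or from the energy package.

```python
from collections import defaultdict
from itertools import combinations


def gini_mean_difference(values):
    """Average absolute difference over all unordered pairs of values.

    Returns 0.0 when there are fewer than two values (no pairs to compare).
    """
    n = len(values)
    if n < 2:
        return 0.0
    pair_sum = sum(abs(a - b) for a, b in combinations(values, 2))
    return pair_sum / (n * (n - 1) / 2)


def gini_correlation(x, labels):
    """Sample ratio (total - within) / total, ASSUMING Gini mean differences
    as the measure of variation (an assumption, not quoted from the paper)."""
    x = list(x)
    labels = list(labels)
    if len(x) != len(labels):
        raise ValueError("x and labels must have the same length")

    # Split the numerical values by category.
    groups = defaultdict(list)
    for value, label in zip(x, labels):
        groups[label].append(value)

    total = gini_mean_difference(x)
    if total == 0:
        raise ValueError("x is constant, so the ratio is undefined")

    # Within-group variation, weighted by the observed class proportions.
    n = len(x)
    within = sum(len(g) / n * gini_mean_difference(g) for g in groups.values())
    return (total - within) / total


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
    labels = ["a", "a", "a", "b", "b", "b"]
    print(round(gini_correlation(x, labels), 4))  # prints 0.7959
```

In this toy data the two classes are well separated, so most of the total variation is between groups and the ratio is close to 1. The zero-if-and-only-if-independent property in the summary concerns the population quantity; a sample estimate computed this way can be slightly nonzero, or even negative, under independence.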

MSC:

62-XX Statistics

Software:

UCI-ml; energy

References:

[1] Baringhaus, L., & Franz, C. (2004). On a new multivariate two‐sample test. Journal of Multivariate Analysis, 88, 190-206. · Zbl 1035.62052
[2] Beknazaryan, A., Dang, X., & Sang, H. (2019). On mutual information estimation for mixed‐pair random variables. Statistics & Probability Letters, 148, 9-16. · Zbl 1407.62114
[3] Cramér, H. (1946). Mathematical methods of statistics. Princeton, NJ: Princeton University Press. · Zbl 0063.01014
[4] Cui, H., Li, R., & Zhong, W. (2015). Model‐free feature screening for ultrahigh dimensional discriminant analysis. Journal of the American Statistical Association, 110, 630-641. · Zbl 1373.62305
[5] Cui, H., & Zhong, W. (2019). A distribution‐free test of independence based on mean variance index. Computational Statistics & Data Analysis, 139, 117-133. · Zbl 1507.62039
[6] Dang, X., Sang, H., & Weatherall, L. (2019). Gini covariance matrix and its affine equivariant version. Statistical Papers, 60(3), 291-316. · Zbl 1419.62129
[7] David, H. A. (1968). Gini’s mean difference rediscovered. Biometrika, 55, 573-575. · Zbl 0177.46501
[8] Dorfman, R. (1979). A formula for the Gini coefficient. Review of Economics and Statistics, 61, 146-149.
[9] Dua, D., & Graff, C. (2019). UCI machine learning repository. Irvine, CA: University of California, School of Information and Computer Science. Retrieved from http://archive.ics.uci.edu/ml
[10] Edelmann, D., Richards, D., & Vogel, D. (2017). The distance standard deviation. arXiv:1705.05777v1.
[11] Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70, 849-911. · Zbl 1411.62187
[12] Frick, R., Goebel, J., Schechtman, E., Wagner, G., & Yitzhaki, S. (2006). Using analysis of Gini (ANOGI) for detecting whether two sub‐samples represent the same universe: The German socio‐economic panel study (SOEP) experience. Sociological Methods and Research, 34, 427-468.
[13] Gao, W., Kannan, S., Oh, S., & Viswanath, P. (2017). Estimating mutual information for discrete‐continuous mixtures. Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA.
[14] Gini, C. (1914). Sulla misura della concentrazione e della variabilità dei caratteri. Atti del Reale Istituto Veneto di Scienze, Lettere ed Arti, 62, 1203-1248. English translation: On the measurement of concentration and variability of characters (2005). Metron, LXIII(1), 3‐38.
[15] Goldman, M., Craft, B., Brooks, A. N., Zhu, J., & Haussler, D. (2018). The UCSC Xena platform for cancer genomics data visualization and interpretation. bioRxiv.
[16] Hu, B., Shao, J., & Palta, M. (2006). Pseudo‐\(R^2\) in logistic regression model. Statistica Sinica, 16, 847-860. · Zbl 1107.62055
[17] Huo, X., & Székely, G. (2016). Fast computing for distance covariance. Technometrics, 58(4), 435-447.
[18] Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2), 81-93.
[19] Kendall, M. G., & Gibbons, J. D. (1990). Rank correlation methods (5th ed.). London, UK: Griffin. · Zbl 0732.62057
[20] Koshevoy, G., & Mosler, K. (1997). Multivariate Gini indices. Journal of Multivariate Analysis, 60, 252-276. · Zbl 0873.62062
[21] Li, R., Zhong, W., & Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107, 1129-1139. · Zbl 1443.62184
[22] Lyons, R. (2013). Distance covariance in metric spaces. The Annals of Probability, 41(5), 3284-3305. · Zbl 1292.62087
[23] Mari, D. D., & Kotz, S. (2001). Correlation and dependence. London, UK: Imperial College Press. · Zbl 0977.62004
[24] Mercer, J. (1909). Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 209, 415-446. · JFM 40.0408.02
[25] Parker, J. S. (2009). Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of Clinical Oncology, 27(8), 1160-1167.
[26] Renyi, A. (1959). On measures of dependence. Acta Mathematica Academiae Scientiarum Hungarica, 10, 441-451. · Zbl 0091.14403
[27] Rizzo, M. L., & Székely, G. J. (2010). DISCO analysis: A nonparametric extension of analysis of variance. The Annals of Applied Statistics, 4(2), 1034-1055. · Zbl 1194.62054
[28] Ross, B. C. (2014). Mutual information between discrete and continuous data sets. PLoS One, 9(2), e87357. https://doi.org/10.1371/journal.pone.0087357 · doi:10.1371/journal.pone.0087357
[29] Sang, Y., Dang, X., & Sang, H. (2016). Symmetric Gini covariance and correlation. Canadian Journal of Statistics, 44(3), 323-342. · Zbl 1357.62217
[30] Sarmanov, O. V. (1958). Maximum correlation coefficient (symmetric case). Doklady Akad Nauk SSSR, 120, 715-718. · Zbl 0089.36102
[31] Schechtman, E., & Yitzhaki, S. (1987). A measure of association based on Gini’s mean difference. Communications in statistics‐Theory and Methods, 16(1), 207-231. · Zbl 0617.62061
[32] Schechtman, E., & Yitzhaki, S. (2003). A Family of correlation coefficients based on the extended Gini index. The Journal of Economic Inequality, 1(2), 129-146.
[33] Serfling, R. (1980). Approximation theorems of mathematical statistics. New York, NY: Wiley. · Zbl 0538.62002
[34] Shao, J., & Tu, D. (1996). The jackknife and bootstrap. New York, NY: Springer.
[35] Shevlyakov, G. L., & Oja, H. (2016). Robust correlation: Theory and applications. Chichester: Wiley. · Zbl 1381.62007
[36] Shevlyakov, G. L., & Smirnov, P. O. (2011). Robust estimation of the correlation coefficient: An attempt of survey. Austrian Journal of Statistics, 40, 147-156.
[37] Spearman, C. (1904). General intelligence objectively determined and measured. The American Journal of Psychology, 15, 201-293.
[38] Székely, G. J., & Rizzo, M. L. (2004). Testing for equal distributions in high dimension. InterStat, Nov(5), 1-16.
[39] Székely, G. J., & Rizzo, M. L. (2005). Hierarchical clustering via joint between‐within distances: Extending Ward’s minimum variance method. Journal of Classification, 22(2), 151-183. · Zbl 1336.62192
[40] Székely, G. J., & Rizzo, M. L. (2009). Brownian distance covariance. The Annals of Applied Statistics, 3(4), 1233-1303. · Zbl 1284.62347
[41] Székely, G. J., & Rizzo, M. L. (2013). Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8), 1249-1272. · Zbl 1278.62072
[42] Székely, G. J., & Rizzo, M. L. (2017). The energy of data. The Annual Review of Statistics and Its Application, 4(1), 447-479.
[43] Székely, G. J., Rizzo, M. L., & Bakirov, N. (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6), 2769-2794. · Zbl 1129.62059
[44] Tjur, T. (2009). Coefficients of determination in logistic regression models? A new proposal: The coefficient of discrimination. The American Statistician, 63(4), 366-372. · Zbl 1182.62149
[45] Tsanas, A., Little, M. A., Fox, C., & Ramig, L. O. (2014). Objective automatic assessment of rehabilitative speech treatment in Parkinson’s disease. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 22, 181-191.
[46] Tschuprow, A. (1939). Principles of the mathematical theory of correlation. England: W. Hodge & Co. · Zbl 0022.24801
[47] Yitzhaki, S., & Schechtman, E. (2013). The Gini methodology. New York, NY: Springer. · Zbl 1292.62013
[48] Zhang, S., Dang, X., Nguyen, D., Wilkins, D., & Chen, Y. (2019). Estimating feature‐label dependence using Gini distance statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2019.2960358 · doi:10.1109/TPAMI.2019.2960358
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.