Gini correlation for feature screening. (English) Zbl 1471.62386

Summary: In this paper we propose the Gini correlation screening (GCS) method for selecting important variables in ultrahigh-dimensional data. The new procedure is based on the Gini correlation coefficient, defined through the covariance between the response and the ranks of the predictor variables, rather than on the Pearson correlation or the Kendall \(\tau\) correlation coefficient. The new method does not impose a specific model structure on the regression functions; it requires only that the predictors and the response have continuous distribution functions. We demonstrate that, when the number of predictors grows at an exponential rate in the sample size, the proposed procedure possesses consistency in ranking, which is useful in its own right and can also lead to consistency in selection. The procedure is simple and computationally efficient, and it exhibits competent empirical performance in our extensive simulations and real data analysis.
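The screening idea described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the summary does not state the exact estimator or the cutoff rule, so the sketch assumes the sample Gini correlation in the Schechtman-Yitzhaki form, cov(y, rank(x)) / cov(y, rank(y)), and simply keeps the `d` predictors with the largest absolute value. The function names (`ranks`, `gini_correlation`, `gini_screen`), the cutoff `d`, and the simulated data are all hypothetical choices made for illustration.

```python
import numpy as np


def ranks(v):
    """Ordinal ranks 1..n (ties are not expected for continuous data)."""
    order = np.argsort(v, kind="stable")
    r = np.empty(len(v), dtype=float)
    r[order] = np.arange(1, len(v) + 1)
    return r


def gini_correlation(y, x):
    """Assumed sample Gini correlation: cov(y, rank(x)) / cov(y, rank(y))."""
    yc = y - y.mean()
    # Because yc sums to zero, each dot product is proportional to a
    # sample covariance, and the proportionality constant cancels.
    num = np.dot(yc, ranks(x))
    den = np.dot(yc, ranks(y))
    return num / den


def gini_screen(X, y, d):
    """Return the indices of the d columns of X with the largest
    |Gini correlation| with y, together with all scores."""
    scores = np.array(
        [abs(gini_correlation(y, X[:, j])) for j in range(X.shape[1])]
    )
    keep = np.argsort(-scores)[:d]
    return keep, scores


# Simulated data (an assumption for illustration): far more predictors
# than observations, with only the first two predictors active.
rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] + 2.0 * np.exp(X[:, 1]) + rng.standard_normal(n)

keep, scores = gini_screen(X, y, d=20)
print("retained predictor indices:", sorted(keep.tolist()))
```

Only the predictors enter through their ranks, so the score is unchanged under any strictly increasing transformation of a predictor, which is one reason no regression model has to be specified. Each predictor costs one sort, so the whole screen takes on the order of p n log n operations.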

MSC:

62H20 Measures of association (correlation, canonical correlation, etc.)
62R07 Statistical aspects of big data and data science
62P10 Applications of statistics to biology and medical sciences; meta analysis

Software:

ElemStatLearn
