Cellwise outlier detection with false discovery rate control. (English. French summary) Zbl 07759493

Summary: This article is concerned with detecting cellwise outliers in large data matrices. We introduce a novel method that fully exploits dependence structures among variables while controlling the false discovery rate (FDR). We recast cellwise outlier identification as a high-dimensional variable selection problem and construct “binate references” for data screening, estimation, and information pooling. Using the binate references, the proposed procedure forms a series of statistics that incorporate covariance information, and it exploits a global symmetry property of these statistics to approximate the false discovery proportion. We show that the proposed method controls the asymptotic FDR under mild conditions. Extensive numerical studies demonstrate that our method achieves reasonable FDR control and satisfactory power in comparison with existing methods.
{© 2021 Statistical Society of Canada}
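The core idea of using a symmetry property to approximate the false discovery proportion can be illustrated with a generic sketch. The snippet below is not the authors' procedure: it shows the standard symmetry-based thresholding device (as in knockoff-style filters), where signed statistics `W` are roughly symmetric about zero for null cells, so the count of large negative values estimates the number of false positives among the large positive ones. The function name and the simulated data are illustrative assumptions.

```python
import numpy as np

def symmetry_fdp_threshold(W, q=0.1):
    """Smallest threshold t whose symmetry-based FDP estimate
    (1 + #{W_j <= -t}) / max(#{W_j >= t}, 1) is at most q.
    Nulls are assumed symmetric about zero, so the left tail
    proxies the number of false discoveries in the right tail."""
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return t
    return np.inf  # no threshold achieves the target level

# Illustrative data: 900 null statistics, 100 shifted "outlier" statistics.
rng = np.random.default_rng(0)
W = np.concatenate([rng.normal(0.0, 1.0, 900), rng.normal(3.0, 1.0, 100)])
t = symmetry_fdp_threshold(W, q=0.2)
selected = np.where(W >= t)[0]  # indices flagged as outlying cells
```

In the paper's setting the statistics additionally incorporate covariance information via the binate references; the thresholding step above only sketches how global symmetry yields an FDP approximation.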

MSC:

62-XX Statistics
