
Non-asymptotic analysis and inference for an outlyingness induced winsorized mean. (English) Zbl 1527.62033

Summary: Robust estimation of a mean vector, a topic regarded as obsolete in the traditional robust statistics community, has surged in the machine learning literature over the last decade. The latest focus is on the sub-Gaussian performance and computability of estimators in a non-asymptotic setting. Many traditional robust estimators are computationally intractable, which partly explains the renewed interest in robust mean estimation. Robust centrality estimators, however, include the trimmed mean and the sample median; the latter has the best robustness but suffers from low efficiency. Trimmed-mean and median-of-means estimators achieving sub-Gaussian performance have been proposed and studied in the literature. This article investigates the robustness of leading sub-Gaussian mean estimators and reveals that none of them can resist more than 25% contamination in the data. It therefore introduces an outlyingness induced winsorized mean that has the best possible robustness (it can resist up to 50% contamination without breakdown) while achieving high efficiency. Furthermore, the new estimator has sub-Gaussian performance for uncontaminated samples and a bounded estimation error for contaminated samples at a given confidence level in a finite-sample setting, and it can be computed in linear time.
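
The summary describes the estimator only at a high level. As a rough univariate illustration of the winsorizing idea (a sketch in the spirit of the scaled-deviation winsorized means of Wu and Zuo [33], not the paper's exact construction), one may measure the outlyingness of each observation by its deviation from the median in MAD units and pull flagged points back to the nearest cutoff before averaging; the cutoff constant c below is a hypothetical tuning parameter. Both the median and the MAD can be obtained in linear time by selection, consistent with the linear-time claim.

import numpy as np

def outlyingness_winsorized_mean(x, c=3.0):
    # Illustrative univariate sketch: outlyingness taken as the scaled
    # deviation |x - median| / MAD; observations with outlyingness above
    # the (hypothetical) cutoff c are winsorized to the boundary.
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0.0:
        # degenerate sample: more than half of the points coincide
        return med
    lower, upper = med - c * mad, med + c * mad
    return np.clip(x, lower, upper).mean()

# usage: 20% contamination placed far from the bulk of the data
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(0.0, 1.0, 800), rng.normal(50.0, 1.0, 200)])
print(outlyingness_winsorized_mean(sample))   # stays close to 0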

MSC:

62G35 Nonparametric robustness
62G15 Nonparametric tolerance and confidence regions
62G05 Nonparametric estimation

References:

[1] Alon, N.; Matias, Y.; Szegedy, M., The space complexity of approximating the frequency moments, J Comput Syst Sci, 58, 137-147 (1999) · Zbl 0938.68153 · doi:10.1006/jcss.1997.1545
[2] Bernstein, SN, The theory of probabilities (1946), Moscow: Gostekhizdat Publishing House, Moscow
[3] Boucheron, S.; Lugosi, G.; Massart, P., Concentration inequalities: a nonasymptotic theory of independence (2013), Oxford: Oxford University Press, Oxford · Zbl 1279.60005 · doi:10.1093/acprof:oso/9780199535255.001.0001
[4] Catoni, O., Challenging the empirical mean and empirical variance: a deviation study, Ann Inst Henri Poincaré, Prob Stat, 48, 4, 1148-1185 (2012) · Zbl 1282.62070 · doi:10.1214/11-AIHP454
[5] Catoni O, Giulini I (2018) Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector. arXiv preprint arXiv:1802.04308
[6] Chen, M.; Gao, C.; Ren, Z., Robust covariance and scatter matrix estimation under Huber’s contamination model, Ann Stat, 46, 1932-1960 (2018) · Zbl 1408.62104 · doi:10.1214/17-AOS1607
[7] Davies, PL, Asymptotic behavior of S-estimators of multivariate location parameters and dispersion matrices, Ann Stat, 15, 1269-1292 (1987) · Zbl 0645.62057 · doi:10.1214/aos/1176350505
[8] Depersin J, Lecué G (2021) On the robustness to adversarial corruption and to heavy-tailed data of the Stahel-Donoho median of means. arXiv:2101.09117v1 · Zbl 1528.62019
[9] Diakonikolas I, Kane D (2019) Recent advances in algorithmic high-dimensional robust statistics. arXiv:1911.05911v1 · Zbl 07705538
[10] Donoho DL (1982) Breakdown properties of multivariate location estimators. Harvard University, PhD Qualifying paper
[11] Donoho, DL; Huber, PJ, The notion of breakdown point, in: Bickel, PJ; Doksum, KA; Hodges, JL (eds.), A festschrift for Erich L. Lehmann, 157-184 (1983), Wadsworth, Belmont · Zbl 0523.62032
[12] Hastie, T.; Tibshirani, R.; Wainwright, MJ, Statistical learning with sparsity: the lasso and generalizations (2015), Boca Raton: CRC Press, Boca Raton · Zbl 1319.68003 · doi:10.1201/b18401
[13] Hsu D (2010) Robust statistics. http://www.inherentuncertainty.org/2010/12/robust-statistics.html
[14] Hubert, M.; Rousseeuw, PJ; Van Aelst, S., High-breakdown robust multivariate methods, Stat Sci, 23, 1, 92-119 (2008) · Zbl 1327.62328 · doi:10.1214/088342307000000087
[15] Jerrum, M.; Valiant, L.; Vazirani, V., Random generation of combinatorial structures from a uniform distribution, Theor Comput Sci, 43, 169-188 (1986) · Zbl 0597.68056 · doi:10.1016/0304-3975(86)90174-X
[16] Lerasle M (2019) Selected topics on robust statistical learning theory, Lecture Notes. arXiv:1908.10761v1
[17] Lerasle M, Oliveira RI (2011) Robust empirical mean estimators. Preprint. Available at arXiv:1112.3914
[18] Liu, X., Approximating projection depth median of dimensions \(p \ge 3\), Commun Stat Simul Comput, 46, 3756-3768 (2017) · Zbl 1368.62054
[19] Lopuhaä, HP; Rousseeuw, PJ, Breakdown points of affine equivariant estimators of multivariate location and covariance matrices, Ann Stat, 19, 229-248 (1991) · Zbl 0733.62058 · doi:10.1214/aos/1176347978
[20] Lecué, G.; Lerasle, M., Robust machine learning by median-of-means: theory and practice, Ann Statist, 48, 906-931 (2020) · Zbl 1487.62034 · doi:10.1214/19-AOS1828
[21] Lugosi, G.; Mendelson, S., Mean estimation and regression under heavy-tailed distributions: a survey, Found Comput Math, 19, 1145-1190 (2019) · Zbl 1431.62123 · doi:10.1007/s10208-019-09427-x
[22] Lugosi, G.; Mendelson, S., Robust multivariate mean estimation: the optimality of trimmed mean, Ann Stat, 49, 1, 393-410 (2021) · Zbl 1461.62069 · doi:10.1214/20-AOS1961
[23] Nemirovsky AS, Yudin DB (1983) Problem complexity and method efficiency in optimization · Zbl 0501.90062
[24] Pauwels E (2020) Lecture notes: statistics, optimization and algorithms in high dimension. https://www.math.univ-toulouse.fr/~epauwels/M2RI/
[25] Rousseeuw, PJ, Least median of squares regression, J Am Stat Assoc, 79, 871-880 (1984) · Zbl 0547.62046 · doi:10.1080/01621459.1984.10477105
[26] Rousseeuw, PJ; Grossmann, W.; Pflug, G.; Vincze, I.; Wertz, W., Multivariate estimation with high breakdown point, Mathematical statistics and applications, 283-297 (1985), Dordrecht: Reidel, Dordrecht · Zbl 0609.62054 · doi:10.1007/978-94-009-5438-0_20
[27] Rousseeuw, PJ; Ruts, I., Constructing the bivariate Tukey median, Stat Sin, 8, 3, 827-839 (1998) · Zbl 0905.62029
[28] Rousseeuw PJ, Yohai VJ (1984) Robust regression by means of S-estimators. In Robust and nonlinear time series analysis. Lecture Notes in Statist. Springer, New York. 26:256-272 · Zbl 0567.62027
[29] Stahel WA (1981) Robuste Schätzungen: Infinitesimale Optimalität und Schätzungen von Kovarianzmatrizen. Ph.D. dissertation, ETH, Zürich · Zbl 0531.62036
[30] Sun, Q.; Zhou, WX; Fan, JQ, Adaptive Huber regression, J Am Stat Assoc, 115, 529, 254-265 (2020) · Zbl 1437.62250 · doi:10.1080/01621459.2018.1543124
[31] Weber A (1909) Über den Standort der Industrien, Tübingen. English translation: Alfred Weber's Theory of Location of Industries, University of Chicago Press, translated by Friedrich, C.J. (1929)
[32] Weng, H.; Maleki, A.; Zheng, L., Overcoming the limitations of phase transition by higher order analysis of regularization techniques, Ann Stat, 46, 6, 3099-3129 (2018) · Zbl 1411.62194 · doi:10.1214/17-AOS1651
[33] Wu, M.; Zuo, Y., Trimmed and Winsorized means based on a scaled deviation, J Stat Plann Inference, 139, 2, 350-365 (2009) · Zbl 1149.62047 · doi:10.1016/j.jspi.2008.03.039
[34] Zuo, Y., Projection-based depth functions and associated medians, Ann Stat, 31, 1460-1490 (2003) · Zbl 1046.62056 · doi:10.1214/aos/1065705115
[35] Zuo, Y., Robust location and scatter estimators in multivariate analysis, 467-490 (2006), London: Imperial College Press, London · Zbl 1119.62051
[36] Zuo, Y., Multi-dimensional trimming based on projection depth, Ann Stat, 34, 5, 2211-2251 (2006) · Zbl 1106.62057 · doi:10.1214/009053606000000713
[37] Zuo, Y., A new approach for the computation of halfspace depth in high dimensions, Commun Stat Simul Comput, 48, 3, 900-921 (2018) · Zbl 07551473 · doi:10.1080/03610918.2017.1402040
[38] Zuo, Y.; Serfling, R., General notions of statistical depth function, Ann Stat, 28, 461-482 (2000) · Zbl 1106.62334
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.