×

Optimal nonparametric testing of missing completely at random and its connections to compatibility. (English) Zbl 1539.62121

Summary: Given a set of incomplete observations, we study the nonparametric problem of testing whether data are Missing Completely At Random (MCAR). Our first contribution is to characterise precisely the set of alternatives that can be distinguished from the MCAR null hypothesis. This reveals interesting and novel links to the theory of Fréchet classes (in particular, compatible distributions) and linear programming, that allow us to propose MCAR tests that are consistent against all detectable alternatives. We define an incompatibility index as a natural measure of ease of detectability, establish its key properties and show how it can be computed exactly in some cases and bounded in others. Moreover, we prove that our tests can attain the minimax separation rate according to this measure, up to logarithmic factors. Our methodology does not require any complete cases to be effective, and is available in the R package MCARtest.

MSC:

62G10 Nonparametric hypothesis testing
62D10 Missing data

Software:

rcdd; MCARtest; Gurobi

References:

[1] ABRAMSKY, S. (2017). Contextuality: At the borders of paradox. In Categories for the Working Philosopher 262-285. Oxford Univ. Press, Oxford. MathSciNet: MR3822113 · Zbl 1497.81005
[2] ABRAMSKY, S. and BRANDENBURGER, A. (2011). The sheaf-theoretic structure of non-locality and contextuality. New J. Phys. 13 113036. · Zbl 1448.81028
[3] AHUJA, R. K., MAGNANTI, T. L. and ORLIN, J. B. (1989). Network flows. In Optimization. Handbooks Oper. Res. Management Sci. 1 211-369. North-Holland, Amsterdam. Digital Object Identifier: 10.1016/S0927-0507(89)01005-4 Google Scholar: Lookup Link MathSciNet: MR1105103 · doi:10.1016/S0927-0507(89)01005-4
[4] ALEXANDROFF, P. (1924). Über die Metrisation der im Kleinen kompakten topologischen Räume. Math. Ann. 92 294-301. Digital Object Identifier: 10.1007/BF01448011 Google Scholar: Lookup Link MathSciNet: MR1512216 · JFM 50.0128.04 · doi:10.1007/BF01448011
[5] BELL, J. S. (1966). On the problem of hidden variables in quantum mechanics. Rev. Modern Phys. 38 447-452. Digital Object Identifier: 10.1103/RevModPhys.38.447 Google Scholar: Lookup Link MathSciNet: MR0208927 · Zbl 0152.23605 · doi:10.1103/RevModPhys.38.447
[6] BERRETT, T. B. and SAMWORTH, R. J. (2022). MCARtest: Optimal nonparametric testing of Missing Completely At Random. R package version 1.1. Available at https://cran.r-project.org/web/packages/MCARtest/index.html.
[7] BERRETT, T. B. and SAMWORTH, R. J. (2023). Supplement to “Optimal nonparametric testing of Missing Completely At Random and its connections to compatibility.” https://doi.org/10.1214/23-AOS2326SUPP
[8] BLANCHARD, G., CARPENTIER, A. and GUTZEIT, M. (2018). Minimax Euclidean separation rates for testing convex hypotheses in \(\mathbb{R}^{\mathit{d}} \). Electron. J. Stat. 12 3713-3735. Digital Object Identifier: 10.1214/18-ejs1472 Google Scholar: Lookup Link MathSciNet: MR3873534 · Zbl 1402.60126 · doi:10.1214/18-ejs1472
[9] Cai, T. T. and Zhang, L. (2019). High dimensional linear discriminant analysis: Optimality, adaptive algorithm and missing data. J. R. Stat. Soc. Ser. B. Stat. Methodol. 81 675-705. MathSciNet: MR3997097 · Zbl 1428.62267
[10] CHEN, H. Y. and LITTLE, R. (1999). A test of Missing Completely At Random for generalised estimating equations with missing data. Biometrika 86 1-13. Digital Object Identifier: 10.1093/biomet/86.1.1 Google Scholar: Lookup Link MathSciNet: MR1688067 · Zbl 1101.62314 · doi:10.1093/biomet/86.1.1
[11] CLAUSER, J. F. and SHIMONY, A. (1978). Bell’s theorem. Experimental tests and implications. Rep. Progr. Phys. 41 1881.
[12] COONS, J. I., CUMMINGS, J., HOLLERING, B. and MARAJ, A. (2020). Generalized cut polytopes for binary hierarchical models. Algeb. Stat. To appear.
[13] DALL’AGLIO, G., KOTZ, S. and SALINETTI, G. (2012). Advances in Probability Distributions with Given Marginals: Beyond the Copulas. Springer, Berlin. · Zbl 0722.00031
[14] DAVISON, A. C. (2003). Statistical Models. Cambridge Series in Statistical and Probabilistic Mathematics 11. Cambridge Univ. Press, Cambridge. Digital Object Identifier: 10.1017/CBO9780511815850 Google Scholar: Lookup Link MathSciNet: MR1998913 · Zbl 1145.62001 · doi:10.1017/CBO9780511815850
[15] DE LOERA, J. A. and KIM, E. D. (2014). Combinatorics and geometry of transportation polytopes: An update. In Discrete Geometry and Algebraic Combinatorics. Contemp. Math. 625 37-76. Amer. Math. Soc., Providence, RI. Digital Object Identifier: 10.1090/conm/625/12491 Google Scholar: Lookup Link MathSciNet: MR3289405 · Zbl 1360.90169 · doi:10.1090/conm/625/12491
[16] DEZA, M. M. and LAURENT, M. (2010). Geometry of Cuts and Metrics. Algorithms and Combinatorics 15. Springer, Heidelberg. Digital Object Identifier: 10.1007/978-3-642-04295-9 Google Scholar: Lookup Link MathSciNet: MR2841334 · Zbl 1210.52001 · doi:10.1007/978-3-642-04295-9
[17] Dudley, R. M. (2002). Real Analysis and Probability. Cambridge Studies in Advanced Mathematics 74. Cambridge Univ. Press, Cambridge. Digital Object Identifier: 10.1017/CBO9780511755347 Google Scholar: Lookup Link MathSciNet: MR1932358 · Zbl 1023.60001 · doi:10.1017/CBO9780511755347
[18] ELSENER, A. and VAN DE GEER, S. (2019). Sparse spectral estimation with missing and corrupted measurements. Stat 8 e229. Digital Object Identifier: 10.1002/sta4.229 Google Scholar: Lookup Link MathSciNet: MR3978409 · Zbl 07851116 · doi:10.1002/sta4.229
[19] EMBRECHTS, P. and PUCCETTI, G. (2010). Bounds for the sum of dependent risks having overlapping marginals. J. Multivariate Anal. 101 177-190. Digital Object Identifier: 10.1016/j.jmva.2009.07.004 Google Scholar: Lookup Link MathSciNet: MR2557627 · Zbl 1177.60022 · doi:10.1016/j.jmva.2009.07.004
[20] ERIKSSON, N., FIENBERG, S. E., RINALDO, A. and SULLIVANT, S. (2006). Polyhedral conditions for the nonexistence of the MLE for hierarchical log-linear models. J. Symbolic Comput. 41 222-233. Digital Object Identifier: 10.1016/j.jsc.2005.04.003 Google Scholar: Lookup Link MathSciNet: MR2197157 · Zbl 1120.62015 · doi:10.1016/j.jsc.2005.04.003
[21] FARKAS, J. (1902). Theorie der einfachen Ungleichungen. J. Reine Angew. Math. 124 1-27. Digital Object Identifier: 10.1515/crll.1902.124.1 Google Scholar: Lookup Link MathSciNet: MR1580578 · JFM 32.0169.02 · doi:10.1515/crll.1902.124.1
[22] FIENBERG, S. E. (1968). The geometry of an \(\mathit{r} \times \mathit{c}\) contingency table. Ann. Math. Stat. 39 1186-1190. Digital Object Identifier: 10.1214/aoms/1177698242 Google Scholar: Lookup Link MathSciNet: MR0232525 · Zbl 0162.22104 · doi:10.1214/aoms/1177698242
[23] FOLLAIN, B., WANG, T. and SAMWORTH, R. J. (2022). High-dimensional changepoint estimation with heterogeneous missingness. J. R. Stat. Soc. Ser. B. Stat. Methodol. 84 1023-1055. MathSciNet: MR4460584 · Zbl 07909604
[24] FUCHS, C. (1982). Maximum likelihood estimation and model selection in contingency tables with missing data. J. Amer. Statist. Assoc. 77 270-278.
[25] GALE, D. (1957). A theorem on flows in networks. Pacific J. Math. 7 1073-1082. MathSciNet: MR0091855 · Zbl 0087.16303
[26] GEYER, C. J. and MEEDEN, G. D. (2021). rcdd: Computational Geometryats. R package version 1.5. Available at https://cran.r-project.org/web/packages/rcdd/index.html.
[27] GUROBI OPTIMIZATION, LLC (2021). Gurobi Optimizer Reference Manual.
[28] HOŞTEN, S. and SULLIVANT, S. (2002). Gröbner bases and polyhedral geometry of reducible and cyclic models. J. Combin. Theory Ser. A 100 277-301. Digital Object Identifier: 10.1006/jcta.2002.3301 Google Scholar: Lookup Link MathSciNet: MR1940337 · Zbl 1044.62065 · doi:10.1006/jcta.2002.3301
[29] ISII, K. (1964). Inequalities of the types of Chebyshev and Cramér-Rao and mathematical programming. Ann. Inst. Statist. Math. 16 277-293. Digital Object Identifier: 10.1007/BF02868576 Google Scholar: Lookup Link MathSciNet: MR0176836 · Zbl 0136.13905 · doi:10.1007/BF02868576
[30] JAMSHIDIAN, M. and JALAL, S. (2010). Tests of homoscedasticity, normality, and Missing Completely At Random for incomplete multivariate data. Psychometrika 75 649-674. Digital Object Identifier: 10.1007/s11336-010-9175-3 Google Scholar: Lookup Link MathSciNet: MR2741492 · Zbl 1208.62095 · doi:10.1007/s11336-010-9175-3
[31] JIAO, J., HAN, Y. and WEISSMAN, T. (2018). Minimax estimation of the \(\mathit{L}_1\) distance. IEEE Trans. Inf. Theory 64 6672-6706. Digital Object Identifier: 10.1109/TIT.2018.2846245 Google Scholar: Lookup Link MathSciNet: MR3860754 · Zbl 1401.94040 · doi:10.1109/TIT.2018.2846245
[32] JOE, H. (1997). Multivariate Models and Dependence Concepts. Monographs on Statistics and Applied Probability 73. CRC Press, London. Digital Object Identifier: 10.1201/b13150 Google Scholar: Lookup Link MathSciNet: MR1462613 · doi:10.1201/b13150
[33] KANTOROVICH, L. V. (2004). On mass transportation. Zap. Nauchn. Sem. S.-Peterburg. Otdel. Mat. Inst. Steklov. (POMI) 312 11-14. Digital Object Identifier: 10.1007/s10958-006-0049-2 Google Scholar: Lookup Link MathSciNet: MR2117876 · Zbl 1080.49507 · doi:10.1007/s10958-006-0049-2
[34] Kantorovitch, L. (1942). On the translocation of masses. C. R. (Dokl.) Acad. Sci. URSS 37 199-201. MathSciNet: MR0009619 · Zbl 0061.09705
[35] KELLERER, H. G. (1984). Duality theorems for marginal problems. Z. Wahrsch. Verw. Gebiete 67 399-432. Digital Object Identifier: 10.1007/BF00532047 Google Scholar: Lookup Link MathSciNet: MR0761565 · Zbl 0535.60002 · doi:10.1007/BF00532047
[36] KIM, K. H. and BENTLER, P. M. (2002). Tests of homogeneity of means and covariance matrices for multivariate incomplete data. Psychometrika 67 609-623. Digital Object Identifier: 10.1007/BF02295134 Google Scholar: Lookup Link MathSciNet: MR2227891 · Zbl 1297.62233 · doi:10.1007/BF02295134
[37] LAURITZEN, S. L., SPEED, T. P. and VIJAYAN, K. (1984). Decomposable graphs and hypergraphs. J. Aust. Math. Soc. A 36 12-29. MathSciNet: MR0719998 · Zbl 0533.05046
[38] LAURITZEN, S. L. and SPIEGELHALTER, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. J. Roy. Statist. Soc. Ser. B 50 157-224. MathSciNet: MR0964177 · Zbl 0684.68106
[39] LEIGHTON, T. and RAO, S. (1999). Multicommodity max-flow min-cut theorems and their use in designing approximation algorithms. J. ACM 46 787-832. Digital Object Identifier: 10.1145/331524.331526 Google Scholar: Lookup Link MathSciNet: MR1753034 · Zbl 1065.68666 · doi:10.1145/331524.331526
[40] LI, J. and YU, Y. (2015). A nonparametric test of missing completely at random for incomplete multivariate data. Psychometrika 80 707-726. Digital Object Identifier: 10.1007/s11336-014-9410-4 Google Scholar: Lookup Link MathSciNet: MR3392026 · Zbl 1323.62116 · doi:10.1007/s11336-014-9410-4
[41] LITTLE, R. J. A. (1988). A test of Missing Completely At Random for multivariate data with missing values. J. Amer. Statist. Assoc. 83 1198-1202. MathSciNet: MR0997603
[42] Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data, 2nd ed. Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ. Digital Object Identifier: 10.1002/9781119013563 Google Scholar: Lookup Link MathSciNet: MR1925014 · Zbl 1011.62004 · doi:10.1002/9781119013563
[43] LOH, P.-L. and TAN, X. L. (2018). High-dimensional robust precision matrix estimation: Cellwise corruption under \(ϵ\)-contamination. Electron. J. Stat. 12 1429-1467. Digital Object Identifier: 10.1214/18-EJS1427 Google Scholar: Lookup Link MathSciNet: MR3804842 · Zbl 1412.62057 · doi:10.1214/18-EJS1427
[44] Loh, P.-L. and Wainwright, M. J. (2012). High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. Ann. Statist. 40 1637-1664. Digital Object Identifier: 10.1214/12-AOS1018 Google Scholar: Lookup Link MathSciNet: MR3015038 · Zbl 1257.62063 · doi:10.1214/12-AOS1018
[45] MAIER, D. (1983). The Theory of Relational Databases. Computer Software Engineering Series. Computer Science Press, Rockville, MD. MathSciNet: MR0691493 · Zbl 0519.68082
[46] McMullen, P. (1970). The maximum numbers of faces of a convex polytope. Mathematika 17 179-184. Digital Object Identifier: 10.1112/S0025579300002850 Google Scholar: Lookup Link MathSciNet: MR0283691 · Zbl 0217.46703 · doi:10.1112/S0025579300002850
[47] Nelsen, R. B. (2006). An Introduction to Copulas, 2nd ed. Springer Series in Statistics. Springer, New York. Digital Object Identifier: 10.1007/s11229-005-3715-x Google Scholar: Lookup Link MathSciNet: MR2197664 · Zbl 1152.62030 · doi:10.1007/s11229-005-3715-x
[48] QU, A. and SONG, P. X.-K. (2002). Testing ignorable missingness in estimating equation approaches for longitudinal data. Biometrika 89 841-850. Digital Object Identifier: 10.1093/biomet/89.4.841 Google Scholar: Lookup Link MathSciNet: MR1946514 · Zbl 1107.62327 · doi:10.1093/biomet/89.4.841
[49] REEVE, H. W., CANNINGS, T. I. and SAMWORTH, R. J. (2021). Optimal subgroup selection. Ann. Statist. To appear.
[50] ROCKAFELLAR, R. T. (1997). Convex Analysis. Princeton Landmarks in Mathematics. Princeton Univ. Press, Princeton, NJ. MathSciNet: MR1451876 · Zbl 0932.90001
[51] RÜSCHENDORF, L. (2013). Mathematical Risk Analysis: Dependence, risk bounds, optimal allocations and portfolios. Springer Series in Operations Research and Financial Engineering. Springer, Heidelberg. Digital Object Identifier: 10.1007/978-3-642-33590-7 Google Scholar: Lookup Link MathSciNet: MR3051756 · Zbl 1266.91001 · doi:10.1007/978-3-642-33590-7
[52] SPOHN, M.-L., NÄF, J., MICHEL, L. and MEINSHAUSEN, N. (2021). PKLM: A flexible MCAR test using classification. ArXiv preprint. Available at arXiv:2109.10150.
[53] VLACH, M. (1986). Conditions for the existence of solutions of the three-dimensional planar transportation problem. Discrete Appl. Math. 13 61-78. Digital Object Identifier: 10.1016/0166-218X(86)90069-7 Google Scholar: Lookup Link MathSciNet: MR0829339 · Zbl 0601.90105 · doi:10.1016/0166-218X(86)90069-7
[54] VOROBEV, N. N. (1962). Consistent families of measures and their extensions. Theory Probab. Appl. 7 147-163.
[55] WAINWRIGHT, M. J. and JORDAN, M. I. (2003). Variational inference in graphical models: The view from the marginal polytope. In Proceedings of the Annual Allerton Conference on Communication Control and Computing 41 961-971.
[56] WAINWRIGHT, M. J. and JORDAN, M. I. (2008). Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc., Hanover, MA. · Zbl 1193.62107
[57] WEI, Y., WAINWRIGHT, M. J. and GUNTUBOYINA, A. (2019). The geometry of hypothesis testing over convex cones: Generalized likelihood ratio tests and minimax radii. Ann. Statist. 47 994-1024. Digital Object Identifier: 10.1214/18-AOS1701 Google Scholar: Lookup Link MathSciNet: MR3909958 · Zbl 1415.62006 · doi:10.1214/18-AOS1701
[58] Wu, Y. and Yang, P. (2016). Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Trans. Inf. Theory 62 3702-3720. Digital Object Identifier: 10.1109/TIT.2016.2548468 Google Scholar: Lookup Link MathSciNet: MR3506758 · Zbl 1359.94375 · doi:10.1109/TIT.2016.2548468
[59] ZHU, Z., WANG, T. and SAMWORTH, R. J. (2022). High-dimensional principal component analysis with heterogeneous missingness. J. R. Stat. Soc. Ser. B. Stat. Methodol. 84 2000-2031. MathSciNet: MR4515564 · Zbl 07686605
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.