Multivariate goodness-of-fit tests based on Wasserstein distance. (English) Zbl 1471.62379

Summary: Goodness-of-fit tests based on the empirical Wasserstein distance are proposed for simple and composite null hypotheses involving general multivariate distributions. For group families, the procedure is to be implemented after preliminary reduction of the data via invariance. This property allows for calculation of exact critical values and \(p\)-values at finite sample sizes. Applications include testing for location-scale families and testing for families arising from affine transformations, such as elliptical distributions with given standard radial density and unspecified location vector and scatter matrix. A novel test for multivariate normality with unspecified mean vector and covariance matrix arises as a special case. For more general parametric families, we propose a parametric bootstrap procedure to calculate critical values. The lack of asymptotic distribution theory for the empirical Wasserstein distance means that the validity of the parametric bootstrap under the null hypothesis remains a conjecture. Nevertheless, we show that the test is consistent against fixed alternatives. To this end, we prove a uniform law of large numbers for the empirical distribution in Wasserstein distance, where the uniformity is over any class of underlying distributions satisfying a uniform integrability condition but no additional moment assumptions. The calculation of test statistics boils down to solving the well-studied semi-discrete optimal transport problem. Extensive numerical experiments demonstrate the practical feasibility and the excellent performance of the proposed tests for the Wasserstein distance of order \(p=1\) and \(p=2\) and for dimensions at least up to \(d=5\). The simulations also lend support to the conjecture of the asymptotic validity of the parametric bootstrap.
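As an illustration of the invariance idea in the summary, the following is a minimal univariate sketch, not the authors' implementation. It tests normality with unspecified mean and variance: the data are standardized first, so the null distribution of the statistic does not depend on the unknown parameters and critical values can be simulated from standard normal samples. The function names, the midpoint-grid approximation of the quantile integral, and the restriction to dimension one are assumptions made for this example. The multivariate tests in the paper require a semi-discrete optimal transport solver instead.

```python
import random
from statistics import NormalDist, fmean, pstdev


def normal_quantile_grid(grid=1000):
    """Standard normal quantiles at the midpoints (k + 0.5) / grid."""
    inv = NormalDist().inv_cdf
    return [inv((k + 0.5) / grid) for k in range(grid)]


def w2_to_reference(sample, ref_quantiles):
    """Approximate W_2 between the empirical measure of `sample` and the
    reference law whose quantiles are tabulated in `ref_quantiles`.

    In dimension one, W_2^2 is the integral over (0, 1) of the squared
    difference of the two quantile functions; a midpoint rule is used here.
    """
    xs = sorted(sample)
    n, g = len(xs), len(ref_quantiles)
    total = 0.0
    for k, q in enumerate(ref_quantiles):
        u = (k + 0.5) / g
        emp_q = xs[min(int(u * n), n - 1)]  # empirical quantile at level u
        total += (emp_q - q) ** 2
    return (total / g) ** 0.5


def standardize(sample):
    """Location-scale reduction; assumes the sample is not constant."""
    m = fmean(sample)
    s = pstdev(sample, mu=m)
    return [(x - m) / s for x in sample]


def wasserstein_normality_test(sample, n_mc=500, grid=1000, seed=0):
    """Return (statistic, Monte Carlo p-value) for H0: the data are normal
    with unspecified mean and variance."""
    rng = random.Random(seed)
    ref = normal_quantile_grid(grid)
    n = len(sample)
    stat = w2_to_reference(standardize(sample), ref)
    # Under H0 the standardized statistic has the same law for every mean
    # and variance, so simulating from N(0, 1) is enough.
    null_stats = [
        w2_to_reference(standardize([rng.gauss(0.0, 1.0) for _ in range(n)]), ref)
        for _ in range(n_mc)
    ]
    p_value = (1 + sum(t >= stat for t in null_stats)) / (n_mc + 1)
    return stat, p_value


if __name__ == "__main__":
    data_rng = random.Random(42)
    skewed = [data_rng.expovariate(1.0) for _ in range(100)]
    print(wasserstein_normality_test(skewed))
```

Because the statistic is computed on standardized data, applying any affine map \(x \mapsto ax + b\) with \(a > 0\) to the sample leaves it unchanged up to rounding. This is the property that makes the simulated critical values valid at every finite sample size, up to Monte Carlo error.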

MSC:

62H15 Hypothesis testing in multivariate analysis
62H05 Characterization and structure theory for multivariate probability distributions; copulas

Software:

R; sn; copula; transport; MVN

References:

[1] Ambrosio, L., Stra, F. and Trevisan, D. (2018). A PDE approach to a 2-dimensional matching problem. Probability Theory and Related Fields 1-45.
[2] Azzalini, A. (2014). The Skew-Normal and Related Families. Institute of Mathematical Statistics (IMS) Monographs 3. Cambridge University Press, Cambridge. With the collaboration of Antonella Capitanio. · Zbl 1338.62007
[3] Azzalini, A. (2020). The R package sn: The Skew-Normal and Related Distributions such as the Skew-\(t\). Università di Padova, Italia.
[4] Bakshaev, A. and Rudzkis, R. (2015). Multivariate goodness-of-fit tests based on kernel density estimators. Nonlinear Analysis. Modelling and Control 20 585-602. · Zbl 1420.62156
[5] Beirlant, J., Goegebeur, Y., Segers, J., Teugels, J. L., De Waal, D. and Ferro, C. (2004). Statistics of Extremes: Theory and Applications. Wiley. · Zbl 1070.62036
[6] Beran, R. (1997). Diagnosing bootstrap success. Annals of the Institute of Statistical Mathematics 49 1-24. · Zbl 0928.62035
[7] Bickel, P. J. and Freedman, D. A. (1981). Some asymptotic theory for the bootstrap. The Annals of Statistics 9 1196-1217. · Zbl 0449.62034 · doi:10.1214/aos/1176345637
[8] Bickel, P. J. and Rosenblatt, M. (1973). On some global measures of the deviations of density function estimates. The Annals of Statistics 1 1071-1095. · Zbl 0275.62033
[9] Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993). Efficient and adaptive estimation for semiparametric models. Johns Hopkins Series in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD. · Zbl 0786.62001
[10] Bobkov, S. and Ledoux, M. (2019). One-dimensional empirical measures, order statistics, and Kantorovich transport distances. Mem. Amer. Math. Soc. 261 v+126. · Zbl 1454.60007
[11] Cambanis, S., Huang, S. and Simons, G. (1981). On the theory of elliptically contoured distributions. Journal of Multivariate Analysis 11 368-385. · Zbl 0469.60019
[12] Capanu, M. (2019). A unified approach to proving parametric bootstrap consistency for some goodness-of-fit tests. Statistics 53 58-80. · Zbl 1411.62046
[13] Carlier, G., Chernozhukov, V., Galichon, A. et al. (2016). Vector quantile regression: an optimal transport approach. The Annals of Statistics 44 1165-1192. · Zbl 1381.62239
[14] Chernozhukov, V., Galichon, A., Hallin, M., Henry, M. et al. (2017). Monge-Kantorovich depth, quantiles, ranks and signs. The Annals of Statistics 45 223-256. · Zbl 1426.62163
[15] Cramér, H. (1928). On the composition of elementary errors. Scandinavian Actuarial Journal 1 13-74. · JFM 54.0557.02
[16] del Barrio, E., Giné, E. and Utzet, F. (2005). Asymptotics for \(L_2\) functionals of the empirical quantile process, with applications to tests of fit based on weighted Wasserstein distances. Bernoulli 11 131-189. · Zbl 1063.62072
[17] del Barrio, E. and Loubes, J. M. (2019). Central limit theorems for empirical transportation cost in general dimension. The Annals of Probability 47 926-951. · Zbl 1466.60042
[18] del Barrio, E., Cuesta-Albertos, J. A., Matrán, C. and Rodríguez-Rodríguez, J. M. (1999). Tests of goodness of fit based on the \(L_2\)-Wasserstein distance. The Annals of Statistics 27 1230-1239. · Zbl 0961.62037
[19] del Barrio, E., Cuesta-Albertos, J. A., Matrán, C., Csörgö, S., Cuadras, C. M., de Wet, T., Giné, E., Lockhart, R., Munk, A. and Stute, W. (2000). Contributions of empirical and quantile processes to the asymptotic theory of goodness-of-fit tests. Test 9 1-96. · Zbl 0997.62034
[20] Ebner, B., Henze, N. and Yukich, J. E. (2018). Multivariate goodness-of-fit on flat and curved spaces via nearest neighbor distances. Journal of Multivariate Analysis 165 231-242. · Zbl 1397.62201
[21] Fan, Y. (1997). Goodness-of-fit tests for a multivariate distribution by the empirical characteristic function. Journal of Multivariate Analysis 62 36-63. · Zbl 0949.62044
[22] Fang, K.-T., Kotz, S. and Ng, K.-W. (1990). Symmetric multivariate and related distributions. Chapman & Hall, London. · Zbl 0699.62048
[23] Fournier, N. and Guillin, A. (2015). On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields 162 707-738. · Zbl 1325.60042
[24] Genest, C., Ghoudi, K. and Rivest, L. P. (1995). A semiparametric estimation procedure of dependence parameters in multivariate families of distributions. Biometrika 82 543-552. · Zbl 0831.62030 · doi:10.1093/biomet/82.3.543
[25] Genevay, A., Cuturi, M., Peyré, G. and Bach, F. (2016). Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems 3440-3448.
[26] Goldfeld, Z. and Kato, K. (2020). Limit Distribution Theory for Smooth Wasserstein Distance with Applications to Generative Modeling. arXiv preprint 2002.01012.
[27] Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations, third ed. The Johns Hopkins University Press, Baltimore and London. · Zbl 0865.65009
[28] Hallin, M., del Barrio, E., Cuesta Albertos, J. and Matrán, C. (2020). Distribution and quantile functions, ranks, and signs in dimension \(d\): a measure transportation approach. The Annals of Statistics (to appear). · Zbl 1468.62282
[29] Hartmann, V. and Schuhmacher, D. (2020). Semi-discrete optimal transport: a solution procedure for the unsquared Euclidean distance case. Mathematical Methods of Operations Research 92 133-163. · Zbl 1457.65014
[30] Henze, N. and Zirkler, B. (1990). A class of invariant consistent tests for multivariate normality. Communications in Statistics. Theory and Methods 19 3595-3617. · Zbl 0738.62068
[31] Hofert, M., Kojadinovic, I., Maechler, M. and Yan, J. (2018). copula: Multivariate Dependence with Copulas. R package version 0.999-19.1.
[32] Horowitz, J. and Karandikar, R. L. (1994). Mean rates of convergence of empirical measures in the Wasserstein distance. Journal of Computational and Applied Mathematics 55 261-273. · Zbl 0819.60031
[33] Joe, H. (2005). Asymptotic efficiency of the two-stage estimation method for copula-based models. Journal of Multivariate Analysis 94 401-419. · Zbl 1066.62061
[34] Khmaladze, E. V. (2016). Unitary transformations, empirical processes and distribution free testing. Bernoulli 22 563-588. · Zbl 1345.60094
[35] Kitagawa, J., Mérigot, Q. and Thibert, B. (2017). Convergence of a Newton algorithm for semi-discrete optimal transport. arXiv preprint 1603.05579v2. · Zbl 1439.49053
[36] Kolmogorov, A. (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell’Istituto Italiano degli Attuari 4 83-91. · JFM 59.1166.03
[37] Korkmaz, S., Goksuluk, D. and Zararsiz, G. (2014). MVN: An R Package for Assessing Multivariate Normality. The R Journal 6 151-162.
[38] Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation, 2nd ed. Springer Science+Business Media, New York. · Zbl 0916.62017
[39] Lévy, B. (2015). A numerical algorithm for L2 semi-discrete optimal transport in 3D. ESAIM: Mathematical Modelling and Numerical Analysis 49 1693-1715. · Zbl 1331.49037
[40] Massart, P. (1990). The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability 18 1269-1283. · Zbl 0713.62021
[41] McAssey, M. P. (2013). An empirical goodness-of-fit test for multivariate distributions. Journal of Applied Statistics 40 1120-1131. · Zbl 1514.62752
[42] Mena, G. and Niles-Weed, J. (2019). Statistical bounds for entropic optimal transport: sample complexity and the central limit theorem. In Advances in Neural Information Processing Systems 4541-4551.
[43] Mérigot, Q. (2011). A multiscale approach to optimal transport. In Computer Graphics Forum 30 1583-1592. Wiley Online Library.
[44] Munk, A. and Czado, C. (1998). Nonparametric validation of similar distributions and assessment of goodness of fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 60 223-241. · Zbl 0909.62047
[45] Panaretos, V. M. and Zemel, Y. (2019). Statistical aspects of Wasserstein distances. Annual Review of Statistics and its Applications 6 405-431.
[46] Peyré, G. and Cuturi, M. (2019). Computational Optimal Transport. Foundations and Trends in Machine Learning 11 355-607.
[47] Ramdas, A., García Trillos, N. and Cuturi, M. (2017). On Wasserstein two-sample testing and related families of nonparametric tests. Entropy 19 Paper No. 47, 15.
[48] Rippl, T., Munk, A. and Sturm, A. (2016). Limit laws of the empirical Wasserstein distance: Gaussian distributions. Journal of Multivariate Analysis 151 90-109. · Zbl 1351.62064
[49] Rizzo, M. L. and Székely, G. J. (2016). Energy distance. Wiley Interdisciplinary Reviews: Computational Statistics 8 27-38. · Zbl 07912789
[50] Royston, J. P. (1983). Some techniques for assessing multivariate normality based on the Shapiro-Wilk W. Journal of the Royal Statistical Society: Series C (Applied Statistics) 32 121-133. · Zbl 0536.62043
[51] Santambrogio, F. (2015). Optimal Transport for Applied Mathematicians. Progress in Nonlinear Differential Equations and their Applications 87. Birkhäuser/Springer, Cham. · Zbl 1401.49002
[52] Schmidt, M., Le Roux, N. and Bach, F. (2017). Minimizing finite sums with the stochastic average gradient. Mathematical Programming 162. · Zbl 1358.90073
[53] Schuhmacher, D., Bähre, B., Gottschlich, C., Hartmann, V., Heinemann, F. and Schmitzer, B. (2019). transport: Computation of Optimal Transport Plans and Wasserstein Distances. R package version 0.12-1.
[54] Smirnov, N. V. (1939). On the estimation of the discrepancy between empirical curves of distribution for two independent samples. Bull. Math. Univ. Moscow 2 3-14. · JFM 65.1356.04
[55] Smith, S. P. (1995). Differentiation of the Cholesky Algorithm. Journal of Computational and Graphical Statistics 4 134-147.
[56] Sommerfeld, M. and Munk, A. (2018). Inference for empirical Wasserstein distances on finite spaces. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80 219-238. · Zbl 1380.62121
[57] Tameling, C., Sommerfeld, M. and Munk, A. (2019). Empirical optimal transport on countable metric spaces: Distributional limits and statistical applications. The Annals of Applied Probability 29 2744-2781. · Zbl 1439.60028
[58] R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
[59] van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge. · Zbl 0910.62001
[60] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. With Applications to Statistics. Springer-Verlag, New York. · Zbl 0862.60002
[61] Villani, C. (2009). Optimal Transport: Old and New. Springer-Verlag, Berlin. · Zbl 1156.53003
[62] von Mises, R. E. (1928). Wahrscheinlichkeit, Statistik und Wahrheit. Julius Springer, Berlin. · JFM 54.0540.12
[63] Weed, J. and Bach, F. (2019). Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli 25 2620-2648. · Zbl 1428.62099