
Minimax nonparametric multi-sample test under smoothing. (English) Zbl 07928652

Summary: We consider the problem of comparing probability densities among multiple groups. To this end, we develop a new probabilistic tensor product smoothing spline framework to model the joint density of two variables. Under such a framework, the probability density comparison is equivalent to testing the presence/absence of interactions, for which we propose a penalized likelihood ratio test. Here we show that the test statistic is asymptotically chi-squared distributed under the null hypothesis. Furthermore, we derive a sharp minimax testing rate based on the Bernstein width for nonparametric multi-sample tests, and show that our proposed test statistic is minimax optimal. In addition, we develop a data-adaptive tuning criterion for choosing the penalty parameter. The results of simulations and real applications demonstrate that the proposed test outperforms conventional approaches under various scenarios.
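
The equivalence between density comparison and interaction testing stated in the summary can be sketched with a standard functional ANOVA decomposition of the log joint density. This is an illustrative sketch, not the paper's exact formulation: the symbols \(X\) (continuous measurement), \(Z\) (group label with \(K\) levels), \(\eta\), the roughness penalty \(J\), and the notation \(\mathrm{PLR}_{n,\lambda}\) are assumptions introduced here, and the usual side conditions that make the decomposition identifiable are taken for granted.

```latex
% Sketch under stated assumptions: X is the continuous variable,
% Z in {1, ..., K} is the group label, eta is the log joint density
% up to normalization.
\[
  f(x, z) = \frac{e^{\eta(x, z)}}{\sum_{z'=1}^{K} \int e^{\eta(u, z')}\,du},
  \qquad
  \eta(x, z) = \eta_X(x) + \eta_Z(z) + \eta_{XZ}(x, z).
\]
% The group-wise (conditional) density depends on z only through eta_{XZ}.
\[
  f(x \mid z) =
  \frac{\exp\{\eta_X(x) + \eta_{XZ}(x, z)\}}
       {\int \exp\{\eta_X(u) + \eta_{XZ}(u, z)\}\,du}.
\]
% Equality of the K densities is therefore the no-interaction hypothesis.
\[
  H_0 : f(\cdot \mid 1) = \dots = f(\cdot \mid K)
  \iff \eta_{XZ} \equiv 0.
\]
% Generic penalized likelihood ratio; J is a hypothetical roughness penalty.
\[
  \ell_{n,\lambda}(\eta) = -\frac{1}{n} \sum_{i=1}^{n} \log f_{\eta}(X_i, Z_i)
  + \frac{\lambda}{2} J(\eta),
  \qquad
  \mathrm{PLR}_{n,\lambda} = \ell_{n,\lambda}(\hat{\eta}_0) - \ell_{n,\lambda}(\hat{\eta}).
\]
```

Here \(\hat{\eta}\) minimizes \(\ell_{n,\lambda}\) over the full model and \(\hat{\eta}_0\) over the submodel with \(\eta_{XZ} \equiv 0\), so \(\mathrm{PLR}_{n,\lambda} \ge 0\) in this sketch, with large values suggesting that the group densities differ.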

MSC:

62-XX Statistics

References:

[1] Abramowitz, M. and Stegun, I. A. (1948). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. US Government Printing Office.
[2] Anderson, N. H., Hall, P. and Titterington, D. M. (1994). Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis 50, 41-54. · Zbl 0798.62055
[3] Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New York. · Zbl 0083.14601
[4] Bartlett, P. L., Bousquet, O. and Mendelson, S. (2005). Local Rademacher complexities. The Annals of Statistics 33, 1497-1537. · Zbl 1083.62034
[5] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological) 57, 289-300. · Zbl 0809.62014
[6] Berlinet, A. and Thomas-Agnan, C. (2011). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media.
[7] Bilban, M., Heintel, D., Scharl, T., Woelfel, T., Auer, M. M., Porpaczy, E. et al. (2006). Deregulated expression of fat and muscle genes in B-cell chronic lymphocytic leukemia with high lipoprotein lipase expression. Leukemia 20, 1080-1088.
[8] Braun, M. L. (2006). Accurate error bounds for the eigenvalues of the kernel matrix. Journal of Machine Learning Research 7, 2303-2328. · Zbl 1222.62064
[9] Cao, R. and Van Keilegom, I. (2006). Empirical likelihood tests for two-sample problems via nonparametric density estimation. Canadian Journal of Statistics 34, 61-77. · Zbl 1096.62038
[10] Darling, D. A. (1957). The Kolmogorov-Smirnov, Cramér-von Mises tests. The Annals of Mathematical Statistics 28, 823-838. · Zbl 0082.13602
[11] de la Vega-Monroy, M.L. L., Larrieta, E., German, M., Baez-Saldana, A. and Fernandez-Mejia, C. (2013). Effects of biotin supplementation in the diet on insulin secretion, islet gene expression, glucose homeostasis and beta-cell proportion. The Journal of Nutritional Biochemistry 24, 169-177.
[12] Eric, M., Bach, F. R. and Harchaoui, Z. (2008). Testing for homogeneity with kernel Fisher discriminant analysis. In Advances in Neural Information Processing Systems, 609-616.
[13] Fan, J., Zhang, C. and Zhang, J. (2001). Generalized likelihood ratio statistics and Wilks phenomenon. The Annals of Statistics 29, 153-193. · Zbl 1029.62042
[14] Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B. and Smola, A. J. (2007). A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, 513-520.
[15] Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B. and Smola, A. J. (2012). A kernel two-sample test. Journal of Machine Learning Research 13, 723-773. · Zbl 1283.62095
[16] Gu, C. (2013). Smoothing Spline ANOVA Models. Springer Science & Business Media. · Zbl 1269.62040
[17] Gu, C. and Qiu, C. (1993). Smoothing spline density estimation: Theory. The Annals of Statistics 21, 217-234. · Zbl 0770.62030
[18] Ingster, Y. I. (1989). Asymptotic minimax testing of independence hypothesis. Journal of Soviet Mathematics 44, 466-476. · Zbl 0682.62027
[19] Ingster, Y. I. (1993). Asymptotically minimax hypothesis testing for nonparametric alternatives. I, II, III. Mathematical Methods of Statistics 2, 85-114. · Zbl 0798.62057
[20] Jiang, B., Ye, C. and Liu, J. S. (2015). Nonparametric K-sample tests via dynamic slicing. Journal of the American Statistical Association 110, 642-653. · Zbl 1373.62195
[21] Kim, I. (2021). Comparing a large number of multivariate distributions. Bernoulli 27, 419-441. · Zbl 1467.62065
[22] Kim, Y.-J. and Gu, C. (2004). Smoothing spline Gaussian regression: More scalable computation via efficient approximation. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 66, 337-356. · Zbl 1062.62067
[23] Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y. and Póczos, B. (2017). MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, 2203-2213.
[24] Li, T. and Yuan, M. (2019). On the optimality of Gaussian kernel based nonparametric tests against smooth alternatives. arXiv:1909.03302.
[25] Lin, Y. (2000). Tensor product space ANOVA models. The Annals of Statistics 28, 734-755. · Zbl 1105.62329
[26] Liu, M., Shang, Z. and Cheng, G. (2020). Nonparametric distributed learning under general designs. Electronic Journal of Statistics 14, 3070-3102. · Zbl 1466.62364
[27] Liu, M., Shang, Z., Yang, Y. and Cheng, G. (2021). Nonparametric testing under randomized sketching. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 4280-4290.
[28] Ma, P., Huang, J. Z. and Zhang, N. (2015). Efficient computation of smoothing splines via adaptive basis sampling. Biometrika 102, 631-645. · Zbl 1452.62286
[29] Martínez-Camblor, P. and de Uña-Álvarez, J. (2009). Non-parametric k-sample tests: Density functions vs distribution functions. Computational Statistics & Data Analysis 53, 3344-3357. · Zbl 1453.62152
[30] Martínez-Camblor, P., de Uña-Álvarez, J. and Corral, N. (2008). k-sample test based on the common area of kernel density estimators. Journal of Statistical Planning and Inference 138, 4006-4020. · Zbl 1146.62026
[31] Mendelson, S. (2002). Geometric parameters of kernel machines. In International Conference on Computational Learning Theory, 29-43. Springer. · Zbl 1050.68070
[32] Miller, R. and Siegmund, D. (1982). Maximally selected chi square statistics. Biometrics 38, 1011-1016. · Zbl 0502.62091
[33] Novak, E., Ullrich, M., Woźniakowski, H. and Zhang, S. (2018). Reproducing kernels of Sobolev spaces on \(\mathbb{R}^d\) and applications to embedding constants and tractability. Analysis and Applications 16, 693-715. · Zbl 1405.46022
[34] Pinkus, A. (2012). N-widths in Approximation Theory. Springer Science & Business Media.
[35] Qin, J., Li, Y., Cai, Z., Li, S., Zhu, J., Zhang, F. et al. (2012). A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55-60.
[36] Scholz, F. W. and Stephens, M. A. (1987). K-sample Anderson-Darling tests. Journal of the American Statistical Association 82, 918-924.
[37] Shang, Z. and Cheng, G. (2013). Local and global asymptotic inference in smoothing spline models. The Annals of Statistics 41, 2608-2638. · Zbl 1293.62107
[38] Shapiro, S. S. and Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika 52, 591-611. · Zbl 0134.36501
[39] Silverman, B. W. (1982). On the estimation of a probability density function by the maximum penalized likelihood method. The Annals of Statistics 10, 795-810. · Zbl 0492.62034
[40] Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. CRC Press. · Zbl 0617.62042
[41] Tapia, R. and Thompson, J. (1978). Nonparametric Probability Density Estimation. Goucher College Series. Johns Hopkins University Press. · Zbl 0449.62029
[42] Tilg, H. and Moschen, A. R. (2014). Microbiota and diabetes: An evolving relationship. Gut 63, 1513-1521.
[43] Turnbaugh, P. J., Hamady, M., Yatsunenko, T., Cantarel, B. L., Duncan, A., Ley, R. E. et al. (2009). A core gut microbiome in obese and lean twins. Nature 457, 480-484.
[44] Wahba, G. (1990). Spline Models for Observational Data. SIAM. · Zbl 0813.62001
[45] Wang, Y. (2011). Smoothing Splines: Methods and Applications. CRC Press. · Zbl 1223.65011
[46] Wei, Y. and Wainwright, M. J. (2018). The local geometry of testing in ellipses: Tight control via localized Kolmogorov widths. arXiv:1712.00711.
[47] Xing, X., Liu, J. S. and Zhong, W. (2017). MetaGen: Reference-free learning with multiple metagenomic samples. Genome Biology 18, 187.
[48] Xing, X., Liu, M., Ma, P. and Zhong, W. (2020). Minimax nonparametric parallelism test. Journal of Machine Learning Research 21, 1-47. · Zbl 1502.62079
[49] Zhan, D. and Hart, J. (2014). Testing equality of a large number of densities. Biometrika 101, 449-464. · Zbl 1452.62565

Author addresses:
E-mail: xinxing@vt.edu
Zuofeng Shang, Department of Mathematical Sciences, New Jersey Institute of Technology, Newark, NJ 07102, USA. E-mail: zshang@njit.edu
Pang Du, Department of Statistics, Virginia Tech, Blacksburg, VA 24061, USA. E-mail: pangdu@vt.edu
E-mail: pingma@uga.edu
Wenxuan Zhong, Department of Statistics, University of Georgia, Athens, GA 30602, USA. E-mail: wenxuan@uga.edu
Jun S. Liu, Department of Statistics, Harvard University, Cambridge, MA 02138, USA. E-mail: jliu@stat.harvard.edu

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.