
Classification accuracy as a proxy for two-sample testing. (English) Zbl 1461.62100

Summary: When data analysts train a classifier and check if its accuracy is significantly different from chance, they are implicitly performing a two-sample test. We investigate the statistical properties of this flexible approach in the high-dimensional setting. We prove two results that hold for all classifiers in any dimension: if the classifier's true error remains \(\epsilon \)-better than chance for some \(\epsilon > 0\) as \(d, n \to \infty \), then (a) the permutation-based test is consistent (has power approaching one), and (b) a computationally efficient test based on a Gaussian approximation of the null distribution is also consistent. To get a finer understanding of the rates of consistency, we study a specialized setting of distinguishing Gaussians with mean difference \(\delta\) and common (known or unknown) covariance \(\Sigma\), when \(d / n \to c \in (0, \infty)\). We study variants of Fisher’s linear discriminant analysis (LDA), such as “naive Bayes”, in a nontrivial regime where \(\epsilon \to 0\) (the Bayes classifier has true accuracy approaching 1/2), and contrast their power with that of corresponding variants of Hotelling’s test. Surprisingly, the expressions for their power match exactly in terms of \(n, d, \delta, \Sigma\), and the LDA approach is only worse by a constant factor, achieving an asymptotic relative efficiency (ARE) of \(1 / \sqrt{\pi}\) for balanced samples. We also extend our results to high-dimensional elliptical distributions with finite kurtosis. Other results of independent interest include minimax lower bounds and the optimality of Hotelling’s test when \(d = o(n)\). Simulation results validate our theory, and we present practical takeaway messages along with natural open problems.
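To make the procedure described in the summary concrete, the following is a minimal sketch of a classifier-accuracy permutation test. It is an illustration only, not the authors' implementation: it assumes scikit-learn's LogisticRegression as a stand-in classifier, a 50/50 sample split, and 200 label permutations, and the helper name classifier_two_sample_test is hypothetical.

```python
# Sketch of a classifier two-sample test: train a classifier to distinguish
# the two samples, measure held-out accuracy, and calibrate it by permuting
# the sample labels. Any classifier could replace LogisticRegression.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def classifier_two_sample_test(X, Y, n_permutations=200, seed=0):
    """Permutation p-value for H0: X and Y are drawn from the same distribution."""
    rng = np.random.default_rng(seed)
    data = np.vstack([X, Y])
    labels = np.concatenate([np.zeros(len(X)), np.ones(len(Y))])

    def heldout_accuracy(y):
        # Train on one half, report accuracy on the other half.
        Z_tr, Z_te, y_tr, y_te = train_test_split(
            data, y, test_size=0.5, random_state=seed, stratify=y)
        clf = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
        return clf.score(Z_te, y_te)

    observed = heldout_accuracy(labels)
    # Permutation null: reshuffle which sample each point belongs to and
    # repeat the whole train/test procedure.
    null_stats = [heldout_accuracy(rng.permutation(labels))
                  for _ in range(n_permutations)]
    p_value = (1 + sum(s >= observed for s in null_stats)) / (1 + n_permutations)
    return observed, p_value


if __name__ == "__main__":
    # Example: two Gaussian samples with a small mean shift in 50 dimensions.
    rng = np.random.default_rng(1)
    X = rng.normal(0.0, 1.0, size=(200, 50))
    Y = rng.normal(0.2, 1.0, size=(200, 50))
    acc, p = classifier_two_sample_test(X, Y)
    print(f"held-out accuracy = {acc:.3f}, permutation p-value = {p:.4f}")
```

Roughly speaking, the Gaussian-approximation variant mentioned in the summary replaces the permutation loop: under the null the held-out accuracy over \(m\) test points is approximately \(N(1/2, 1/(4m))\), so one can compare \(2\sqrt{m}\,(\mathrm{accuracy} - 1/2)\) with a standard normal quantile instead of recomputing the classifier under permuted labels.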

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62H15 Hypothesis testing in multivariate analysis
62E20 Asymptotic distribution theory in statistics

Software:

hypoRF

References:

[1] Anderson, T. W. (1951). Classification by multivariate analysis. Psychometrika 16 31-50.
[2] Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley Publications in Statistics. Wiley, New York. · Zbl 0083.14601
[3] Arias-Castro, E., Pelletier, B. and Saligrama, V. (2018). Remember the curse of dimensionality: The case of goodness-of-fit testing in arbitrary dimension. J. Nonparametr. Stat. 30 448-471. · Zbl 1402.62077 · doi:10.1080/10485252.2018.1435875
[4] Bai, Z. and Saranadasa, H. (1996). Effect of high dimension: By an example of a two sample problem. Statist. Sinica 6 311-329. · Zbl 0848.62030
[5] Ben-David, S., Blitzer, J., Crammer, K. and Pereira, F. (2007). Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems 137-144.
[6] Bhattacharya, B. B. (2020). Asymptotic distribution and detection thresholds for two-sample tests based on geometric graphs. Ann. Statist. 48 2879-2903. · Zbl 1473.62142 · doi:10.1214/19-AOS1913
[7] Bickel, P. J. and Levina, E. (2004). Some theory of Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations. Bernoulli 10 989-1010. · Zbl 1064.62073 · doi:10.3150/bj/1106314847
[8] Blanchard, G., Lee, G. and Scott, C. (2010). Semi-supervised novelty detection. J. Mach. Learn. Res. 11 2973-3009. · Zbl 1242.68205
[9] Borji, A. (2019). Pros and cons of GAN evaluation measures. Comput. Vis. Image Underst. 179 41-65.
[10] Chen, S. X. and Qin, Y.-L. (2010). A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Statist. 38 808-835. · Zbl 1183.62095 · doi:10.1214/09-AOS716
[11] Chen, N. F., Shen, W., Campbell, J. and Schwartz, R. (2009). Large-scale analysis of formant frequency estimation variability in conversational telephone speech. In Tenth Annual Conference of the International Speech Communication Association.
[12] Etzel, J. A., Gazzola, V. and Keysers, C. (2009). An introduction to anatomical ROI-based fMRI classification analysis. Brain Res. 1282 114-125.
[13] Fang, K. T., Kotz, S. and Ng, K. W. (2018). Symmetric Multivariate and Related Distributions. Chapman and Hall/CRC. · Zbl 0699.62048
[14] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugen. 7 179-188.
[15] Fisher, R. A. (1940). The precision of discriminant functions. Ann. Eugen. 10 422-429. · Zbl 0063.01384
[16] Frahm, G. (2004). Generalized elliptical distributions: Theory and applications. Ph.D. thesis, Universität zu Köln.
[17] Friedman, J. (2004). On multivariate goodness-of-fit and two-sample testing. Technical report, Stanford Linear Accelerator Center, Menlo Park, CA (US).
[18] Friedman, J. H. and Rafsky, L. C. (1979). Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. Ann. Statist. 7 697-717. · Zbl 0423.62034
[19] Gagnon-Bartsch, J. and Shem-Tov, Y. (2019). The classification permutation test: A flexible approach to testing for covariate imbalance in observational studies. Ann. Appl. Stat. 13 1464-1483. · Zbl 1434.62061 · doi:10.1214/19-AOAS1241
[20] Giri, N. and Kiefer, J. (1964). Local and asymptotic minimax properties of multivariate tests. Ann. Math. Stat. 35 21-35. · Zbl 0133.41805 · doi:10.1214/aoms/1177703730
[21] Giri, N., Kiefer, J. and Stein, C. (1963). Minimax character of Hotelling’s \(T^2\) test in the simplest case. Ann. Math. Stat. 34 1524-1535. · Zbl 0202.49506 · doi:10.1214/aoms/1177703884
[22] Golland, P. and Fischl, B. (2003). Permutation tests for classification: Towards statistical significance in image-based studies. In Biennial International Conference on Information Processing in Medical Imaging 330-341. Springer, New York.
[23] Gómez, E., Gómez-Villegas, M. A. and Marín, J. M. (2003). A survey on continuous elliptical vector distributions. Rev. Mat. Complut. 16 345-361. · Zbl 1041.60016
[24] Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B. and Smola, A. (2012). A kernel two-sample test. J. Mach. Learn. Res. 13 723-773. · Zbl 1283.62095
[25] Hediger, S., Michel, L. and Näf, J. (2019). On the use of random forest for two-sample testing. arXiv preprint, arXiv:1903.06287.
[26] Hemerik, J. and Goeman, J. J. (2018). False discovery proportion estimation by permutations: Confidence for significance analysis of microarrays. J. R. Stat. Soc. Ser. B. Stat. Methodol. 80 137-155. · Zbl 1380.62232 · doi:10.1111/rssb.12238
[27] Henze, N. (1988). A multivariate two-sample test based on the number of nearest neighbor type coincidences. Ann. Statist. 16 772-783. · Zbl 0645.62062 · doi:10.1214/aos/1176350835
[28] Hotelling, H. (1931). The generalization of Student’s ratio. Ann. Math. Stat. 2 360-378. · Zbl 0004.26503 · doi:10.1214/aoms/1177732979
[29] Hu, J. and Bai, Z. (2016). A review of 20 years of naive tests of significance for high-dimensional mean vectors and covariance matrices. Sci. China Math. 59 2281-2300. · Zbl 1360.62290 · doi:10.1007/s11425-016-0131-0
[30] Kariya, T. (1981). A robustness property of Hotelling’s \(T^2\)-test. Ann. Statist. 9 211-214. · Zbl 0453.62030 · doi:10.1214/aos/1176345350
[31] Kim, I., Ramdas, A., Singh, A. and Wasserman, L. (2021). Supplement to “Classification accuracy as a proxy for two-sample testing.” https://doi.org/10.1214/20-AOS1962SUPP
[32] Liu, Y., Li, C.-L. and Póczos, B. (2018). Classifier two-sample test for video anomaly detections. In British Machine Vision Conference 2018, BMVC 2018 71. Northumbria Univ., Newcastle, UK.
[33] Lopez-Paz, D. and Oquab, M. (2016). Revisiting classifier two-sample tests. arXiv preprint, arXiv:1610.06545.
[34] Luschgy, H. (1982). Minimax character of the two-sample \(\chi^2\)-test. Stat. Neerl. 36 129-134. · Zbl 0486.62051 · doi:10.1111/j.1467-9574.1982.tb00784.x
[35] Olivetti, E., Greiner, S. and Avesani, P. (2012). Induction in neuroscience with classification: Issues and solutions. In Machine Learning and Interpretation in Neuroimaging 42-50. Springer, New York.
[36] Pereira, F., Mitchell, T. and Botvinick, M. (2009). Machine learning classifiers and fMRI: A tutorial overview. NeuroImage 45 S199-S209.
[37] Raudys, Š. and Young, D. M. (2004). Results in statistical discriminant analysis: A review of the former Soviet Union literature. J. Multivariate Anal. 89 1-35. · Zbl 1036.62053 · doi:10.1016/S0047-259X(02)00021-0
[38] Rosenbaum, P. R. (2005). An exact distribution-free test comparing two multivariate distributions based on adjacency. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 515-530. · Zbl 1095.62053
[39] Rosenblatt, J. D., Benjamini, Y., Gilron, R., Mukamel, R. and Goeman, J. J. (2019). Better-than-chance classification for signal detection. Biostatistics.
[40] Salaevskii, O. (1969). Minimax character of Hotelling’s \(T^2\) test. I. In Investigations in Classical Problems of Probability Theory and Mathematical Statistics 74-101. Springer, New York.
[41] Schilling, M. F. (1986). Multivariate two-sample tests based on nearest neighbors. J. Amer. Statist. Assoc. 81 799-806. · Zbl 0612.62081 · doi:10.1080/01621459.1986.10478337
[42] Scott, C. and Nowak, R. (2005). A Neyman-Pearson approach to statistical learning. IEEE Trans. Inf. Theory 51 3806-3819. · Zbl 1318.62054 · doi:10.1109/TIT.2005.856955
[43] Simaika, J. B. (1941). On an optimum property of two important statistical tests. Biometrika 32 70-80. · Zbl 0063.07034 · doi:10.1093/biomet/32.1.70
[44] Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Lanckriet, G. R. and Schölkopf, B. (2009). Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems 1750-1758.
[45] Srivastava, M. S. and Du, M. (2008). A test for the mean vector with fewer observations than the dimension. J. Multivariate Anal. 99 386-402. · Zbl 1148.62042 · doi:10.1016/j.jmva.2006.11.002
[46] Srivastava, M. S., Katayama, S. and Kano, Y. (2013). A two sample test in high dimensional data. J. Multivariate Anal. 114 349-358. · Zbl 1255.62165 · doi:10.1016/j.jmva.2012.08.014
[47] Stelzer, J., Chen, Y. and Turner, R. (2013). Statistical inference and multiple testing correction in classification-based multi-voxel pattern analysis (MVPA): Random permutations and cluster size control. NeuroImage 65 69-82.
[48] van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics 3. Cambridge Univ. Press, Cambridge. · Zbl 0910.62001
[49] Wald, A. (1944). On a statistical problem arising in the classification of an individual into one of two groups. Ann. Math. Stat. 15 145-162. · Zbl 0063.08121 · doi:10.1214/aoms/1177731280
[50] Xiao, J., Wang, R., Teng, G. and Hu, Y. (2014). A transfer learning based classifier ensemble model for customer credit scoring. In 2014 Seventh International Joint Conference on Computational Sciences and Optimization 64-68. IEEE.
[51] Xiao, J., Xiao, Y., Huang, A., Liu, D. and Wang, S. (2015). Feature-selection-based dynamic transfer ensemble model for customer churn prediction. Knowl. Inf. Syst. 43 29-51.
[52] Yu, K., Martin, R., Rothman, N., Zheng, T. and Lan, Q. (2007). Two-sample comparison based on prediction error, with applications to candidate gene association studies. Ann. Hum. Genet. 71 107-118.
[53] Zhu, C.-Z., Zang, Y.-F., Cao, Q.-J., Yan, C.-G., He, Y., Jiang, T.-Z., Sui, M.-Q. and Wang, Y.-F. (2008). Fisher discriminative analysis of resting-state brain function for attention-deficit/hyperactivity disorder. NeuroImage 40 110-120.
[54] Zografos, K. (2008). On Mardia’s and Song’s measures of kurtosis in elliptical distributions. J. Multivariate Anal. 99 858-879. · Zbl 1133.62329 · doi:10.1016/j.jmva.2007.05.001
[55] Zollanvari, A., Braga-Neto, U. M. and Dougherty, E. R. (2011). Analytic study of performance of error estimators for linear discriminant analysis. IEEE Trans. Signal Process. 59 4238-4255. · Zbl 1391.62127 · doi:10.1109/TSP.2011.2159210