×

Chebyshev polynomials, moment matching, and optimal estimation of the unseen. (English) Zbl 1418.62127

Summary: We consider the problem of estimating the support size of a discrete distribution whose minimum nonzero mass is at least \(\frac{1}{k}\). Under the independent sampling model, we show that the sample complexity, that is, the minimal sample size to achieve an additive error of \(\epsilon k\) with probability at least 0.1 is within universal constant factors of \(\frac{k}{\log k}\log^{2}\frac{1}{\epsilon }\), which improves the state-of-the-art result of \(\frac{k}{\epsilon^{2}\log k}\) in [G. Valiant and P. Valiant, J. ACM 64, No. 6, Paper No. 37, 41 p. (2017; Zbl 1426.62107)]. Similar characterization of the minimax risk is also obtained. Our procedure is a linear estimator based on the Chebyshev polynomial and its approximation-theoretic properties, which can be evaluated in \(O(n+\log^{2}k)\) time and attains the sample complexity within constant factors. The superiority of the proposed estimator in terms of accuracy, computational efficiency and scalability is demonstrated in a variety of synthetic and real datasets.

MSC:

62G05 Nonparametric estimation
62C20 Minimax procedures in statistical decision theory
41A10 Approximation by polynomials

Citations:

Zbl 1426.62107

References:

[1] Bar-Yossef, Z., Jayram, T., Kumar, R., Sivakumar, D. and Trevisan, L. (2002). Counting distinct elements in a data stream. In Proceedings of the 6th Randomization and Approximation Techniques in Computer Science 1-10. Springer, Berlin. · Zbl 1028.68949
[2] Bunge, J. and Fitzpatrick, M. (1993). Estimating the number of species: A review. J. Amer. Statist. Assoc.88 364-373.
[3] Burnham, K. P. and Overton, W. S. (1979). Robust estimation of population size when capture probabilities vary among animals. Ecology60 927-936.
[4] Cai, T. T. and Low, M. G. (2011). Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional. Ann. Statist.39 1012-1041. · Zbl 1277.62101 · doi:10.1214/10-AOS849
[5] Chao, A. (1984). Nonparametric estimation of the number of classes in a population. Scand. J. Stat. 265-270.
[6] Chao, A. and Lee, S.-M. (1992). Estimating the number of classes via sample coverage. J. Amer. Statist. Assoc.87 210-217. · Zbl 0850.62145 · doi:10.1080/01621459.1992.10475194
[7] Charikar, M., Chaudhuri, S., Motwani, R. and Narasayya, V. (2000). Towards estimation error guarantees for distinct values. In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS) 268-279. ACM, New York.
[8] Darroch, J. and Ratcliff, D. (1980). A note on capture-recapture estimation. Biometrics36 149-153. · Zbl 0437.62100 · doi:10.2307/2530505
[9] Dzyadyk, V. K. and Shevchuk, I. A. (2008). Theory of Uniform Approximation of Functions by Polynomials. de Gruyter, Berlin. · Zbl 1148.41003
[10] Efron, B. and Thisted, R. (1976). Estimating the number of unseen species: How many words did Shakespeare know? Biometrika63 435-447. · Zbl 0344.62088
[11] Fisher, R. A., Corbet, A. S. and Williams, C. B. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. J. Anim. Ecol.12 42-58.
[12] Gandolfi, A. and Sastri, C. C. A. (2004). Nonparametric estimations about species not observed in a random sample. Milan J. Math.72 81-105. · Zbl 1225.62044 · doi:10.1007/s00032-004-0031-8
[13] Global Language Monitor. Number of words in the English language. https://www.languagemonitor.com/global-english/no-of-words/. Accessed: 2016-02-16. · Zbl 0567.58017
[14] Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika40 237-264. · Zbl 0051.37103 · doi:10.1093/biomet/40.3-4.237
[15] Good, I. J. and Toulmin, G. H. (1956). The number of new species, and the increase in population coverage, when a sample is increased. Biometrika43 45-63. · Zbl 0070.14403 · doi:10.1093/biomet/43.1-2.45
[16] Gotelli, N. J. and Colwell, R. K. (2011). Estimating species richness. Biological Diversity: Frontiers in Measurement and Assessment12 39-54.
[17] Harris, B. (1968). Statistical inference in the classical occupancy problem unbiased estimation of the number of classes. J. Amer. Statist. Assoc.63 837-847. · Zbl 0174.50401
[18] Huang, S.-P. and Weir, B. (2001). Estimating the total number of alleles using a sample coverage method. Genetics159 1365-1373.
[19] Ibragimov, I. A., Nemirovskii, A. S. and Khas’minskii, R. Z. (1987). Some problems on nonparametric estimation in Gaussian white noise. Theory Probab. Appl.31 391-406. · Zbl 0623.62028
[20] Jiao, J., Venkat, K., Han, Y. and Weissman, T. (2015). Minimax estimation of functionals of discrete distributions. IEEE Trans. Inform. Theory61 2835-2885. · Zbl 1359.62104 · doi:10.1109/TIT.2015.2412945
[21] Lepski, O., Nemirovski, A. and Spokoiny, V. (1999). On estimation of the \(L_r\) norm of a regression function. Probab. Theory Related Fields113 221-253. · Zbl 0921.62103 · doi:10.1007/s004409970006
[22] Lewontin, R. C. and Prout, T. (1956). Estimation of the number of different classes in a population. Biometrics12 211-223.
[23] Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer, New York. · Zbl 0605.62002
[24] Mao, C. X. and Lindsay, B. G. (2007). Estimating the number of classes. Ann. Statist.35 917-930. · Zbl 1117.62045 · doi:10.1214/009053606000001280
[25] Marchand, J. and Schroeck Jr, F. (1982). On the estimation of the number of equally likely classes in a population. Comm. Statist. Theory Methods11 1139-1146. · Zbl 0507.62006 · doi:10.1080/03610928208828299
[26] McNeil, D. R. (1973). Estimating an author’s vocabulary. J. Amer. Statist. Assoc.68 92-96.
[27] Mitzenmacher, M. and Upfal, E. (2005). Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge Univ. Press, Cambridge. · Zbl 1092.60001
[28] Orlitsky, A., Suresh, A. T. and Wu, Y. (2016). Optimal prediction of the number of unseen species. Proc. Natl. Acad. Sci. USA113 13283-13288. · Zbl 1407.62409 · doi:10.1073/pnas.1607774113
[29] Oxford English Dictinary. http://public.oed.com/about/. Accessed: 2016-02-16.
[30] Raskhodnikova, S., Ron, D., Shpilka, A. and Smith, A. (2009). Strong lower bounds for approximating distribution support size and the distinct elements problem. SIAM J. Comput.39 813-842. · Zbl 1194.68124 · doi:10.1137/070701649
[31] Robbins, H. E. (1968). Estimating the total probability of the unobserved outcomes of an experiment. Ann. Math. Stat.39 256-257. · Zbl 0155.25801 · doi:10.1214/aoms/1177698526
[32] Samuel, E. (1968). Sequential maximum likelihood estimation of the size of a population. Ann. Math. Stat.39 1057-1068. · Zbl 0193.16701 · doi:10.1214/aoms/1177698338
[33] Steele, J. M. (1986). An Efron-Stein inequality for nonsymmetric statistics. Ann. Statist.14 753-758. · Zbl 0604.62017 · doi:10.1214/aos/1176349952
[34] Thisted, R. and Efron, B. (1987). Did Shakespeare write a newly-discovered poem? Biometrika74 445-455. · Zbl 0635.62115 · doi:10.1093/biomet/74.3.445
[35] Timan, A. F. (1963). Theory of Approximation of Functions of a Real Variable. Pergamon Press, Elmsford, NY. · Zbl 0117.29001
[36] Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Revised and extended from the 2004 French original. Translated by Vladimir Zaiats. Springer, New York, NY.
[37] Valiant, G. and Valiant, P. (2010). A CLT and tight lower bounds for estimating entropy. In Electronic Colloquium on Computational Complexity (ECCC) 17 179. · Zbl 1426.62107 · doi:10.1145/3125643
[38] Valiant, G. and Valiant, P. (2011). Estimating the unseen: An \(n/\log(n)\)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing 685-694. · Zbl 1288.68186
[39] Valiant, G. and Valiant, P. (2011). The power of linear estimators. In Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on 403-412. IEEE, New York. · Zbl 1292.68159
[40] Valiant, P. and Valiant, G. (2013). Estimating the unseen: Improved estimators for entropy and other properties. In Advances in Neural Information Processing Systems 2157-2165. · Zbl 1426.62107 · doi:10.1145/3125643
[41] Wu, Y. and Yang, P. (2016). Sample complexity of the distinct elements problem. Available at arXiv:1612.03375. · Zbl 1416.62187
[42] Wu, Y. and Yang, P. (2016). Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Trans. Inform. Theory62 3702-3720. · Zbl 1359.94375 · doi:10.1109/TIT.2016.2548468
[43] Wu, Y.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.