×

A formal derivation of Heaps’ law. (English) Zbl 1070.60009

Summary: Word frequencies in text documents can be reasonably described by the Mandelbrot distribution, which has Zipf’s law as a special case. Furthermore, the growth of vocabulary size as a function of the text size (its number of words) has been described in Heaps’ law. It has been shown that these two experimental laws are related. In this paper we go a step further, and provide a (formal) derivation of Heaps’ law from the Mandelbrot distribution. We also provide a specification of the validity area for applying Heaps’ law.

MSC:

60C05 Combinatorial probability
Full Text: DOI

References:

[1] Anton, H., Calculus (1999), Wiley · Zbl 0997.26500
[2] R. Ayoub, An Introduction to the the Analytic Theory of Numbers, vol. 10. AMS Mathematical Surveys, 1963; R. Ayoub, An Introduction to the the Analytic Theory of Numbers, vol. 10. AMS Mathematical Surveys, 1963 · Zbl 0128.04303
[3] Baeza-Yates, R.; Navarro, G., Block-addressing indices for approximate text retrieval, Journal of the American Society for Information Science (JASIS), 51, 1, 69-82 (2000)
[4] Gabaix, X., Zipf’s law for cities: An explanation, The Quarterly Journal of Economics, 113, 739-767 (1999) · Zbl 0952.91059
[5] Good, I. J., The population frequencies of species and the estimation of population parameters, Biometrika, 40, 3/4, 237-264 (1953) · Zbl 0051.37103
[6] H.S. Heaps, Information retrieval: computational and theoretical aspects, 1978; H.S. Heaps, Information retrieval: computational and theoretical aspects, 1978 · Zbl 0471.68075
[7] Li, W., Random texts exhibit Zipf’s-law-like word frequency distribution, IEEE Transactions on Information Theory, 38, 6, 1842-1845 (1992)
[8] Mandelbrot, B., The Pareto-Levy law and the distribution of income, International Economic Review I, 8, 79-106 (1960) · Zbl 0201.51101
[9] M. Mitzenmacher, A brief history of generative models for power law and lognormal distributions, TR-08-01 Computer Science Group, Havard University, Cambridge, MA; M. Mitzenmacher, A brief history of generative models for power law and lognormal distributions, TR-08-01 Computer Science Group, Havard University, Cambridge, MA · Zbl 1063.68526
[10] C. Samuelsson, Relating Turing’s Formula and Zipf’s Law. In: Proceedings of the 4th Workshop on Very Large Corpora, Copenhagen, Denmark, 1996; C. Samuelsson, Relating Turing’s Formula and Zipf’s Law. In: Proceedings of the 4th Workshop on Very Large Corpora, Copenhagen, Denmark, 1996
[11] Whittaker, E. T.; Watson, G. N., A Course of Modern Analysis (1965), Cambridge University Press, (reprinted) · Zbl 0108.26903
[12] Zipf, G., Human behavior and the principle of least effort (1949), Addison-Wesley: Addison-Wesley Cambridge, MA
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.