
A statistical test for correspondence of texts to the Zipf-Mandelbrot law. (English) Zbl 1459.62028

Summary: We analyse how well texts correspond to a simple probabilistic model. The model assumes that words are selected independently from an infinite dictionary and that the probability distribution of words follows the Zipf-Mandelbrot law. We count the number of different words in the text sequentially, which gives the process of the number of different words. We then estimate the parameters of the Zipf-Mandelbrot law from the same sequence and construct an estimate of the expected number of different words in the text. Next we subtract the corresponding values of the estimate from the sequence and normalize along the coordinate axes, obtaining a random process on the segment from 0 to 1. We prove that this process (the empirical text bridge) converges weakly in the uniform metric on \(C(0,1)\) to a centered Gaussian process with a.s. continuous paths. We develop and implement an algorithm for calculating the probability distribution of the integral of the square of this process. We present several examples of applying the algorithm to analyse the homogeneity of texts in English, French, Russian, and Chinese.
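
The pipeline described in the summary (count different words sequentially, subtract an estimated expectation, normalize, integrate the square) can be illustrated with a rough Python sketch. The summary does not state the paper's estimator or normalization, so everything specific here is an assumption made for illustration: the expected number of different words is approximated by a pure power law \(t^{\theta} R_n\), the exponent is estimated as \(\log_2(R_n / R_{\lfloor n/2 \rfloor})\), and the vertical scaling is \(\sqrt{R_n}\). The function names are hypothetical.

```python
import math


def distinct_word_counts(words):
    """Return R_1, ..., R_n: the number of different words among the first j words."""
    seen = set()
    counts = []
    for word in words:
        seen.add(word)
        counts.append(len(seen))
    return counts


def empirical_bridge(counts):
    """Illustrative bridge-type process; NOT necessarily the paper's construction.

    Assumptions: the mean curve is t**theta * R_n, theta is estimated from the
    counts at n and n // 2, and values are scaled by sqrt(R_n).
    """
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two words")
    r_n = counts[-1]
    r_half = counts[n // 2 - 1]
    theta = math.log2(r_n / r_half)
    bridge = [
        (counts[j - 1] - (j / n) ** theta * r_n) / math.sqrt(r_n)
        for j in range(1, n + 1)
    ]
    return theta, bridge


def omega_square(bridge):
    """Riemann-sum approximation of the integral of the squared process on [0, 1]."""
    return sum(z * z for z in bridge) / len(bridge)
```

In this sketch the process vanishes at \(t = 1\) by construction, which is why it is called a bridge. Turning the resulting statistic into a p-value requires the limiting distribution computed in the paper, which is not reproduced here.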

MSC:

62F03 Parametric hypothesis testing
60G15 Gaussian processes
