On a minimum distance procedure for threshold selection in tail analysis. (English) Zbl 1484.62057

Summary: Power-law distributions have been widely observed in different areas of scientific research. Practical estimation issues include selecting a threshold above which observations follow a power-law distribution and then estimating the power-law tail index. A minimum distance selection procedure (MDSP) proposed by A. Clauset et al. [SIAM Rev. 51, No. 4, 661–703 (2009; Zbl 1176.62001)] has been widely adopted in practice for the analysis of social networks. However, theoretical justifications for this selection procedure remain scant. In this paper, we study the asymptotic behavior of the selected threshold and the corresponding power-law index given by the MDSP. For independent and identically distributed (iid) observations with Pareto-like tails, we derive the limiting distribution of the chosen threshold and the power-law index estimator, where the latter estimator is not asymptotically normal. We deduce that in this iid setting the MDSP tends to choose too high a threshold level and show with asymptotic analysis and simulations how the variance increases compared to Hill estimators based on a nonrandom threshold. We also provide simulation results for dependent preferential attachment network data and find that the performance of the MDSP is highly dependent on the chosen model parameters.
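
To make the procedure described in the summary concrete, the following is a minimal sketch of a minimum distance threshold selection for continuous, positive data. It is an illustration under stated assumptions, not the authors' code: the function names (`hill_estimator`, `ks_distance`, `mdsp`) and the parameter `k_min` are hypothetical, the tail index is parametrized through the survival function \(y^{-\alpha}\), and the distance is taken to be the Kolmogorov-Smirnov distance between the empirical distribution of the relative excesses and the fitted Pareto distribution.

```python
import numpy as np


def hill_estimator(sorted_desc, k):
    """Hill estimate of the tail index alpha from the k largest observations.

    `sorted_desc` must be sorted in decreasing order; sorted_desc[k] is the threshold.
    """
    log_excess = np.log(sorted_desc[:k] / sorted_desc[k])
    return 1.0 / log_excess.mean()


def ks_distance(sorted_desc, k, alpha):
    """Kolmogorov-Smirnov distance between the empirical distribution of the
    relative excesses X/threshold and the Pareto cdf 1 - y**(-alpha), y >= 1."""
    y = np.sort(sorted_desc[:k] / sorted_desc[k])  # ascending relative excesses
    fitted_cdf = 1.0 - y ** (-alpha)
    upper = np.arange(1, k + 1) / k  # empirical cdf just after each point
    lower = np.arange(0, k) / k      # empirical cdf just before each point
    return max(np.max(upper - fitted_cdf), np.max(fitted_cdf - lower))


def mdsp(data, k_min=2):
    """Pick the number k of upper order statistics that minimizes the KS distance.

    Returns a dict with k, the threshold, the Hill estimate, and the distance.
    The search is brute force over all k (an assumption made for clarity).
    """
    x = np.sort(np.asarray(data, dtype=float))[::-1]
    best = None
    for k in range(k_min, x.size):
        # Skip thresholds for which the Hill estimator is undefined.
        if x[k] <= 0 or x[0] <= x[k]:
            continue
        alpha = hill_estimator(x, k)
        dist = ks_distance(x, k, alpha)
        if best is None or dist < best["distance"]:
            best = {"k": k, "threshold": x[k], "alpha": alpha, "distance": dist}
    if best is None:
        raise ValueError("no admissible threshold found")
    return best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.pareto(2.0, size=300) + 1.0  # exact Pareto sample, alpha = 2
    print(mdsp(sample))
```

Running the sketch repeatedly on simulated Pareto samples and comparing the spread of the resulting estimates with that of `hill_estimator` at a fixed `k` is one way to explore the variance increase discussed in the summary.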

MSC:

62G32 Statistics of extreme values; tail inference
60G70 Extreme value theory; extremal stochastic processes
62E20 Asymptotic distribution theory in statistics
60G15 Gaussian processes
62G30 Order statistics; empirical distribution functions
05C80 Random graphs (graph-theoretic aspects)

Citations:

Zbl 1176.62001

References:

[1] Y.-Y. Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong, Analysis of topological characteristics of huge online social networking services, in Proceedings of the 16th International Conference on World Wide Web, ACM, 2007, pp. 835-844.
[2] A. L. Barabási and R. Albert, Emergence of scaling in random networks, Science, 286 (1999), pp. 509-512. · Zbl 1226.05223
[3] J. Beirlant, Y. Goegebeur, J. Segers, J. Teugels, D. de Waal, and C. Ferro, Statistics of Extremes, Wiley, New York, 2004. · Zbl 1070.62036
[4] S. Bhamidi, Universal Techniques to Analyze Preferential Attachment Trees: Global and Local Analysis, preprint, http://www.unc.edu/~bhamidi/preferent.pdf, 2007.
[5] B. Bollobás, C. Borgs, J. Chayes, and O. Riordan, Directed scale-free graphs, in Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, Baltimore, MD, ACM, 2003, pp. 132-139. · Zbl 1094.68605
[6] B. A. Carreras, V. E. Lynch, I. Dobson, and D. E. Newman, Critical points and transitions in an electric power transmission model for cascading failure blackouts, Chaos, 12 (2002), pp. 985-994. · Zbl 1080.82579
[7] E. Cho, S. A. Myers, and J. Leskovec, Friendship and mobility: User movement in location-based social networks, in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2011, pp. 1082-1090.
[8] A. Clauset, C. R. Shalizi, and M. E. J. Newman, Power-law distributions in empirical data, SIAM Rev., 51 (2009), pp. 661-703. · Zbl 1176.62001
[9] S. Coles, An Introduction to Statistical Modeling of Extreme Values, Springer Ser. Statist., Springer, New York, 2001. · Zbl 0980.62043
[10] J. Danielsson, L. de Haan, L. Peng, and C. de Vries, Using a bootstrap method to choose the sample fraction in tail index estimation, J. Multivariate Anal., 76 (2001), pp. 226-248. · Zbl 0976.62044
[11] L. de Haan and A. Ferreira, Extreme Value Theory: An Introduction, Springer, New York, 2006. · Zbl 1101.62002
[12] H. Drees, Weighted approximations of tail processes for \(\beta \)-mixing random variables, Ann. Appl. Probab., 10 (2000), pp. 1274-1301. · Zbl 1073.60520
[13] H. Drees and E. Kaufmann, Selecting the optimal sample fraction in univariate extreme value statistics, Stochastic Process. Appl., 75 (1998), pp. 149-172. · Zbl 0926.62013
[14] R. T. Durrett, Random Graph Dynamics, Camb. Ser. Stat. Probab. Math., Cambridge University Press, Cambridge, 2010. · Zbl 1223.05002
[15] D. Ferger, A continuous mapping theorem for the argmax-functional in the non-unique case, Stat. Neerl., 58 (2004), pp. 83-96. · Zbl 1090.60032
[16] C. S. Gillespie, Fitting heavy tailed distributions: The poweRlaw package, J. Stat. Softw., 64 (2015), pp. 1-16, http://www.jstatsoft.org/v64/i02/.
[17] M. I. Gomes and O. Oliveira, The bootstrap methodology in statistics of extremes – Choice of the optimal sample fraction, Extremes, 4 (2001), pp. 331-358. · Zbl 1023.62048
[18] P. Hall, On some simple estimates of an exponent of regular variation, J. Roy. Statist. Soc. B, 44 (1982), pp. 37-42. · Zbl 0521.62024
[19] B. M. Hill, A simple general approach to inference about the tail of a distribution, Ann. Statist., 3 (1975), pp. 1163-1174. · Zbl 0323.62033
[20] T. Hsing, On tail index estimation using dependent data, Ann. Statist., 19 (1991), pp. 1547-1569. · Zbl 0738.62026
[21] N. E. Humphries, N. Queiroz, J. R. Dyer, N. G. Pade, M. K. Musyl, K. M. Schaefer, D. W. Fuller, J. M. Brunnschweiler, T. K. Doyle, J. D. Houghton, G. C. Hays, C. S. Jones, L. R. Noble, V. J. Wearmouth, E. J. Southall, and D. W. Sims, Environmental context explains Lévy and Brownian movement patterns of marine predators, Nature, 465 (2010), pp. 1066-1069.
[22] A. Java, X. Song, T. Finin, and B. Tseng, Why we Twitter: Understanding microblogging usage and communities, in Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis, ACM, 2007, pp. 56-65.
[23] M. Kivelä, A. Arenas, M. Barthelemy, J. P. Gleeson, Y. Moreno, and M. A. Porter, Multilayer networks, J. Complex Networks, 2 (2014), pp. 203-271.
[24] J. Komlós, P. Major, and G. Tusnády, An approximation of partial sums of independent RV’s and the sample DF. I, Probab. Theory Related Fields, 33 (1975), pp. 111-131. · Zbl 0308.60029
[25] J. Komlós, P. Major, and G. Tusnády, An approximation of partial sums of independent RV’s and the sample DF. II, Probab. Theory Related Fields, 34 (1976), pp. 33-58. · Zbl 0307.60045
[26] A. Koning and L. Peng, Goodness-of-fit tests for a heavy tailed distribution, J. Statist. Plann. Inference, 138 (2008), pp. 3960-3981. · Zbl 1146.62033
[27] P. Krapivsky, G. Rodgers, and S. Redner, Degree distributions of growing networks, Phys. Rev. Lett., 86 (2001), https://doi.org/10.1103/PhysRevLett.86.5401.
[28] P. L. Krapivsky and S. Redner, Organization of growing random networks, Phys. Rev. E, 63 (2001), pp. 1-14.
[29] J. Kunegis, KONECT – the Koblenz network collection, in Proceedings of the 22nd International Conference on World Wide Web, ACM, 2013, pp. 1343-1350.
[30] J. Kunegis, Handbook of Network Analysis; The Konect Project, University of Namur Center for Complex Systems, 2018, https://github.com/kunegis/konect-handbook/raw/master/konect-handbook.pdf.
[31] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters, Internet Math., 6 (2009), pp. 29-123. · Zbl 1205.91144
[32] B. Mandelbrot, The Pareto-Lévy law and the distribution of income, Internat. Econ. Rev., 1 (1960), pp. 79-106. · Zbl 0201.51101
[33] D. Mason, Laws of large numbers for sums of extreme values, Ann. Probab., 10 (1982), pp. 754-764. · Zbl 0493.60039
[34] M. Mitzenmacher, A brief history of generative models for power law and lognormal distributions, Internet Math., 1 (2004), pp. 226-251. · Zbl 1063.68526
[35] B. Oancea, T. Andrei, and D. Pirjol, Income inequality in Romania: The exponential-Pareto distribution, Phys. A, 469 (2017), pp. 486-498, https://doi.org/10.1016/j.physa.2016.11.094.
[36] R. D. Reiss, Approximate Distributions of Order Statistics, Springer, New York, 1989. · Zbl 0682.62009
[37] S. Resnick, Heavy-Tail Phenomena: Probabilistic and Statistical Modeling, Springer Ser. Oper. Res. Financ. Eng., Springer, New York, 2007. · Zbl 1152.62029
[38] S. I. Resnick and G. Samorodnitsky, Tauberian theory for multivariate regularly varying distributions with application to preferential attachment networks, Extremes, 18 (2015), pp. 349-367, https://doi.org/10.1007/s10687-015-0216-2. · Zbl 1345.60118
[39] S. I. Resnick and G. Samorodnitsky, Asymptotic normality of degree counts in a preferential attachment model, Adv. Appl. Probab., 48 (2016), pp. 283-299, https://doi.org/10.1017/apr.2016.56. · Zbl 1426.05152
[40] M. A. M. Safari, N. Masseran, and K. Ibrahim, Optimal threshold for Pareto tail modelling in the presence of outliers, Phys. A, 509 (2018), pp. 169-180, https://doi.org/10.1016/j.physa.2018.06.007.
[41] G. Samorodnitsky, S. Resnick, D. Towsley, R. Davis, A. Willis, and P. Wan, Nonstandard regular variation of in-degree and out-degree in the preferential attachment model, J. Appl. Probab., 53 (2016), pp. 146-161, https://doi.org/10.1017/jpr.2015.15. · Zbl 1343.60138
[42] P. Soriano-Hernández, M. del Castillo-Mussot, O. Córdoba-Rodríguez, and R. Mansilla-Corona, Non-stationary individual and household income of poor, rich and middle classes in Mexico, Phys. A, 465 (2017), pp. 403-413, https://doi.org/10.1016/j.physa.2016.08.042.
[43] R. van der Hofstad, Random Graphs and Complex Networks, Vol. 1, Camb. Ser. Stat. Probab. Math., Cambridge University Press, Cambridge, 2017, https://doi.org/10.1017/9781316779422. · Zbl 1361.05002
[44] A. W. van der Vaart, Asymptotic Statistics, Cambridge University Press, Cambridge, 1998. · Zbl 0910.62001
[45] Y. Virkar and A. Clauset, Power-law distributions in binned empirical data, Ann. Appl. Stat., 8 (2014), pp. 89-119. · Zbl 1454.62150
[46] P. Wan, T. Wang, R. A. Davis, and S. I. Resnick, Fitting the linear preferential attachment model, Electron. J. Stat., 11 (2017), pp. 3738-3780, https://doi.org/10.1214/17-EJS1327. · Zbl 1387.62074
[47] P. Wan, T. Wang, R. A. Davis, and S. I. Resnick, Are extreme estimation methods useful for network data?, Extremes, to appear, https://doi.org/10.1007/s10687-019-00359-x. · Zbl 1460.62085
[48] T. Wang and S. I. Resnick, Multivariate regular variation of discrete mass functions with applications to preferential attachment networks, Methodol. Comput. Appl. Probab., 20 (2018), pp. 1029-1042, https://doi.org/10.1007/s11009-016-9503-x. · Zbl 1401.28006
[49] T. Wang and S. I. Resnick, Asymptotic normality of in- and out-degree counts in a preferential attachment model, Stoch. Models, 33 (2017), pp. 229-255, https://doi.org/10.1080/15326349.2016.1256219. · Zbl 1367.05091
[50] T. Wang and S. I. Resnick, Consistency of Hill estimators in a linear preferential attachment model, Extremes, 22 (2019), pp. 1-28, https://doi.org/10.1007/s10687-018-0335-7. · Zbl 1432.60056
[51] T. Wang and S. I. Resnick, Degree growth rates and index estimation in a directed preferential attachment model, Stochastic Process. Appl., 130 (2020), pp. 878-906, https://doi.org/10.1016/j.spa.2019.03.021. · Zbl 1443.60078
[52] V. M. Yakovenko and J. B. Rosser, Jr., Colloquium: Statistical mechanics of money, wealth, and income, Rev. Modern Phys., 81 (2009), p. 1703.
[53] D. H. Zanette and S. C. Manrubia, Vertical transmission of culture and the distribution of family names, Phys. A, 295 (2001), pp. 1-8. · Zbl 0984.92516

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.