Adaptive estimation for Hawkes processes; application to genome analysis. (English) Zbl 1200.62135

Summary: The aim of this paper is to provide a new method for the detection of either favored or avoided distances between genomic events along DNA sequences. These events are modeled by a Hawkes process [see A. G. Hawkes and D. Oakes, J. Appl. Probab. 11, 493–503 (1974; Zbl 0305.60021)]. The biological problem is complex enough to require a nonasymptotic penalized model selection approach. We provide a theoretical penalty that satisfies an oracle inequality even for quite complex families of models. The resulting theoretical estimator is shown to be adaptive minimax for Hölder functions with regularity in (1/2, 1]; these aspects had not previously been studied for Hawkes processes. Moreover, we introduce an efficient strategy, named Islands, which is not classically used in model selection but proves particularly relevant to the biological question we want to answer. Since a multiplicative constant in the theoretical penalty is not computable in practice, we provide extensive simulations to find a data-driven calibration of this constant. The results obtained on real genomic data are consistent with biological knowledge and may refine it.
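Illustration (not part of the paper): the cluster representation of Hawkes and Oakes cited in the summary can be sketched as a small simulation. The sketch below is a minimal example under stated assumptions, not the authors' estimation procedure: it assumes a linear self-exciting process with spontaneous rate `nu` and a nonnegative, piecewise-constant reproduction function `h` whose integral is below 1, and it starts from an empty history at time 0, so the output is not stationary near the origin. The names `simulate_hawkes`, `sample_poisson`, `edges`, `heights` and `horizon` are hypothetical.

```python
import random


def sample_poisson(rng, mean):
    """Draw a Poisson variate by counting unit-rate exponential arrivals before `mean`."""
    count, total = 0, rng.expovariate(1.0)
    while total < mean:
        count += 1
        total += rng.expovariate(1.0)
    return count


def simulate_hawkes(nu, edges, heights, horizon, seed=0):
    """Simulate event positions on [0, horizon] via the cluster representation.

    Assumptions (illustrative only): h equals heights[k] on [edges[k], edges[k+1]),
    h is nonnegative, and its integral is strictly below 1 so clusters die out.
    """
    rng = random.Random(seed)
    # Mass of h on each bin; the sum is the mean number of children per event.
    masses = [h * (b - a) for h, a, b in zip(heights, edges[:-1], edges[1:])]
    total_mass = sum(masses)
    if min(heights) < 0 or total_mass >= 1:
        raise ValueError("h must be nonnegative with integral strictly below 1")

    # Generation 0: spontaneous events, a homogeneous Poisson process of rate nu.
    n_ancestors = sample_poisson(rng, nu * horizon)
    generation = [rng.uniform(0.0, horizon) for _ in range(n_ancestors)]
    events = []
    while generation:
        events.extend(generation)
        children = []
        for parent in generation:
            # Each event has Poisson(total_mass) children at distances with density h / total_mass.
            for _ in range(sample_poisson(rng, total_mass)):
                k = rng.choices(range(len(masses)), weights=masses)[0]
                child = parent + rng.uniform(edges[k], edges[k + 1])
                if child <= horizon:
                    children.append(child)
        generation = children
    return sorted(events)


# Example: distances in [0, 1) are favored more strongly than distances in [1, 3).
positions = simulate_hawkes(nu=0.5, edges=[0.0, 1.0, 3.0], heights=[0.2, 0.1], horizon=200.0, seed=1)
```

Avoided distances correspond to negative values of `h`, which this branching construction cannot represent; simulating them would need a different technique, such as thinning.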

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
92C40 Biochemistry, molecular biology
65C60 Computational problems in statistics (MSC2010)
62G05 Nonparametric estimation
62G20 Asymptotic properties of nonparametric inference
60E15 Inequalities; stochastic orderings
46N30 Applications of functional analysis in probability theory and statistics

Citations:

Zbl 0305.60021

References:

[1] Arlot, S. and Massart, P. (2009). Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res. 10 245-279.
[2] Baraud, Y., Comte, F. and Viennet, G. (2001). Model selection for (auto)-regression with dependent data. ESAIM Probab. Stat. 5 33-49. · Zbl 0990.62035 · doi:10.1051/ps:2001101
[3] Baraud, Y., Comte, F. and Viennet, G. (2001). Adaptive estimation in autoregression or beta-mixing regression via model selection. Ann. Statist. 29 839-875. · Zbl 1012.62034 · doi:10.1214/aos/1009210692
[4] Birgé, L. (2005). A new lower bound for multiple hypothesis testing. IEEE Trans. Inform. Theory 51 1611-1615. · Zbl 1283.62030 · doi:10.1109/TIT.2005.844101
[5] Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. (JEMS) 3 203-268. · Zbl 1037.62001 · doi:10.1007/s100970100031
[6] Birgé, L. and Massart, P. (2007). Minimal penalties for Gaussian model selection. Probab. Theory Related Fields 138 33-73. · Zbl 1112.62082 · doi:10.1007/s00440-006-0011-8
[7] Brémaud, P. and Massoulié, L. (1996). Stability of nonlinear Hawkes processes. Ann. Probab. 24 1563-1588. · Zbl 0870.60043 · doi:10.1214/aop/1065725193
[8] Brémaud, P. and Massoulié, L. (2001). Hawkes branching point processes without ancestors. J. Appl. Probab. 38 122-135. · Zbl 0983.60048 · doi:10.1239/jap/996986648
[9] Daley, D. J. and Vere-Jones, D. (2005). An Introduction to the Theory of Point Processes. Springer Series in Statistics I. Springer, New York. · Zbl 0657.60069
[10] Gallager, R. (1968). Information Theory and Reliable Communication. Wiley, New York. · Zbl 0198.52201
[11] Gusto, G. (2004). Estimation de l’intensité d’un processus de Hawkes généralisé double. Application à la recherche de motifs corépartis le long d’une séquence d’ADN. Ph.D. thesis, Univ. Paris. Available at .
[12] Gusto, G. and Schbath, S. (2005). FADO: A statistical method to detect favored or avoided distances between motif occurrences using the Hawkes’ model. Stat. Appl. Genet. Mol. Biol. 4 Article 24, 28 pp. (electronic). · Zbl 1095.62126 · doi:10.2202/1544-6115.1119
[13] Hawkes, A. G. and Oakes, D. (1974). A cluster process representation of a self-exciting process. J. Appl. Probab. 11 493-503. · Zbl 0305.60021 · doi:10.2307/3212693
[14] Lacour, C. (2007). Adaptive estimation of the transition density of a Markov chain. Ann. Inst. H. Poincaré Probab. Statist. 43 571-597. · Zbl 1125.62087 · doi:10.1016/j.anihpb.2006.09.003
[15] Massart, P. (2007). Concentration Inequalities and Model Selection. Lecture Notes in Math. 1896. Springer, Berlin. · Zbl 1170.60006 · doi:10.1007/978-3-540-48503-2
[16] Ogata, Y. and Akaike, H. (1982). On linear intensity models for mixed doubly stochastic Poisson and self-exciting point processes. J. Roy. Statist. Soc. Ser. B 44 102-107. · Zbl 0496.62074
[17] Ozaki, T. (1979). Maximum likelihood estimation of Hawkes’ self-exciting point processes. Ann. Inst. Statist. Math. 31 145-155. · Zbl 0447.62081 · doi:10.1007/BF02480272
[18] Reinert, G., Schbath, S. and Waterman, M. S. (2000). Probabilistic and statistical properties of words: An overview. J. Comput. Biol. 7 1-46.
[19] Reynaud-Bouret, P. (2003). Adaptive estimation of the intensity of inhomogeneous Poisson processes via concentration inequalities. Probab. Theory Related Fields 126 103-153. · Zbl 1019.62079 · doi:10.1007/s00440-003-0259-1
[20] Reynaud-Bouret, P. (2006). Compensator and exponential inequalities for some suprema of counting processes. Statist. Probab. Lett. 76 1514-1521. · Zbl 1101.60033 · doi:10.1016/j.spl.2006.03.012
[21] Reynaud-Bouret, P. (2006). Penalized projection estimators of the Aalen multiplicative intensity. Bernoulli 12 633-661. · Zbl 1125.62027 · doi:10.3150/bj/1155735930
[22] Reynaud-Bouret, P. and Roy, E. (2007). Some nonasymptotic tail estimate for Hawkes processes. Bull. Belg. Math. Soc. Simon Stevin 13 883-896. · Zbl 1120.60052
[23] Reynaud-Bouret, P. and Schbath, S. (2010). Adaptive estimation for Hawkes' processes; application to genome analysis. Available at . · Zbl 1200.62135
[24] Vere-Jones, D. and Ozaki, T. (1982). Some examples of statistical estimation applied to earthquake data. Ann. Inst. Statist. Math. 34 189-207.