Ancestral inference from haplotypes and mutations. (English) Zbl 1405.92174

Summary: We consider inference about the history of a sample of DNA sequences, conditional upon the haplotype counts and the number of segregating sites observed at the present time. After deriving some theoretical results in the coalescent setting, we implement rejection sampling and importance sampling schemes to perform the inference. The importance sampling scheme addresses an extension of the Ewens sampling formula for a configuration of haplotypes and the number of segregating sites in the sample. The implementations include both constant and variable population size models. The methods are illustrated by two human Y chromosome datasets.
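For orientation, the classical Ewens sampling formula that the summary refers to can be sketched as below. This is only the standard formula for haplotype counts under the infinite-alleles model, not the paper's extension that also conditions on the number of segregating sites. The function name `ewens_probability` and its arguments are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from math import factorial


def ewens_probability(haplotype_counts, theta):
    """Classical Ewens sampling formula (hypothetical helper, for illustration).

    haplotype_counts: list of positive integers, one count per distinct haplotype.
    theta: scaled mutation rate (> 0).
    Returns the probability of the unordered haplotype configuration:
        n! / (theta (theta+1) ... (theta+n-1)) * prod_j (theta/j)^{a_j} / a_j!
    where a_j is the number of haplotypes observed exactly j times.
    """
    n = sum(haplotype_counts)
    # a[j] = number of haplotypes represented exactly j times in the sample
    a = Counter(haplotype_counts)

    # Rising factorial theta (theta + 1) ... (theta + n - 1)
    rising = 1.0
    for i in range(n):
        rising *= theta + i

    prob = factorial(n) / rising
    for j, a_j in a.items():
        prob *= (theta / j) ** a_j / factorial(a_j)
    return prob


if __name__ == "__main__":
    # All configurations of a sample of size 3; the probabilities sum to 1.
    for config in ([3], [2, 1], [1, 1, 1]):
        print(config, ewens_probability(config, theta=2.0))
```

A rejection or importance sampler of the kind described in the summary would need considerably more machinery (coalescent histories, mutation placement); the sketch only fixes notation for the sampling formula itself.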

MSC:

92D10 Genetics and epigenetics
92D20 Protein sequences, DNA sequences
92D15 Problems related to evolution

References:

[1] Achaz, G., Frequency spectrum neutrality tests: one for all and all for one, Genetics, 183, 249-258, (2009)
[2] Arratia, R.; Barbour, A. D.; Tavaré, S., Logarithmic Combinatorial Structures: A Probabilistic Approach, Monographs in Mathematics, (2003), European Mathematical Society · Zbl 1040.60001
[3] Blum, M. G.; Rosenberg, N. A., Estimating the number of ancestral lineages using a maximum-likelihood method based on rejection sampling, Genetics, 176, 1741-1757, (2007)
[4] De Iorio, M.; Griffiths, R. C., Importance sampling on coalescent histories. I., Adv. Appl. Probab., 36, 417-433, (2004) · Zbl 1045.62111
[5] Ethier, S. N.; Griffiths, R. C., The infinitely-many-sites model as a measure-valued diffusion, Ann. Probab., 15, 515-545, (1987) · Zbl 0634.92007
[6] Felsenstein, J.; Kuhner, M.; Yamato, J.; Beerli, P., Likelihoods on coalescents: a Monte Carlo sampling approach to inferring parameters from population samples of molecular data, IMS Lect. Notes Monogr. Ser., 33, 163-185, (1999)
[7] Griffiths, R. C., Lines of descent in the diffusion approximation of neutral Wright-Fisher models, Theor. Popul. Biol., 17, 37-50, (1980) · Zbl 0434.92011
[8] Griffiths, R. C., Transient distribution of the number of segregating sites in a neutral infinite-sites model with no recombination, J. Appl. Probab., 18, 42-51, (1981) · Zbl 0457.92013
[9] Griffiths, R. C., The number of alleles and segregating sites in a sample from the infinite-alleles model, Adv. Appl. Probab., 14, 225-239, (1982) · Zbl 0501.92013
[10] Griffiths, R. C., Genealogical-tree probabilities in the infinitely-many-sites model, J. Math. Biol., 27, 667-680, (1989) · Zbl 0716.92012
[11] Griffiths, R. C., Ancestral inference from gene trees, (Veuille, M.; Slatkin, M., Modern Developments in Theoretical Population Genetics: The Legacy of Gustave Malécot, (2002), Oxford University Press New York), 94-117
[12] Griffiths, R. C., Coalescent lineage distributions, Adv. Appl. Probab., 38, 405-429, (2006) · Zbl 1092.92035
[13] Griffiths, R. C.; Lessard, S., Ewens’ sampling formula and related formulae: combinatorial proofs, extensions to variable population size and applications to ages of alleles, Theor. Popul. Biol., 68, 167-177, (2005) · Zbl 1085.92027
[14] Griffiths, R. C.; Tavaré, S., Simulating probability distributions in the coalescent, Theor. Popul. Biol., 46, 131-159, (1994) · Zbl 0807.92015
[15] Griffiths, R. C.; Tavaré, S., Ancestral inference in population genetics, Stat. Sci., 9, 307-319, (1994) · Zbl 0955.62644
[16] Griffiths, R. C.; Tavaré, S., Sampling theory for neutral alleles in a varying environment, Phil. Trans. R. Soc. Lond. B, 344, 403-410, (1994)
[17] Griffiths, R. C.; Tavaré, S., Computational methods for the coalescent, (Donnelly, P.; Tavaré, S., Progress in Population Genetics and Human Evolution, IMA Volumes in Mathematics and its Applications, vol. 87, (1997), Springer Verlag Berlin), 165-182 · Zbl 0893.92021
[18] Griffiths, R. C.; Tavaré, S., The age of a mutation in a general coalescent tree, Stoch. Models, 14, 273-295, (1998) · Zbl 0889.92017
[19] Griffiths, R. C.; Tavaré, S., The ages of mutations in gene trees, Ann. Appl. Probab., 9, 567-590, (1999) · Zbl 0948.92016
[20] Hammer, M. F.; Karafet, T.; Rasanayagam, A.; Wood, E. T.; Altheide, T. K.; Jenkins, T.; Griffiths, R. C.; Templeton, A. R.; Zegura, S. L., Out of Africa and back again: nested cladistic analysis of human Y chromosome variation, Mol. Biol. Evol., 15, 427-441, (1998)
[21] Innan, H.; Zhang, K.; Marjoram, P.; Tavaré, S.; Rosenberg, N. A., Statistical tests of the coalescent model based on the haplotype frequency distribution and the number of segregating sites, Genetics, 169, 1763-1777, (2005)
[22] Joyce, P.; Genz, A.; Buzbas, E. O., Efficient simulation and likelihood methods for non-neutral multi-allele models, J. Comput. Biol., 19, 650-661, (2012)
[23] Kingman, J. F. C., On the genealogy of large populations, J. Appl. Probab., 19A, 27-43, (1982) · Zbl 0516.92011
[24] Liu, J. S., Monte Carlo strategies in scientific computing, (2001), Springer New York · Zbl 0991.65001
[25] Poznik, G. D.; Xue, Y.; Mendez, F. L.; Willems, T. F.; Massaia, A.; Sayres, M. A. W.; Ayub, Q.; McCarthy, S. A.; Narechania, A.; Kashin, S.; Chen, Y.; Banerjee, R.; Rodriguez-Flores, J. L.; Cerezo, M.; Shao, H.; Gymrek, M.; Malhotra, A.; Louzada, S.; Desalle, R.; Ritchie, G. R. S.; Cerveira, E.; Fitzgerald, T. W.; Garrison, E.; Marcketta, A.; Mittelman, D.; Romanovitch, M.; Zhang, C.; Zheng-Bradley, X.; Abecasis, G. R.; McCarroll, S. A.; Flicek, P.; Underhill, P. A.; Coin, L.; Zerbino, D. R.; Yang, F.; Lee, C.; Clarke, L.; Auton, A.; Erlich, Y.; Handsaker, R. E.; Bustamante, C. D.; Tyler-Smith, C., Punctuated bursts in human male demography inferred from 1,244 worldwide Y-chromosome sequences, Nat. Genet., 48, 593-599, (2016)
[26] Slater, G. J.; Harmon, L. J.; Joyce, P.; Revell, L. J.; Alfaro, M. E., Fitting models of continuous trait evolution to incompletely sampled comparative data using approximate Bayesian computation, Evolution, 66, 752-762, (2012)
[27] Stephens, M.; Donnelly, P., Inference in molecular population genetics, J. Roy. Statist. Soc. B, 62, 605-655, (2000) · Zbl 0962.62107
[28] Tajima, F., Evolutionary relationship of DNA sequences in finite populations, Genetics, 105, 437-460, (1983)
[29] Tavaré, S., Line-of-descent and genealogical processes, and their application in population genetics models, Theor. Popul. Biol., 26, 119-164, (1984) · Zbl 0555.92011
[30] Tavaré, S., Ancestral Inference in Population Genetics, Lectures on Probability Theory and Statistics, vol. 1837, (2004), Springer Berlin Heidelberg, 1-188 · Zbl 1062.92046
[31] Tavaré, S.; Balding, D.; Griffiths, R. C.; Donnelly, P., Inferring coalescence times from DNA sequence data, Genetics, 145, 505-518, (1997)
[32] Watterson, G. A., The sampling theory of selectively neutral alleles, Adv. Appl. Probab., 6, 463-488, (1974) · Zbl 0289.62020
[33] Watterson, G. A., On the number of segregating sites in genetical models without recombination, Theor. Popul. Biol., 7, 256-276, (1975) · Zbl 0294.92011
[34] Ewens, W. J., The sampling theory of selectively neutral alleles, Theor. Popul. Biol., 3, 87-112, (1972) · Zbl 0245.92009
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.