Species tree estimation under joint modeling of coalescence and duplication: sample complexity of quartet methods. (English) Zbl 1505.92139

Summary: We consider species tree estimation under a standard stochastic model of gene tree evolution that incorporates incomplete lineage sorting (as modeled by a coalescent process) and gene duplication and loss (as modeled by a branching process). Through a probabilistic analysis of the model, we derive sample complexity bounds for widely used quartet-based inference methods that highlight the effect of the duplication and loss rates in both subcritical and supercritical regimes.
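As a rough illustration of what a sample complexity bound for a quartet method looks like, the sketch below treats the simplest case only: a single four-taxon species tree under the multispecies coalescent, with no duplication or loss. It is therefore not the model analyzed in the paper. It uses the standard fact that an unrooted gene tree quartet matches the species tree with probability 1 - (2/3)e^{-tau}, where tau is the internal branch length in coalescent units, and each alternative topology has probability (1/3)e^{-tau}. The function names and the Hoeffding-based bound are assumptions made for this example.

```python
import math
import random


def quartet_probs(tau):
    """Quartet topology probabilities under the coalescent alone.

    tau is the internal branch length in coalescent units. Index 0 is the
    topology matching the species tree; indices 1 and 2 are the alternatives.
    """
    minor = math.exp(-tau) / 3.0
    return (1.0 - 2.0 * minor, minor, minor)


def majority_quartet(tau, m, rng):
    """Draw m independent gene-tree quartets and return the most frequent one."""
    counts = [0, 0, 0]
    for topo in rng.choices((0, 1, 2), weights=quartet_probs(tau), k=m):
        counts[topo] += 1
    # Ties are broken toward the lowest index (illustrative choice).
    return max(range(3), key=counts.__getitem__)


def hoeffding_genes(tau, delta):
    """Number of gene trees sufficient for the majority vote to be correct
    with probability at least 1 - delta (Hoeffding plus a union bound).

    The gap between the major and each minor probability is 1 - exp(-tau),
    which gives m >= 2 * ln(2 / delta) / gap**2.
    """
    gap = 1.0 - math.exp(-tau)
    return math.ceil(2.0 * math.log(2.0 / delta) / gap ** 2)


if __name__ == "__main__":
    rng = random.Random(0)
    for tau in (1.0, 0.1, 0.01):
        m = hoeffding_genes(tau, delta=0.05)
        print(tau, m, majority_quartet(tau, m, rng))
```

The bound grows like 1/tau^2 as the internal branch shrinks, which is the usual short-branch behavior of quartet voting. The paper's bounds additionally depend on the duplication and loss rates, which this sketch does not model.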

MSC:

92D15 Problems related to evolution
60J90 Coalescent processes
60J85 Applications of branching processes

References:

[1] ALLMAN, E. S., BAÑOS, H. and RHODES, J. A. (2019). NANUQ: A method for inferring species networks from gene trees under the coalescent model. Algorithms Mol. Biol. 14 24. · doi:10.1186/s13015-019-0159-2
[2] ALLMAN, E. S., DEGNAN, J. H. and RHODES, J. A. (2011). Identifying the rooted species tree from the distribution of unrooted gene trees under the coalescent. J. Math. Biol. 62 833-862. · Zbl 1230.92033 · doi:10.1007/s00285-010-0355-7
[3] ALLMAN, E. S., LONG, C. and RHODES, J. A. (2019). Species tree inference from genomic sequences using the log-det distance. SIAM J. Appl. Algebra Geom. 3 107-127. · Zbl 1415.92127 · doi:10.1137/18M1194134
[4] ANÉ, C., HO, L. S. T. and ROCH, S. (2017). Phase transition on the convergence rate of parameter estimation under an Ornstein-Uhlenbeck diffusion on a tree. J. Math. Biol. 74 355-385. · Zbl 1358.62069 · doi:10.1007/s00285-016-1029-x
[5] ARVESTAD, L., LAGERGREN, J. and SENNBLAD, B. (2009). The gene evolution model and computing its associated probabilities. J. ACM 56 Art. 7. · Zbl 1325.92064 · doi:10.1145/1502793.1502796
[6] ATHREYA, K. B. and NEY, P. E. (1972). Branching Processes. Die Grundlehren der Mathematischen Wissenschaften, Band 196. Springer, New York.
[7] BORGS, C., CHAYES, J. T., MOSSEL, E. and ROCH, S. (2006). The Kesten-Stigum reconstruction bound is tight for roughly symmetric binary channels. In FOCS 518-530.
[8] DASARATHY, G., NOWAK, R. and ROCH, S. (2015). Data requirement for phylogenetic inference from multiple loci: A new distance method. IEEE/ACM Trans. Comput. Biol. Bioinform. 12 422-432.
[9] DASKALAKIS, C., MOSSEL, E. and ROCH, S. (2011). Evolutionary trees and the Ising model on the Bethe lattice: A proof of Steel’s conjecture. Probab. Theory Related Fields 149 149-189. · Zbl 1221.92063 · doi:10.1007/s00440-009-0246-2
[10] DEGNAN, J. H. (2018). Modeling hybridization under the network multispecies coalescent. Syst. Biol. 67 786-799.
[11] DEGNAN, J. H. and ROSENBERG, N. A. (2006). Discordance of species trees with their most likely gene trees. PLoS Genet. 2 e68.
[12] DEGNAN, J. H. and ROSENBERG, N. A. (2009). Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol. Evol. 24 332-340.
[13] DRUMMOND, A. J. and RAMBAUT, A. (2007). BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol. Biol. 7 214.
[14] DU, P., HAHN, M. W. and NAKHLEH, L. (2019). Species tree inference under the multispecies coalescent on data with paralogs is accurate. bioRxiv.
[15] FAN, W.-T. and ROCH, S. (2018). Necessary and sufficient conditions for consistent root reconstruction in Markov models on trees. Electron. J. Probab. 23 Paper No. 47. · Zbl 1410.60074 · doi:10.1214/18-ejp165
[16] FELSENSTEIN, J. (2003). Inferring Phylogenies. Sinauer.
[17] GALTIER, N. (2007). A model of horizontal gene transfer and the bacterial phylogeny problem. Syst. Biol. 56 633-642.
[18] GANESH, A. and ZHANG, Q. (2019). Optimal sequence length requirements for phylogenetic tree reconstruction with indels. In STOC’19—Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing 721-732. ACM, New York. · Zbl 1437.92088 · doi:10.1145/3313276.3316345
[19] GASCUEL, O., ed. (2007). Mathematics of Evolution and Phylogeny. Oxford Univ. Press, Oxford. · Zbl 1274.92040
[20] KINGMAN, J. F. C. (1982). The coalescent. Stochastic Process. Appl. 13 235-248. · Zbl 0491.60076 · doi:10.1016/0304-4149(82)90011-4
[21] LARGET, B. R., KOTHA, S. K., DEWEY, C. N. and ANÉ, C. (2010). BUCKy: Gene tree/species tree reconciliation with Bayesian concordance analysis. Bioinformatics 26 2910-2911.
[22] LEGRIED, B., MOLLOY, E. K., WARNOW, T. and ROCH, S. (2020). Polynomial-time statistical estimation of species trees under gene duplication and loss. In Research in Computational Molecular Biology. Lecture Notes in Computer Science 12074 120-135. Springer, Cham. · Zbl 1500.92056 · doi:10.1007/978-3-030-45257-5_8
[23] LI, Q., GALTIER, N., SCORNAVACCA, C. and CHAN, Y.-B. (2020). The multilocus multispecies coalescent: A flexible new model of gene family evolution. bioRxiv.
[24] LINZ, S., RADTKE, A. and VON HAESELER, A. (2007). A likelihood framework to measure horizontal gene transfer. Mol. Biol. Evol. 24 1312-1319.
[25] MADDISON, W. (1997). Gene trees in species trees. Syst. Biol. 46 523-536.
[26] MARKIN, A. and EULENSTEIN, O. (2020). Quartet-based inference methods are statistically consistent under the unified duplication-loss-coalescence model.
[27] MATSEN, F. A. and STEEL, M. (2007). Phylogenetic mixtures on a single tree can mimic a tree of another topology. Syst. Biol. 56 767-775.
[28] MENG, C. and SALTER KUBATKO, L. (2009). Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: A model. Theor. Popul. Biol. 75 35-45. · Zbl 1210.92023
[29] MIHAESCU, R., HILL, C. and RAO, S. (2013). Fast phylogeny reconstruction through learning of ancestral sequences. Algorithmica 66 419-449. · Zbl 1266.68237 · doi:10.1007/s00453-012-9644-4
[30] MIRARAB, S., REAZ, R., BAYZID, M. S., ZIMMERMANN, T., SWENSON, M. S. and WARNOW, T. (2014). ASTRAL: Genome-scale coalescent-based species tree estimation. Bioinformatics 30 i541-i548.
[31] MOSSEL, E. (2003). On the impossibility of reconstructing ancestral data and phylogenies. J. Comput. Biol. 10 669-678.
[32] MOSSEL, E. (2004). Phase transitions in phylogeny. Trans. Amer. Math. Soc. 356 2379-2404. · Zbl 1041.92018 · doi:10.1090/S0002-9947-03-03382-8
[33] MOSSEL, E. (2004). Survey: Information flow on trees. In Graphs, Morphisms and Statistical Physics. DIMACS Ser. Discrete Math. Theoret. Comput. Sci. 63 155-170. Amer. Math. Soc., Providence, RI. · Zbl 1066.94006
[34] MOSSEL, E. and PERES, Y. (2003). Information flow on trees. Ann. Appl. Probab. 13 817-844. · Zbl 1050.60082 · doi:10.1214/aoap/1060202828
[35] MOSSEL, E. and ROCH, S. (2012). Phylogenetic mixtures: Concentration of measure in the large-tree limit. Ann. Appl. Probab. 22 2429-2459. · Zbl 1257.92037 · doi:10.1214/11-AAP837
[36] MOSSEL, E. and ROCH, S. (2017). Distance-based species tree estimation under the coalescent: Information-theoretic trade-off between number of loci and sequence length. Ann. Appl. Probab. 27 2926-2955. · Zbl 1379.92040 · doi:10.1214/16-AAP1273
[37] MOSSEL, E., ROCH, S. and SLY, A. (2011). On the inference of large phylogenies with long branches: How long is too long? Bull. Math. Biol. 73 1627-1644. · Zbl 1402.92319 · doi:10.1007/s11538-010-9584-6
[38] MOSSEL, E. and STEEL, M. (2004). A phase transition for a random cluster model on phylogenetic trees. Math. Biosci. 187 189-203. · Zbl 1047.92032 · doi:10.1016/j.mbs.2003.10.004
[39] NAKHLEH, L. (2013). Computational approaches to species phylogeny inference and gene tree reconciliation. Trends Ecol. Evol. 28 719-728.
[40] RABIEE, M., SAYYARI, E. and MIRARAB, S. (2019). Multi-allele species reconstruction using ASTRAL. Mol. Phylogenet. Evol. 130 286-296. · doi:10.1016/j.ympev.2018.10.033
[41] RANNALA, B., EDWARDS, S. V., LEACHÉ, A. and YANG, Z. (2020). The multi-species coalescent model and species tree inference. In Phylogenetics in the Genomic Era (C. Scornavacca, F. Delsuc and N. Galtier, eds.) 3.3:1-3.3:21. No commercial publisher | Authors open access book.
[42] RANNALA, B. and YANG, Z. (2003). Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics 164 1645-1656.
[43] RASMUSSEN, M. D. and KELLIS, M. (2012). Unified modeling of gene duplication, loss, and coalescence using a locus tree. Genome Res. 22 755-765.
[44] ROCH, S. (2010). Toward extracting all phylogenetic information from matrices of evolutionary distances. Science 327 1376-1379. · Zbl 1226.92058 · doi:10.1126/science.1182300
[45] ROCH, S., NUTE, M. and WARNOW, T. (2018). Long-branch attraction in species tree estimation: Inconsistency of partitioned likelihood and topology-based summary methods. Syst. Biol. 68 281-297.
[46] ROCH, S. and SLY, A. (2017). Phase transition in the sample complexity of likelihood-based phylogeny inference. Probab. Theory Related Fields 169 3-62. · Zbl 1379.92041 · doi:10.1007/s00440-017-0793-x
[47] ROCH, S. and SNIR, S. (2013). Recovering the treelike trend of evolution despite extensive lateral genetic transfer: A probabilistic analysis. J. Comput. Biol. 20 93-112. · doi:10.1089/cmb.2012.0234
[48] ROCH, S. and STEEL, M. (2015). Likelihood-based tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theor. Popul. Biol. 100 56-62. · Zbl 1331.92111
[49] ROCH, S. and WARNOW, T. (2015). On the robustness to gene tree estimation error (or lack thereof) of coalescent-based species tree methods. Syst. Biol. 64 663-676.
[50] SCHREMPF, D. and SZÖLLÖSI, G. (2020). The sources of phylogenetic conflicts. In Phylogenetics in the Genomic Era (C. Scornavacca, F. Delsuc and N. Galtier, eds.) 3.1:1-3.1:23. No commercial publisher | Authors open access book.
[51] SCORNAVACCA, C., DELSUC, F. and GALTIER, N. (2020). Phylogenetics in the Genomic Era. No commercial publisher | Authors open access book.
[52] ROCH, S. (2019). Hands-on Introduction to Sequence-Length Requirements in Phylogenetics. In Bioinformatics and Phylogenetics 47-86. Springer, Cham.
[53] SEMPLE, C. and STEEL, M. (2003). Phylogenetics. Oxford Lecture Series in Mathematics and Its Applications 24. Oxford Univ. Press, Oxford. · Zbl 1043.92026
[54] SHEKHAR, S., ROCH, S. and MIRARAB, S. (2018). Species tree estimation using ASTRAL: How many genes are enough? IEEE/ACM Trans. Comput. Biol. Bioinform. 15 1738-1747. · doi:10.1109/TCBB.2017.2757930
[55] SIMION, P., DELSUC, F. and PHILIPPE, H. (2020). To what extent current limits of phylogenomics can be overcome? In Phylogenetics in the Genomic Era (C. Scornavacca, F. Delsuc and N. Galtier, eds.) 2.1:1-2.1:34. No commercial publisher | Authors open access book.
[56] SOLÍS-LEMUS, C. and ANÉ, C. (2016). Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLoS Genet. 12 1-21.
[57] STEEL, M. (2016). Phylogeny—Discrete and Random Processes in Evolution. CBMS-NSF Regional Conference Series in Applied Mathematics 89. SIAM, Philadelphia, PA. · Zbl 1361.92001 · doi:10.1137/1.9781611974485.ch1
[58] VERSHYNIN, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics 47. Cambridge Univ. Press, Cambridge. · Zbl 1430.60005 · doi:10.1017/9781108231596
[59] WARNOW, T. (2017). Computational Phylogenetics: An Introduction to Designing Methods for Phylogeny Estimation. Cambridge Univ. Press, Cambridge. · Zbl 1377.92001
[60] YANG, Z. (2014). Molecular Evolution: A Statistical Approach. Oxford Univ. Press, Oxford. · Zbl 1288.92002