Species tree inference from genomic sequences using the log-det distance. (English) Zbl 1415.92127

Summary: The log-det distance between two aligned DNA sequences was introduced as a tool for statistically consistent inference of a gene tree under simple nonmixture models of sequence evolution. Here we prove that the log-det distance, coupled with a distance-based tree construction method, also permits consistent inference of species trees under mixture models appropriate to aligned genomic-scale sequence data. The data may include sites from many genetic loci, which evolved on different gene trees due to incomplete lineage sorting on an ultrametric species tree, with different time-reversible substitution processes. The simplicity and speed of distance-based inference suggest that log-det-based methods should serve as benchmarks for judging more elaborate and computationally intensive species tree inference methods.
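
For orientation, the pairwise distance named in the summary can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes one commonly used normalization, d = -(1/4) [ln|det F| - (1/2) ln(g_a g_b)], where F is the 4x4 matrix of relative site-pattern frequencies for the two sequences and g_a, g_b are the products of the base frequencies in each sequence. The function names `det` and `logdet_distance`, the choice to skip sites with gaps or ambiguity codes, and the normalization itself are assumptions of this sketch; the paper may use a different convention.

```python
import math

BASES = "ACGT"


def det(m):
    """Determinant by Laplace expansion along the first row (adequate for 4x4)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total


def logdet_distance(seq_a, seq_b):
    """Log-det distance between two aligned DNA sequences (assumed normalization)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    index = {base: i for i, base in enumerate(BASES)}
    counts = [[0] * 4 for _ in range(4)]
    n_sites = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: sites with gaps or ambiguity codes are simply skipped.
        if x in index and y in index:
            counts[index[x]][index[y]] += 1
            n_sites += 1
    if n_sites == 0:
        raise ValueError("no comparable sites")
    # F[i][j]: fraction of sites showing base i in seq_a and base j in seq_b.
    freq = [[c / n_sites for c in row] for row in counts]
    row_marginals = [sum(row) for row in freq]
    col_marginals = [sum(freq[i][j] for i in range(4)) for j in range(4)]
    d = det(freq)
    g = math.prod(row_marginals) * math.prod(col_marginals)
    if d == 0 or g == 0:
        raise ValueError("log-det distance is undefined for these sequences")
    return -0.25 * (math.log(abs(d)) - 0.5 * math.log(g))
```

In the setting the summary describes, the two input sequences would be concatenations of sites from many loci, and the resulting matrix of pairwise distances would be passed to a distance-based tree construction method.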

MSC:

92D15 Problems related to evolution
92D20 Protein sequences, DNA sequences
62P10 Applications of statistics to biology and medical sciences; meta analysis
