
Identifiability of phylogenetic parameters from \(k\)-mer data under the coalescent. (English) Zbl 1410.92073

Summary: Distances between sequences based on their \(k\)-mer frequency counts can be used to reconstruct phylogenies without first computing a sequence alignment. Past work has shown that effective use of \(k\)-mer methods depends on (1) model-based corrections to distances based on \(k\)-mers and (2) breaking long sequences into blocks to obtain repeated trials from the sequence-generating process. Good performance of such methods requires many high-quality blocks with many homologous sites, which is difficult to guarantee a priori. Nature provides a natural partition of sequences into homologous regions, namely the genes. However, directly applying past work in this setting is problematic because of possible discordance between individual gene trees and the underlying species tree. Using the multispecies coalescent model as a basis, we derive model-based moment formulas that involve the species divergence times and the coalescent parameters. In this setting, we prove identifiability results for the tree and branch length parameters under the Jukes-Cantor model of sequence mutation.
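
To make the two ingredients named in the summary concrete, the following is a minimal sketch of an uncorrected \(k\)-mer distance averaged over blocks. It is an illustration only, not the paper's method: the function names (`kmer_counts`, `squared_kmer_distance`, `blockwise_mean_distance`) are hypothetical, the squared Euclidean distance is one common choice assumed here, and the model-based correction and the coalescent moment formulas derived in the paper are not implemented.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k, alphabet="ACGT"):
    """Count vector of all k-mers over `alphabet`, in lexicographic order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def squared_kmer_distance(seq_a, seq_b, k):
    """Uncorrected squared Euclidean distance between k-mer count vectors.

    Assumption: this raw distance stands in for the quantity that a
    model-based correction would later transform; no correction is applied.
    """
    a = kmer_counts(seq_a, k)
    b = kmer_counts(seq_b, k)
    return sum((x - y) ** 2 for x, y in zip(a, b))


def blockwise_mean_distance(blocks_a, blocks_b, k):
    """Average the raw distance over paired blocks (for example, genes).

    Each pair of blocks is treated as one trial of the sequence-generating
    process; the two lists are assumed to be in matching order.
    """
    dists = [squared_kmer_distance(a, b, k) for a, b in zip(blocks_a, blocks_b)]
    return sum(dists) / len(dists)
```

For example, `blockwise_mean_distance(["AAAA", "ACGT"], ["CCCC", "ACGT"], 2)` averages a distance of 18 for the first pair of blocks and 0 for the identical second pair. When the blocks are genes, each block may follow a different gene tree, which is the discordance the summary says must be accounted for.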

MSC:

92D15 Problems related to evolution
14P99 Real algebraic and real-analytic geometry

Software:

MUSCLE

References:

[1] Allman ES, Degnan JH, Rhodes JA (2011) Determining species tree topologies from clade probabilities under the coalescent. J Theor Biol 289:96-106 · Zbl 1397.92478 · doi:10.1016/j.jtbi.2011.08.006
[2] Allman ES, Rhodes JA, Sullivant S (2017) Statistically consistent k-mer methods for phylogenetic tree reconstruction. J Comput Biol 24(2):153-171 · doi:10.1089/cmb.2015.0216
[3] Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG (2010) Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms Mol Biol 5:21 · doi:10.1186/1748-7188-5-21
[4] Cox DA, Little J, O’Shea D (2015) Ideals, varieties, and algorithms: an introduction to computational algebraic geometry and commutative algebra, 4th edn. Undergraduate Texts in Mathematics. Springer, Cham · Zbl 1335.13001
[5] Dasarathy G, Nowak R, Roch S (2015) Data requirement for phylogenetic inference from multiple loci: A new distance method. IEEE/ACM Trans Comput Biol Bioinf 12:422-432 · doi:10.1109/TCBB.2014.2361685
[6] Daskalakis C, Roch S (2013) Alignment-free phylogenetic reconstruction: sample complexity via a branching process analysis. Ann Appl Probab 23:693-721 · Zbl 1377.92060 · doi:10.1214/12-AAP852
[7] Edgar RC (2004a) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinf 5:113 · doi:10.1186/1471-2105-5-113
[8] Edgar RC (2004b) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792-1797 · doi:10.1093/nar/gkh340
[9] Gale D, Nikaido H (1965) The Jacobian matrix and global univalence of mappings. Math Ann 159(2):81-93 · Zbl 0158.04903 · doi:10.1007/BF01360282
[10] Kingman JFC (1982) The coalescent. Stoch Process Their Appl 13(3):235-248 · Zbl 0491.60076 · doi:10.1016/0304-4149(82)90011-4
[11] Leung D, Drton M, Hara H et al (2016) Identifiability of directed Gaussian graphical models with one latent source. Electron J Stat 10(1):394-422 · Zbl 1332.62172 · doi:10.1214/16-EJS1111
[12] McVean GAT (2002) A genealogical interpretation of linkage disequilibrium. Genetics 162(2):987-991
[13] Pamilo P, Nei M (1988) Relationships between gene trees and species trees. Mol Biol Evol 5(5):568-583
[14] Rannala B, Yang Z (2003) Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics 164(4):1645-1656
[15] Semple C, Steel M (2003) Phylogenetics, volume 24 of Oxford lecture series in mathematics and its applications. Oxford University Press, Oxford · Zbl 1043.92026
[16] Sievers F, Wilm A, Dineen DG, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7:539 · doi:10.1038/msb.2011.75
[17] Speyer D, Sturmfels B (2004) The tropical Grassmannian. Adv Geom 4(3):389-411 · Zbl 1065.14071 · doi:10.1515/advg.2004.023
[18] Takahata N (1986) An attempt to estimate the effective size of the ancestral species common to two extant species from which homologous genes are sequenced. Genet Res 48(3):187-190 · doi:10.1017/S001667230002499X
[19] Takahata N, Satta Y, Klein J (1995) Divergence time and population size in the lineage leading to modern humans. Theor Popul Biol 48(2):198-221 · Zbl 0854.92013 · doi:10.1006/tpbi.1995.1026
[20] Wakeley J (2009) Coalescent theory: an introduction, vol 1. Roberts & Company Publishers, Greenwood Village, Colorado · Zbl 1366.92001