
Ancestral sequence reconstruction for co-evolutionary models. (English) Zbl 1539.92078

Summary: The ancestral sequence reconstruction problem is the inference, back in time, of the properties of common sequence ancestors from the measured properties of contemporary populations. Standard algorithms for this problem assume independent (factorized) evolution of the characters of the sequences, an assumption that generally does not hold (e.g. for proteins and genome sequences). In this work, we have studied this problem for sequences described by global co-evolutionary models, which reproduce the global pattern of cooperative interactions between the elements that compose them. For this, we first modeled the temporal evolution of correlated real-valued characters by a multivariate Ornstein-Uhlenbeck process on a finite tree. This represents sequences as Gaussian vectors evolving in a quadratic potential, which describes the selection forces acting on the evolving entities. Under a Bayesian framework, we developed a reconstruction algorithm for these sequences and obtained an analytical expression to quantify the quality of our estimation. We then extended this formalism to discrete-valued sequences by applying our method to a Potts model. We showed that for both continuous and discrete configurations, there is a wide range of parameters where, to properly reconstruct the ancestral sequences, intra-species correlations must be taken into account. We also demonstrated that, for sequences with discrete elements, our reconstruction algorithm outperforms traditional schemes based on independent site approximations.
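As a rough illustration of the continuous case described in the summary, the following Python sketch computes the Bayesian posterior of a root ancestor under a multivariate Ornstein-Uhlenbeck process. It is not the authors' algorithm: it assumes a star tree (every leaf descends directly from the root after the same time t), a stationary distribution N(0, C), and a drift proportional to the inverse of C with a hypothetical time scale gamma. The function names ou_propagator and root_posterior are invented for this example.

```python
import numpy as np


def ou_propagator(C, t, gamma=1.0):
    """Return (Lam, Sigma) with x(t) | x(0) ~ N(Lam @ x(0), Sigma).

    Assumption: drift matrix inv(C) / gamma and stationary law N(0, C),
    so Lam = exp(-inv(C) * t / gamma) and Sigma = C - Lam C Lam^T.
    """
    c, U = np.linalg.eigh(C)              # C is symmetric positive definite
    lam = np.exp(-t / (gamma * c))        # eigenvalues of the propagator
    Lam = (U * lam) @ U.T
    Sigma = (U * (c * (1.0 - lam**2))) @ U.T
    return Lam, Sigma


def root_posterior(leaves, C, t, gamma=1.0):
    """Posterior mean and covariance of the root given the leaves (star tree)."""
    leaves = np.asarray(leaves, dtype=float)
    Lam, Sigma = ou_propagator(C, t, gamma)
    A = Lam.T @ np.linalg.inv(Sigma)      # maps a leaf to its evidence term
    # Prior precision plus one identical likelihood term per leaf.
    precision = np.linalg.inv(C) + len(leaves) * (A @ Lam)
    cov = np.linalg.inv(precision)
    mean = cov @ (A @ leaves.sum(axis=0))
    return mean, cov


# Small synthetic check: simulate a root and its leaves, then reconstruct.
rng = np.random.default_rng(0)
n, n_leaves, t = 4, 20, 0.5
B = rng.normal(size=(n, n))
C = B @ B.T + n * np.eye(n)               # correlated stationary covariance
Lam, Sigma = ou_propagator(C, t)
root = rng.multivariate_normal(np.zeros(n), C)
leaves = rng.multivariate_normal(np.zeros(n), Sigma, size=n_leaves) + root @ Lam.T
mean, cov = root_posterior(leaves, C, t)
print("reconstruction error:", np.linalg.norm(mean - root))
print("posterior std per site:", np.sqrt(np.diag(cov)))
```

Because C enters both the prior and the propagator, the off-diagonal correlations between sites influence the estimate at each site. Replacing C by its diagonal in root_posterior would give the independent-site approximation for comparison.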

MSC:

92D15 Problems related to evolution
92D25 Population dynamics (general)

Software:

PAML

References:

[1] Joy, J. B.; Liang, R. H.; McCloskey, R. M.; Nguyen, T.; Poon, A. F. Y., Ancestral reconstruction, PLoS Comput. Biol., 12 (2016) · doi:10.1371/journal.pcbi.1004763
[2] Felsenstein, J., Evolutionary trees from DNA sequences: a maximum likelihood approach, J. Mol. Evol., 17, 368-376 (1981) · doi:10.1007/bf01734359
[3] Yang, Z.; Kumar, S.; Nei, M., A new method of inference of ancestral nucleotide and amino acid sequences, Genetics, 141, 1641-1650 (1995) · doi:10.1093/genetics/141.4.1641
[4] Koshi, J. M.; Goldstein, R. A., Probabilistic reconstruction of ancestral protein sequences, J. Mol. Evol., 42, 313-320 (1996) · doi:10.1007/bf02198858
[5] Pagel, M., The maximum likelihood approach to reconstructing ancestral character states of discrete characters on phylogenies, Syst. Biol., 48, 612-622 (1999) · doi:10.1080/106351599260184
[6] Pupko, T.; Pe'er, I.; Shamir, R.; Graur, D., A fast algorithm for joint reconstruction of ancestral amino acid sequences, Mol. Biol. Evol., 17, 890-896 (2000) · doi:10.1093/oxfordjournals.molbev.a026369
[7] Yang, Z., PAML 4: phylogenetic analysis by maximum likelihood, Mol. Biol. Evol., 24, 1586-1591 (2007) · doi:10.1093/molbev/msm088
[8] Huelsenbeck, J. P.; Bollback, J. P., Empirical and hierarchical Bayesian estimation of ancestral states, Syst. Biol., 50, 351-366 (2001) · doi:10.1080/106351501300317978
[9] Breen, M. S.; Kemena, C.; Vlasov, P. K.; Notredame, C.; Kondrashov, F. A., Epistasis as the primary factor in molecular evolution, Nature, 490, 535-538 (2012) · doi:10.1038/nature11510
[10] Harms, M. J.; Thornton, J. W., Evolutionary biochemistry: revealing the historical and physical causes of protein properties, Nat. Rev. Genet., 14, 559-571 (2013) · doi:10.1038/nrg3540
[11] Olson, C. A.; Wu, N. C.; Sun, R., A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain, Curr. Biol., 24, 2643-2651 (2014) · doi:10.1016/j.cub.2014.09.072
[12] Rollins, N. J.; Brock, K. P.; Poelwijk, F. J.; Stiffler, M. A.; Gauthier, N. P.; Sander, C.; Marks, D. S., 3D protein structure from genetic epistasis experiments, bioRxiv (2018) · doi:10.1101/320721
[13] Kimura, M., Attainment of quasi linkage equilibrium when gene frequencies are changing by natural selection, Genetics, 52, 875-890 (1965) · doi:10.1093/genetics/52.5.875
[14] Gao, C-Y; Cecconi, F.; Vulpiani, A.; Zhou, H-J; Aurell, E., DCA for genome-wide epistasis analysis: the statistical genetics perspective, Phys. Biol., 16 (2019) · doi:10.1088/1478-3975/aafbe0
[15] Nguyen, H. C.; Zecchina, R.; Berg, J., Inverse statistical problems: from the inverse Ising problem to data science, Adv. Phys., 66, 197-261 (2017) · doi:10.1080/00018732.2017.1341604
[16] Levy, R. M.; Haldane, A.; Flynn, W. F., Potts Hamiltonian models of protein co-variation, free energy landscapes, and evolutionary fitness, Curr. Opin. Struct. Biol., 43, 55-62 (2017) · doi:10.1016/j.sbi.2016.11.004
[17] Cocco, S.; Feinauer, C.; Figliuzzi, M.; Monasson, R.; Weigt, M., Inverse statistical physics of protein sequences: a key issues review, Rep. Prog. Phys., 81 (2018) · doi:10.1088/1361-6633/aa9965
[18] Morcos, F., Direct-coupling analysis of residue coevolution captures native contacts across many protein families, Proc. Natl Acad. Sci., 108, E1293-E1301 (2011) · doi:10.1073/pnas.1111471108
[19] Russ, W. P., An evolution-based model for designing chorismate mutase enzymes, Science, 369, 440-445 (2020) · Zbl 1478.92063 · doi:10.1126/science.aba3304
[20] Zeng, H-L; Dichio, V.; Horta, E. R.; Thorell, K.; Aurell, E., Global analysis of more than 50 000 SARS-CoV-2 genomes reveals epistasis between eight viral genes, Proc. Natl Acad. Sci., 117, 31519-31526 (2020) · doi:10.1073/pnas.2012331117
[21] Huelsenbeck, J. P.; Nielsen, R., Effect of nonindependent substitution on phylogenetic accuracy, Syst. Biol., 48, 317-328 (1999) · doi:10.1080/106351599260319
[22] Nasrallah, C. A.; Mathews, D. H.; Huelsenbeck, J. P., Quantifying the impact of dependent evolution among sites in phylogenetic inference, Syst. Biol., 60, 60-73 (2011) · doi:10.1093/sysbio/syq074
[23] Muntoni, A. P.; Pagnani, A.; Weigt, M.; Zamponi, F., Aligning biological sequences by exploiting residue conservation and coevolution, Phys. Rev. E, 102 (2020) · doi:10.1103/PhysRevE.102.062409
[24] Bartoszek, K.; Pienaar, J.; Mostad, P.; Andersson, S.; Hansen, T. F., A phylogenetic comparative method for studying multivariate adaptation, J. Theor. Biol., 314, 204-215 (2012) · Zbl 1397.92481 · doi:10.1016/j.jtbi.2012.08.005
[25] Mitov, V.; Bartoszek, K.; Asimomitis, G.; Stadler, T., Fast likelihood calculation for multivariate Gaussian phylogenetic models with shifts, Theor. Popul. Biol., 131, 66-78 (2020) · Zbl 1516.92058 · doi:10.1016/j.tpb.2019.11.005
[26] Horta, E. R.; Lage-Castellanos, A.; Weigt, M.; Barrat-Charlaix, P., Global multivariate model learning from hierarchically correlated data, J. Stat. Mech. (2021) · Zbl 1539.82249 · doi:10.1088/1742-5468/ac06c2
[27] Baldassi, C.; Zamparo, M.; Feinauer, C.; Procaccini, A.; Zecchina, R.; Weigt, M.; Pagnani, A., Fast and accurate multivariate Gaussian modeling of protein families: predicting residue contacts and protein-interaction partners, PLoS One, 9 (2014) · doi:10.1371/journal.pone.0092721
[28] Horta, E. R.; Weigt, M., On the effect of phylogenetic correlations in coevolution based contact prediction in proteins, PLoS Comput. Biol., 17 (2021) · doi:10.1371/journal.pcbi.1008957
[29] Weiss, Y.; Freeman, W. T., Correctness of belief propagation in Gaussian graphical models of arbitrary topology, Neural Comput., 13, 2173-2200 (2001) · Zbl 0992.68055 · doi:10.1162/089976601750541769
[30] Malioutov, D. M.; Johnson, J. K.; Willsky, A. S., Walk-sums and belief propagation in Gaussian graphical models, J. Mach. Learn. Res., 7, 2031-2064 (2006) · Zbl 1222.68254
[31] Bickson, D., Gaussian belief propagation: theory and application (2009)
[32] Morcos, F., Direct-coupling analysis of residue coevolution captures native contacts across many protein families, Proc. Natl Acad. Sci., 108, E1293-E1301 (2011) · doi:10.1073/pnas.1111471108
[33] Cocco, S.; Feinauer, C.; Figliuzzi, M.; Monasson, R.; Weigt, M., Inverse statistical physics of protein sequences: a key issues review, Rep. Prog. Phys., 81 (2018) · doi:10.1088/1361-6633/aa9965
[34] Gardiner, C. W., Handbook of Stochastic Methods (2009), Berlin: Springer
[35] Singh, R.; Ghosh, D.; Adhikari, R., Fast Bayesian inference of the multivariate Ornstein-Uhlenbeck process (2017)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.