
Fast inference of individual admixture coefficients using geographic data. (English) Zbl 1393.62054

Summary: Accurately evaluating the distribution of genetic ancestry across geographic space is one of the main questions addressed by evolutionary biologists. This question has commonly been addressed with Bayesian estimation programs that allow their users to estimate individual admixture proportions and allele frequencies among putative ancestral populations. Following the explosion of high-throughput sequencing technologies, several algorithms have been proposed to cope with the computational burden generated by the massive data in those studies. In this context, incorporating geographic proximity in ancestry estimation algorithms is an open statistical and computational challenge. In this study, we introduce new algorithms that use geographic information to estimate ancestry proportions and ancestral genotype frequencies from population genetic data. Our algorithms combine matrix factorization methods and spatial statistics to provide estimates of ancestry matrices based on least-squares approximation. We demonstrate the benefit of using spatial algorithms through extensive computer simulations, and we apply the new algorithms to a set of spatially referenced samples for the plant species Arabidopsis thaliana. Without loss of statistical accuracy, the new algorithms exhibit runtimes that are much shorter than those observed for previously developed spatial methods. Our algorithms are implemented in the R package tess3r.
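
The summary does not state the objective function, so the following is only an illustrative sketch of how a least-squares matrix factorization can be combined with a spatial smoothness term. It assumes a graph-regularized form in the spirit of reference [8]: a fit term ||X - Q G^T||_F^2 plus a penalty that grows when geographically close individuals have different ancestry rows. The Gaussian weights, the parameter names (sigma, alpha) and all function names are assumptions made for illustration; they are not taken from the paper or from tess3r.

```python
import math

def gaussian_weights(coords, sigma):
    """Symmetric spatial weights w_ij = exp(-d_ij^2 / sigma^2), zero diagonal."""
    n = len(coords)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            w[i][j] = w[j][i] = math.exp(-d2 / sigma ** 2)
    return w

def smoothness_penalty(q, w):
    """0.5 * sum_ij w_ij * ||q_i - q_j||^2 (graph-Laplacian quadratic form)."""
    n, k = len(q), len(q[0])
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += w[i][j] * sum((q[i][c] - q[j][c]) ** 2 for c in range(k))
    return 0.5 * total

def penalized_loss(x, q, g, w, alpha):
    """Least-squares fit of X by Q G^T plus alpha times the spatial penalty."""
    k = len(q[0])
    fit = 0.0
    for i in range(len(q)):
        for l in range(len(g)):
            pred = sum(q[i][c] * g[l][c] for c in range(k))
            fit += (x[i][l] - pred) ** 2
    return fit + alpha * smoothness_penalty(q, w)

# Toy data (hypothetical): 3 individuals, 2 loci, K = 2 ancestral groups.
coords = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]   # first two are neighbours
g = [[0.9, 0.1], [0.2, 0.8]]                    # loci x K frequencies
x = [[0.9, 0.2], [0.9, 0.2], [0.1, 0.8]]        # individuals x loci
q_smooth = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]] # neighbours share ancestry
q_rough = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # neighbours differ

w = gaussian_weights(coords, sigma=1.0)
loss_smooth = penalized_loss(x, q_smooth, g, w, alpha=1.0)
loss_rough = penalized_loss(x, q_rough, g, w, alpha=1.0)
```

In this toy setting the spatially coherent ancestry matrix q_smooth receives a much smaller loss than q_rough, which is the behaviour a spatial penalty is meant to encourage. An actual estimation algorithm would minimize such a loss over Q and G under non-negativity and sum-to-one constraints, which this sketch does not attempt.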

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62H25 Factor analysis and principal components; correspondence analysis

References:

[1] 1000 Genomes Project Consortium, Auton, A., Brooks, L. D., Durbin, R. M., Garrison, E. P., Kang, H. M., Korbel, J. O., Marchini, J. L., McCarthy, S., McVean, G. A. and Abecasis, G. R. (2015). A global reference for human genetic variation. Nature 526 68-74.
[2] Alexander, D. H. and Lange, K. (2011). Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinform. 12 246.
[3] Baran, Y., Quintela, I., Carracedo, Á., Pasaniuc, B. and Halperin, E. (2013). Enhanced localization of genetic samples through linkage-disequilibrium correction. Am. J. Hum. Genet. 92 882-894.
[4] Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15 1373-1396. · Zbl 1085.68119
[5] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289-300. · Zbl 0809.62014
[6] Bertsekas, D. P. (1995). Nonlinear Programming. Athena Scientific, Nashua, USA. · Zbl 0935.90037
[7] Bradburd, G. S., Ralph, P. L. and Coop, G. M. (2016). A spatial framework for understanding population structure and admixture. PLoS Genet. 12 e1005703.
[8] Cai, D., He, X., Han, J. and Huang, T. S. (2011). Graph regularized nonnegative matrix factorization for data representation. IEEE Trans. Pattern Anal. Mach. Intell. 33 1548-1560.
[9] Carbon, S., Ireland, A., Mungall, C. J., Shu, S., Marshall, B., Lewis, S., AmiGO Hub and Web Presence Working Group (2009). AmiGO: Online access to ontology and annotation data. Bioinformatics 25 288-289.
[10] Cavalli-Sforza, L. L., Menozzi, P. and Piazza, A. (1994). The History and Geography of Human Genes. Princeton Univ. Press, Princeton, NJ.
[11] Caye, K., Deist, T. M., Martins, H., Michel, O. and François, O. (2016). TESS3: Fast inference of spatial population structure and genome scans for selection. Mol. Ecol. Resour. 16 540-548.
[12] Chen, C., Durand, E., Forbes, F. and François, O. (2007). Bayesian clustering algorithms ascertaining spatial population structure: A new computer program and a comparison study. Mol. Ecol. Notes 7 747-756.
[13] Cichocki, A., Zdunek, R., Phan, A. H. and Amari, S. I. (2009). Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation. Wiley, Chichester.
[14] Corander, J., Sirén, J. and Arjas, E. (2008). Bayesian spatial modeling of genetic population structure. Comput. Statist. 23 111-129.
[15] Cressie, N. A. C. (1993). Statistics for Spatial Data. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., Hoboken, NJ, USA. · Zbl 1347.62005
[16] Devlin, B. and Roeder, K. (1999). Genomic control for association studies. Biometrics 55 997-1004. · Zbl 1059.62640
[17] Durand, E., Jay, F., Gaggiotti, O. E. and François, O. (2009). Spatial inference of admixture proportions and secondary contact zones. Mol. Biol. Evol. 26 1963-1973.
[18] Eastment, H. and Krzanowski, W. (1982). Cross-validatory choice of the number of components from a principal component analysis. Technometrics 24 73-77.
[19] Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J. Amer. Statist. Assoc. 99 96-104. · Zbl 1089.62502
[20] Engelhardt, B. E. and Stephens, M. (2010). Analysis of population structure: A unifying framework and novel methods based on sparse factor analysis. PLoS Genet. 6 e1001117.
[21] Epperson, B. K. (2003). Geographical Genetics. Princeton Univ. Press, Princeton, NJ.
[22] Epperson, B. K. and Li, T. (1996). Measurement of genetic structure within populations using Moran’s spatial autocorrelation statistics. Proc. Natl. Acad. Sci. USA 93 10528-10532.
[23] Fournier-Level, A., Korte, A., Cooper, M. D., Nordborg, M., Schmitt, J. and Wilczek, A. M. (2011). A map of local adaptation in Arabidopsis thaliana. Science 334 86-89.
[24] François, O. and Durand, E. (2010). Spatially explicit Bayesian clustering models in population genetics. Mol. Ecol. Resour. 10 773-784.
[25] François, O. and Waits, L. P. (2016). Clustering and assignment methods in landscape genetics. 114-128. Wiley, Chichester.
[26] François, O., Martins, H., Caye, K. and Schoville, S. D. (2016). Controlling false discoveries in genome scans for selection. Mol. Ecol. 25 454-469.
[27] Frichot, E. and François, O. (2015). LEA: An R package for landscape and ecological association studies. Methods Ecol. Evol. 6 925-929.
[28] Frichot, E., Mathieu, F., Trouillon, T., Bouchard, G. and François, O. (2014). Fast and efficient estimation of individual ancestry coefficients. Genetics 196 973-983.
[29] Grippo, L. and Sciandrone, M. (2000). On the convergence of the block nonlinear Gauss-Seidel method under convex constraints. Oper. Res. Lett. 26 127-136. · Zbl 0955.90128
[30] Hancock, A. M., Brachi, B., Faure, N., Horton, M. W., Jarymowycz, L. B., Sperone, F. G., Toomajian, C., Roux, F. and Bergelson, J. (2011). Adaptation to climate across the Arabidopsis thaliana genome. Science 334 83-86.
[31] Hardy, O. J. and Vekemans, X. (1999). Isolation by distance in a continuous population: Reconciliation between spatial autocorrelation analysis and population genetics models. Heredity 83 145-154.
[32] Horton, M. W., Hancock, A. M., Huang, Y. S., Toomajian, C., Atwell, S., Auton, A., Muliyati, N. W., Platt, A., Sperone, F. G., Vilhjálmsson, B. J., Nordborg, M., Borevitz, J. O. and Bergelson, J. (2012). Genome-wide patterns of genetic variation in worldwide Arabidopsis thaliana accessions from the RegMap panel. Nat. Genet. 44 212-216.
[33] Hudson, R. R. (2002). Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18 337-338.
[34] Kim, J. and Park, H. (2011). Fast nonnegative matrix factorization: An active-set-like method and comparisons. SIAM J. Sci. Comput. 33 3261-3281. · Zbl 1232.65068 · doi:10.1137/110821172
[35] Korneliussen, T. S., Albrechtsen, A. and Nielsen, R. (2014). ANGSD: Analysis of next generation sequencing data. BMC Bioinform. 15 356.
[36] Lao, O., Liu, F., Wollstein, A. and Kayser, M. (2014). GAGA: A new algorithm for genomic inference of geographic ancestry reveals fine level population substructure in Europeans. PLoS Comput. Biol. 10 e1003480.
[37] Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401 788-791. · Zbl 1369.68285
[38] Li, G. and Zhu, H. (2013). Genetic studies: The linear mixed models in genome-wide association studies. Open Bioinform. J. 7 27-33.
[39] Malécot, G. (1948). Les Mathématiques de L’Hérédité. Masson et Cie., Paris. · Zbl 0031.17304
[40] Mantel, N. (1967). The detection of disease clustering and a generalized regression approach. Cancer Res. 27 209-220.
[41] Martins, H., Caye, K., Luu, K., Blum, M. G. B. and François, O. (2016). Identifying outlier loci in admixed and in continuous populations using ancestral population differentiation statistics. Mol. Ecol. 25 5029-5042.
[42] Popescu, A. A., Harper, A. L., Trick, M., Bancroft, I. and Huber, K. T. (2014). A novel and fast approach for population structure inference using kernel-PCA and optimization. Genetics 198 1421-1431.
[43] Pritchard, J. K., Stephens, M. and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics 155 945-959.
[44] Raj, A., Stephens, M. and Pritchard, J. K. (2014). FastSTRUCTURE: Variational inference of population structure in large SNP data sets. Genetics 197 573-589.
[45] Rañola, J. M., Novembre, J. and Lange, K. (2014). Fast spatial ancestry via flexible allele frequency surfaces. Bioinformatics 30 2915-2922.
[46] R Core Team (2016). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Available at https://www.R-project.org/.
[47] Schraiber, J. G. and Akey, J. M. (2015). Methods and models for unravelling human evolutionary history. Nat. Rev. Genet. 16 727-740.
[48] Segelbacher, G., Cushman, S. A., Epperson, B. K., Fortin, M. J., François, O., Hardy, O. J., Holderegger, R., Taberlet, P., Waits, L. P. and Manel, S. (2010). Applications of landscape genetics in conservation biology: Concepts and challenges. Conserv. Genet. 11 375-385.
[49] Tang, H., Peng, J., Wang, P. and Risch, N. J. (2005). Estimation of individual admixture: Analytical and study design considerations. Genet. Epidemiol. 28 289-301.
[50] Wang, J. (2017). The computer program structure for assigning individuals to populations: Easy to use but easier to misuse. Mol. Ecol. Resour. 17 981-990.
[51] Weir, B. S. (1996). Genetic Data Analysis II: Methods for Discrete Population Genetic Data, Vol. 2. Sinauer Associates, Sunderland, MA.
[52] Wold, S. (1978). Cross-validatory estimation of the number of components in factor and principal components models. Technometrics 20 397-405. · Zbl 0403.62032
[53] Wollstein, A. and Lao, O. (2015). Detecting individual ancestry in the human genome. Invest. Genet. 6 7.
[54] Wright, S. (1943). Isolation by distance. Genetics 28 114-138.
[55] Yang, W.-Y., Platt, A., Chiang, C. W.-K., Eskin, E., Novembre, J. and Pasaniuc, B. (2014). Spatial localization of recent ancestors for admixed individuals. Genes Genomes Genet. 4 2505-2518.