Kermit: guided long read assembly using coloured overlap graphs. (English) Zbl 1494.92091

Parida, Laxmi (ed.) et al., 18th international workshop on algorithms in bioinformatics, WABI 2018, Helsinki, Finland, August 20–22, 2018. Proceedings. Wadern: Schloss Dagstuhl – Leibniz Zentrum für Informatik. LIPIcs – Leibniz Int. Proc. Inform. 113, Article 11, 11 p. (2018).
Summary: With long reads becoming ever longer and cheaper, large-scale sequencing projects can be accomplished without short reads at an affordable cost. Due to high error rates and less mature tools, de novo assembly of long reads is still challenging and often results in a large collection of contigs. Dense linkage maps are collections of markers whose location on the genome is approximately known. They therefore provide long-range information that has the potential to greatly aid de novo assembly. Previously, linkage maps have been used to detect misassemblies and to manually order contigs. However, no fully automated tools exist for incorporating linkage maps into assembly; instead, a large amount of manual labour is needed to order the contigs into chromosomes. We formulate the genome assembly problem in the presence of linkage maps and present the first method for guided genome assembly using linkage maps. Our method is based on an additional cleaning step added to the assembly. We show that it can simplify the underlying assembly graph, resulting in more contiguous assemblies and fewer misassemblies than de novo assembly.
For the entire collection see [Zbl 1393.68016].
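
To make the idea of a cleaning step on a coloured overlap graph concrete, the following is a minimal Python sketch. It is not the method of the paper: the summary does not state the colouring or cleaning rules, so everything here is an assumption made for illustration. Each read is assumed to carry a set of integer colours (hypothetical bins along the linkage map, derived from the markers it contains), and an overlap is assumed to be trustworthy only if the two reads have colours at most `max_gap` apart. The function name `clean_overlap_graph` and the parameter `max_gap` are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def clean_overlap_graph(
    edges: Iterable[Edge],
    colours: Dict[str, Set[int]],
    max_gap: int = 1,
) -> List[Edge]:
    """Drop overlaps between reads whose colours are far apart.

    edges   -- overlap pairs (read_a, read_b)
    colours -- read id -> set of integer colours (assumed: map bins)
    max_gap -- assumed compatibility threshold between colours
    """
    kept = []
    for a, b in edges:
        ca, cb = colours.get(a), colours.get(b)
        # Reads without colours carry no map information, so keep the edge.
        if not ca or not cb:
            kept.append((a, b))
            continue
        # Keep the edge if any pair of colours is close enough on the map.
        if min(abs(x - y) for x in ca for y in cb) <= max_gap:
            kept.append((a, b))
    return kept


edges = [("r1", "r2"), ("r2", "r3"), ("r1", "r4"), ("r3", "r5")]
colours = {"r1": {0}, "r2": {0, 1}, "r3": {1}, "r4": {7}}
cleaned = clean_overlap_graph(edges, colours)
print(cleaned)  # [('r1', 'r2'), ('r2', 'r3'), ('r3', 'r5')]
```

In this toy input the overlap between `r1` and `r4` is removed because their colours lie far apart on the assumed map, which is the kind of spurious, repeat-induced edge that long-range information could help eliminate. The uncoloured read `r5` keeps its overlap because nothing is known about its position.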

MSC:

92D20 Protein sequences, DNA sequences
92-08 Computational methods for problems pertaining to biology

References:

[1] V. Ahola, R. Lehtonen, P. Somervuo, et al. The Glanville fritillary genome retains an ancient karyotype and reveals selective chromosomal fusions in Lepidoptera. Nature Communications, 5:4737, 2014.
[2] B. Alipanahi, L. Salmela, S.J. Puglisi, M. Muggli, and C. Boucher. Disentangled long-read de Bruijn graphs via optical maps. In R. Schwartz and K. Reinert, editors, WABI 2017, volume 88 of LIPIcs, pages 1:1-1:14, Dagstuhl, Germany, 2017. · Zbl 1443.92132
[3] A. Bankevich, S. Nurk, D. Antipov, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol., 19(5):455-477, 2012.
[4] S.M. Van Belleghem, P. Rastas, A. Papanicolaou, et al. Complex modular architecture around a simple toolkit of wing pattern genes. Nature Ecology & Evolution, 1:0052, 2017.
[5] J. Catchen. Chromonomer. http://catchenlab.life.illinois.edu/chromonomer/, 2015. Accessed: 2018-04-27.
[6] G. Chartrand, G.L. Johns, K.A. McKeon, and P. Zhang. Rainbow connection in graphs. Mathematica Bohemica, 133(1):85-98, 2008. · Zbl 1199.05106
[7] C.-S. Chin, P. Peluso, F.J. Sedlazeck, et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nature Methods, 13:1050-1054, 2016.
[8] J.L. Fierst. Using linkage maps to correct and scaffold de novo genome assemblies: methods, challenges, and computational tools. Frontiers in Genetics, 6:220, 2015.
[9] A. Gurevich, V. Saveliev, N. Vyahhi, and G. Tesler. QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8):1072-1075, 2013.
[10] M. Kolmogorov, J. Yuan, Y. Lin, and P. Pevzner. Assembly of long error-prone reads using repeat graphs. In Proc. RECOMB 2018, pages 261-263, 2018.
[11] S. Koren, B.P. Walenz, K. Berlin, J.R. Miller, N.H. Bergman, and A.M. Phillippy. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res., 27:722-736, 2017.
[12] H. Li. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14):2103-2110, 2016.
[13] H. Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 2018. (To appear).
[14] H.C. Lin, S. Goldstein, L. Mendelowitz, S. Zhou, J. Wetzel, D.C. Schwartz, and M. Pop. AGORA: assembly guided by optical restriction alignment. BMC Bioinformatics, 13:189, 2012.
[15] T. Paterson and A. Law. ArkMAP: integrating genomic maps across species and data sources. BMC Bioinformatics, 14:246, 2013.
[16] R. Vaser, I. Sovic, N. Nagarajan, and M. Sikic. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research, 27:737-746, 2017.
[17] P. Rastas. Lep-MAP3: robust linkage mapping even for low-coverage whole genome sequencing data. Bioinformatics, 33(23):3726-3732, 2017.
[18] J. Salojärvi, O.P. Smolander, K. Nieminen, et al. Genome sequencing and population genomic analyses provide insights into the adaptive landscape of silver birch. Nature Genetics, 49:904-912, 2017.
[19] K. Schneeberger, S. Ossowski, F. Ott, et al. Reference-guided assembly of four diverse Arabidopsis thaliana genomes. PNAS, 108(25):10249-10254, 2011.
[20] B.K. Stöcker, J. Köster, and S. Rahmann. SimLoRD: Simulation of long read data. Bioinformatics, 32(17):2704-2706, 2016.