×

The contig assembly problem and its algorithmic solutions. (English) Zbl 1457.68336

Elloumi, Mourad (ed.), Algorithms for next-generation sequencing data. Techniques, approaches, and applications. Cham: Springer. 267-298 (2017).
Summary: DNA sequencing, assuming no prior knowledge on the target DNA fragment, may be roughly described as the succession of two steps. The first of them uses some sequencing technology to output, for a given DNA fragment (not necessarily a whole genome), a collection of possibly overlapping sequences (called reads) representing small parts of the initial DNA fragment. The second one aims at recovering the sequence of the entire DNA fragment by assembling the reads.
For the entire collection see [Zbl 1383.68005].

MSC:

68W32 Algorithms on strings
92D20 Protein sequences, DNA sequences
Full Text: DOI

References:

[1] Ansorge, W.J.: Next-generation DNA sequencing techniques. New Biotechnol. 25(4), 195-203 (2009) · doi:10.1016/j.nbt.2008.12.009
[2] Ariyaratne, P.N., Sung, W.K.: PE-Assembler: de novo assembler using short paired-end reads. Bioinformatics 27(2), 167-174 (2011) · doi:10.1093/bioinformatics/btq626
[3] Bankevich, A., Nurk, S., Antipov, D., Gurevich, A.A., Dvorkin, M., Kulikov, A.S., Lesin, V.M., Nikolenko, S.I., Pham, S., Prjibelski, A.D., et al.: SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19(5), 455-477 (2012) · doi:10.1089/cmb.2012.0021
[4] Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422-426 (1970) · Zbl 0195.47003 · doi:10.1145/362686.362692
[5] Bradnam, K.R., Fass, J.N., Alexandrov, A., Baranay, P., Bechner, M., Birol, I., Boisvert, S., Chapman, J.A., Chapuis, G., Chikhi, R., et al.: Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience 2(1), 1-31 (2013) · doi:10.1186/2047-217X-2-10
[6] Bresler, M., Sheehan, S., Chan, A.H., Song, Y.S.: Telescoper: de novo assembly of highly repetitive regions. Bioinformatics 28(18), i311-i317 (2012) · doi:10.1093/bioinformatics/bts399
[7] Bryant, D.W., Wong, W.K., Mockler, T.C.: QSRA-a quality-value guided de novo short read assembler. BMC Bioinf. 10(1), 69 (2009) · doi:10.1186/1471-2105-10-69
[8] Butler, J., MacCallum, I., Kleber, M., Shlyakhter, I.A., Belmonte, M.K., Lander, E.S., Nusbaum, C., Jaffe, D.B.: ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 18(5), 810-820 (2008) · doi:10.1101/gr.7337908
[9] Carneiro, M.O., Russ, C., Ross, M.G., Gabriel, S.B., Nusbaum, C., DePristo, M.A.: Pacific biosciences sequencing technology for genotyping and variation discovery in human data. BMC Genomics 13(1), 375 (2012) · doi:10.1186/1471-2164-13-375
[10] Chaisson, M.J., Pevzner, P.A.: Short read fragment assembly of bacterial genomes. Genome Res. 18(2), 324-330 (2008) · doi:10.1101/gr.7088808
[11] Chaisson, M.J., Brinza, D., Pevzner, P.A.: De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome Res. 19(2), 336-346 (2009) · doi:10.1101/gr.079053.108
[12] Chang, Y.J., Chen, C.C., Chen, C.L., Ho, J.M.: A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework. BMC Genomics 13(Suppl. 7), S28 (2012)
[13] Chikhi, R., Medvedev, P.: Informed and automated k-mer size selection for genome assembly. Bioinformatics 30, 31-37 (2014) · doi:10.1093/bioinformatics/btt310
[14] Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a Bloom filter. In: Proceedings of WABI 2012. LNBI, vol. 7534, pp. 236-248. Springer, Heidelberg (2012)
[15] Dohm, J.C., Lottaz, C., Borodina, T., Himmelbauer, H.: SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res. 17(11), 1697-1706 (2007) · doi:10.1101/gr.6435207
[16] Earl, D., Bradnam, K., John, J.S., Darling, A., Lin, D., Fass, J., Yu, H.O.K., Buffalo, V., Zerbino, D.R., Diekhans, M., et al.: Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res. 21(12), 2224-2241 (2011) · doi:10.1101/gr.126599.111
[17] El-Metwally, S., Hamza, T., Zakaria, M., Helmy, M.: Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput. Biol. 9(12), e1003,345+ (2013). doi:10.1371/journal.pcbi.1003345 · doi:1003%2C345%2B
[18] Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proceedings of FOCS’00, pp. 390-398 (2000)
[19] Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to NP-Completeness. Freeman, New York (1979) · Zbl 0411.68039
[20] Gnerre, S., MacCallum, I., Przybylski, D., Ribeiro, F.J., Burton, J.N., Walker, B.J., Sharpe, T., Hall, G., Shea, T.P., Sykes, S., et al.: High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc. Natl. Acad. Sci. USA 108(4), 1513-1518 (2011) · doi:10.1073/pnas.1017351108
[21] Henson, J., Tischler, G., Ning, Z.: Next-generation sequencing and large genome assemblies. Pharmacogenomics 13(8), 901-915 (2012) · doi:10.2217/pgs.12.72
[22] Hernandez, D., François, P., Farinelli, L., Østerås, M., Schrenzel, J.: De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. 18(5), 802-809 (2008) · doi:10.1101/gr.072033.107
[23] Hossain, M.S., Azimi, N., Skiena, S.: Crystallizing short-read assemblies around seeds. BMC Bioinform. 10(Suppl. 1), S16 (2009) · doi:10.1186/1471-2105-10-S1-S16
[24] Idury, R.M., Waterman, M.S.: A new algorithm for DNA sequence assembly. J. Comput. Biol. 2(2), 291-306 (1995) · doi:10.1089/cmb.1995.2.291
[25] Jeck, W.R., Reinhardt, J.A., Baltrus, D.A., Hickenbotham, M.T., Magrini, V., Mardis, E.R., Dangl, J.L., Jones, C.D.: Extending assembly of short DNA sequences to handle error. Bioinformatics 23(21), 2942-2944 (2007) · doi:10.1093/bioinformatics/btm451
[26] Kececioglu, J.D., Myers, E.W.: Combinatorial algorithms for DNA sequence assembly. Algorithmica 13, 7-51 (1993) · Zbl 0831.92013 · doi:10.1007/BF01188580
[27] Kleftogiannis, D., Kalnis, P., Bajic, V.B.: Comparing memory-efficient genome assemblers on stand-alone and cloud infrastructures. PLoS One 8(9), e75505 (2013) · doi:10.1371/journal.pone.0075505
[28] Koren, S., Harhay, G.P., Smith, T., Bono, J.L., Harhay, D.M., Mcvey, S.D., Radune, D., Bergman, N.H., Phillippy, A.M.: Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol. 14(9), R101 (2013) · doi:10.1186/gb-2013-14-9-r101
[29] Landau, G.M., Vishkin, U.: Introducing efficient parallelism into approximate string matching and a new serial algorithm. In: Proceedings of the Eighteenth Annual ACM Symposium on Theory of Computing, STOC ’86, pp. 220-230. ACM, New York (1986). doi:10.1145/12130.12152. http://doi.acm.org/10.1145/12130.12152 · doi:10.1145/12130.12152
[30] Lander, E.S., Waterman, M.S.: Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2(3), 231-239 (1988) · doi:10.1016/0888-7543(88)90007-9
[31] Li, R., Zhu, H., Ruan, J., Qian, W., Fang, X., Shi, Z., Li, Y., Li, S., Shan, G., Kristiansen, K., et al.: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20(2), 265-272 (2010) · doi:10.1101/gr.097261.109
[32] Lin, Y., Li, J., Shen, H., Zhang, L., Papasian, C.J., et al.: Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics 27(15), 2031-2037 (2011) · doi:10.1093/bioinformatics/btr319
[33] Liu, Y., Schmidt, B., Maskell, D.: Parallelized short read assembly of large genomes using de Bruijn graphs. BMC Bioinform. 12, 354 (2011) · doi:10.1186/1471-2105-12-354
[34] Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q., Liu, Y., et al.: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1(1), 1-6 (2012) · doi:10.1186/2047-217X-1-18
[35] Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. In: Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’90, pp. 319-327. Society for Industrial and Applied Mathematics, Philadelphia, PA (1990). http://dl.acm.org/citation.cfm?id=320176.320218 · Zbl 0800.68364
[36] Medvedev, P., Georgiou, K., Myers, G., Brudno, M.: Computability of models for sequence assembly. In: Proceedings of WABI 2007. LNCS, vol. 4645, pp. 289-301. Springer, Heidelberg (2007)
[37] Miller, J.R., Delcher, A.L., Koren, S., Venter, E., Walenz, B.P., Brownley, A., Johnson, J., Li, K., Mobarry, C., Sutton, G.: Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24(24), 2818-2824 (2008) · doi:10.1093/bioinformatics/btn548
[38] Myers, E.W.: The fragment assembly string graph. Bioinformatics 21(Suppl. 2), ii79-ii85 (2005)
[39] Myers, E.W., Sutton, G.G., Delcher, A.L., Dew, I.M., Fasulo, D.P., Flanigan, M.J., Kravitz, S.A., Mobarry, C.M., Reinert, K.H., Remington, K.A., et al.: A whole-genome assembly of Drosophila. Science 287(5461), 2196-2204 (2000) · doi:10.1126/science.287.5461.2196
[40] Nagarajan, N., Pop, M.: Parametric complexity of sequence assembly: theory and applications to next generation sequencing. J. Comput. Biol. 16(7), 897-908 (2009) · doi:10.1089/cmb.2009.0005
[41] Narzisi, G., Mishra, B.: Scoring-and-unfolding trimmed tree assembler: concepts, constructs and comparisons. Bioinformatics 27(2), 153-160 (2011) · doi:10.1093/bioinformatics/btq646
[42] Pell, J., Hintze, A., Canino-Koning, R., Howe, A., Tiedje, J.M., Brown, C.T.: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc. Natl. Acad. Sci. USA 109(33), 13272-13277 (2012) · Zbl 1256.68052 · doi:10.1073/pnas.1121464109
[43] Peltola, H., Söderlund, H., Ukkonen, E.: SEQAID: a DNA sequence assembling program based on a mathematical model. Nucleic Acids Res. 12(1), 307-321 (1984) · doi:10.1093/nar/12.1Part1.307
[44] Peng, Y., Leung, H.C., Yiu, S.M., Chin, F.Y.: IDBA-a practical iterative de Bruijn graph de novo assembler. In: Research in Computational Molecular Biology, pp. 426-440. Springer, Heidelberg (2010)
[45] Peng, Y., Leung, H.C., Yiu, S.M., Chin, F.Y.: Idba-ud: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28(11), 1420-1428 (2012) · doi:10.1093/bioinformatics/bts174
[46] Pevzner, P.A., Tang, H.: Fragment assembly with double-barreled data. Bioinformatics 17(Suppl. 1), S225-S233 (2001) · doi:10.1093/bioinformatics/17.suppl_1.S225
[47] Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA 98(17), 9748-9753 (2001) · Zbl 0993.92018 · doi:10.1073/pnas.171285098
[48] Pop, M., Phillippy, A., Delcher, A.L., Salzberg, S.L.: Comparative genome assembly. Brief. Bioinform. 5(3), 237-248 (2004) · doi:10.1093/bib/5.3.237
[49] Salzberg, S.L., Phillippy, A.M., Zimin, A., Puiu, D., Magoc, T., Koren, S., Treangen, T.J., Schatz, M.C., Delcher, A.L., Roberts, M., et al.: GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22(3), 557-567 (2012) · doi:10.1101/gr.131383.111
[50] Schatz, M.C., Delcher, A.L., Salzberg, S.L.: Assembly of large genomes using second-generation sequencing. Genome Res. 20(9), 1165-1173 (2010) · doi:10.1101/gr.101360.109
[51] Schmidt, B., Sinha, R., Beresford-Smith, B., Puglisi, S.J.: A fast hybrid short read fragment assembly algorithm. Bioinformatics 25(17), 2279-2280 (2009) · doi:10.1093/bioinformatics/btp374
[52] Sim, J.S., Park, K.: The consensus string problem for a metric is NP-complete. J. Discrete Algorithms 1(1), 111-117 (2003). doi:10.1016/S1570-8667(03)00011-X · Zbl 1118.68449 · doi:10.1016/S1570%E2%80%938667(03)00011-X
[53] Simpson, J.T., Durbin, R.: Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12), i367-i373 (2010) · doi:10.1093/bioinformatics/btq217
[54] Simpson, J.T., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22(3), 549-556 (2012) · doi:10.1101/gr.126953.111
[55] Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J., Birol, I.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19(6), 1117-1123 (2009) · doi:10.1101/gr.089532.108
[56] Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195-197 (1981) · doi:10.1016/0022-2836(81)90087-5
[57] Warren, R.L., Sutton, G.G., Jones, S.J., Holt, R.A.: Assembling millions of short DNA sequences using SSAKE. Bioinformatics 23(4), 500-501 (2007) · doi:10.1093/bioinformatics/btl629
[58] Ye, C., Ma, Z.S., Cannon, C.H., Pop, M., Yu, D.W.: Exploiting sparseness in de novo genome assembly. BMC Bioinform. 13(Suppl. 6), S1 (2012) · doi:10.1186/1471-2105-13-S6-S1
[59] Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5), 821-829 (2008) · doi:10.1101/gr.074492.107
[60] Zerbino, D.R., McEwen, G.K., Margulies, E.H., Birney, E.: Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de Novo assembler. PLoS One 4(12), e8407 (2009) · doi:10.1371/journal.pone.0008407
[61] Zhang, W.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.