×

Extended pairwise local alignment of wild card DNA/RNA sequences using dynamic programming. (English) Zbl 1457.62343

Summary: Optimal string alignment is used to discover evolutionary relationships or mutations in DNA/RNA or protein sequences. Errors, missing parts or uncertainty in such a sequence can be covered with wild cards, so-called wild bases. This makes an alignment possible even when the data are corrupted or incomplete. The extended pairwise local alignment of wild card DNA/RNA sequences requires additional calculations in the dynamic programming algorithm and necessitates a subsequent best- and worst-case analysis for the wild card positions. In this paper, we propose an algorithm which solves the problem of input data wild cards, offers a highly flexible set of parameters and displays a detailed alignment output and a compact representation of the mutated positions of the alignment. An implementation of the algorithm can be obtained at https://github.com/sysbio-bioinf/swat+ and http://sysbio.uni-ulm.de/?Software:Swat+.

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
68T05 Learning and adaptive systems in artificial intelligence
92D20 Protein sequences, DNA sequences
92-04 Software, source code, etc. for problems pertaining to biology
Full Text: DOI

References:

[1] Pierri CL, Parisi G, Porcelli V. Computational approaches for protein function prediction: a combined strategy from multiple sequence alignment to molecular docking-based virtual screening. BBA-Protein Proteomics. 2010;1804:1695-1712. doi: 10.1016/j.bbapap.2010.04.008[Crossref], [PubMed], [Web of Science ®], [Google Scholar]
[2] Lambert C, Léonard N, De Bolle X, Depiereux E. ESyPred3D: prediction of proteins 3D structures. Bioinformatics. 2002;18:1250-1256. doi: 10.1093/bioinformatics/18.9.1250[Crossref], [PubMed], [Web of Science ®], [Google Scholar]
[3] Jani M, Azad RK. Information entropy based methods for genome comparison. ACM SIGBioinform Rec. 2013;3:2: 1-2:4. [Google Scholar]
[4] Rosenberg MS. Sequence alignment: methods, models, concepts, and strategies. Berkeley: University of California Press; 2009. [Google Scholar]
[5] Haque W, Aravind A, Reddy B. Pairwise sequence alignment algorithms: a survey. In: Proceedings of the 2009 conference on Information Science, Technology and Applications; ISTA ’09; Kuwait; New York: ACM. p. 96-103. [Google Scholar]
[6] Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform. 2010;11:473-483. doi: 10.1093/bib/bbq015[Crossref], [PubMed], [Web of Science ®], [Google Scholar]
[7] Gibbs AJ, Mcintyre GA. The diagram, a method for comparing sequences. Eur J Biochem. 1970;16:1-11. doi: 10.1111/j.1432-1033.1970.tb01046.x[Crossref], [PubMed], [Google Scholar]
[8] Lipman DJ, Pearson WR. Rapid and sensitive protein similarity searches. Science. 1985;227:1435-1441. doi: 10.1126/science.2983426[Crossref], [PubMed], [Web of Science ®], [Google Scholar]
[9] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403-410. doi: 10.1016/S0022-2836(05)80360-2[Crossref], [PubMed], [Web of Science ®], [Google Scholar]
[10] Pearson WR, Miller W. Dynamic programming algorithms for biological sequence comparison. Methods Enzymol. 1992;210:575-601. doi: 10.1016/0076-6879(92)10029-D[Crossref], [PubMed], [Google Scholar]
[11] Pearson WR. Flexible sequence similarity searching with the FASTA3 program package. In: Misener S, Krawetz SA, editors. Bioinformatics methods and protocols. New York: Humana Press; 1999. p. 185-219. [Crossref], [Google Scholar]
[12] Leslie C, Kuang R. Fast string kernels using inexact matching for protein sequences. J Mach Learn Res. 2004;5:1435-1455. [Web of Science ®], [Google Scholar] · Zbl 1222.92029
[13] Zhao K, Chu X. G-BLASTN: accelerating nucleotide alignment by graphics processors. Bioinformatics. 2014. 10.1093/bioinformatics/btu047. [Crossref], [Web of Science ®], [Google Scholar]
[14] Di Tommaso P, Orobitg M, Guirado F, Cores F, Espinosa T, Notredame C. Cloud-Coffee: implementation of a parallel consistency-based multiple alignment algorithm in the T-Coffee package and its benchmarking on the Amazon Elastic-Cloud. Bioinformatics. 2010;26:1903-1904. doi: 10.1093/bioinformatics/btq304[Crossref], [PubMed], [Web of Science ®], [Google Scholar]
[15] Harfe BD, Jinks-Robertson S. Removal of frameshift intermediates by mismatch repair proteins in Saccharomyces cerevisiae. Mol Cell Biol. 1999;19:4766-4773. [PubMed], [Web of Science ®], [Google Scholar]
[16] Farrell PM, Rosentein BJ, White TB, Accurso FJ, Castellani C, Cutting GR, Durie PR, Legrys VA, Massie J, Parad RB, Rock MJ, Campbell PW III. Guidelines for diagnosis of cystic fibrosis in newborns through older adults: cystic fibrosis foundation consensus report. J Pediatr. 2008;153(2):S4-S14. [PubMed], [Web of Science ®], [Google Scholar]
[17] Walsh T, Lee MK, Casadei S, Thornton AM, Stray SM, Pennil C, Nord AS, Mandell JB, Swisher EM, King MC. Detection of inherited mutations for breast and ovarian cancer using genomic capture and massively parallel sequencing. Proc Natl Acad Sci. 2010;107:12629-12633. doi: 10.1073/pnas.1007983107[Crossref], [PubMed], [Web of Science ®], [Google Scholar]
[18] Girdea M, Noe L, Kucherov G. Back-translation for discovering distant protein homologies in the presence of frameshift mutations. Algorithms Mol Biol. 2010;5(1):1-15. doi: 10.1186/1748-7188-5-6[Crossref], [PubMed], [Web of Science ®], [Google Scholar]
[19] Taub MA, Corrada Bravo H, Irizarry RA. Overcoming bias and systematic errors in next generation sequencing data. Genome Med. 2010;2(12):1-5. doi: 10.1186/gm208[Crossref], [PubMed], [Google Scholar]
[20] Yang X, Chockalingam SP, Aluru S. A survey of error-correction methods for next-generation sequencing. Brief Bioinform. 2013;14(1):56-66. doi: 10.1093/bib/bbs015[Crossref], [PubMed], [Web of Science ®], [Google Scholar]
[21] Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, Bertoni A, Swerdlow HP, Gu Y. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012;13:341. doi: 10.1186/1471-2164-13-341[Crossref], [PubMed], [Web of Science ®], [Google Scholar]
[22] Connelly NG. Nomenclature of inorganic chemistry: IUPAC recommendations 2005. Cambridge: Royal Society of Chemistry Publishing; 2005. [Google Scholar]
[23] Ripley L. Frameshift mutation: determinants of specificity. Annu Rev Genet. 1990;24:189-211. doi: 10.1146/annurev.ge.24.120190.001201[Crossref], [PubMed], [Google Scholar]
[24] Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147:195-197. doi: 10.1016/0022-2836(81)90087-5[Crossref], [PubMed], [Web of Science ®], [Google Scholar]
[25] Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol. 1982;162:705-708. doi: 10.1016/0022-2836(82)90398-9[Crossref], [PubMed], [Web of Science ®], [Google Scholar]
[26] Guan X, Uberbacher EC. Alignments of DNA and protein sequences containing frameshift errors. Comput Appl Biosci. 1996;12:31-40. [PubMed], [Google Scholar]
[27] Huang X, Zhang J. Methods for comparing a DNA sequence with a protein sequence. Comput Appl Biosci. 1996;12:497-506. [PubMed], [Google Scholar]
[28] Lysholm F. Highly improved homopolymer aware nucleotide-protein alignments with 454 data. BMC Bioinform. 2012;13(1):1-13. doi: 10.1186/1471-2105-13-230[Crossref], [PubMed], [Web of Science ®], [Google Scholar]
[29] Chevillotte M, Einemvon J, Meier B, Lin F, Kestler HA, Mertens T. A new tool linking human cytomegalovirus drug resistance mutations to resistance phenotypes. Antivir Res. 2010;85:318-327. doi: 10.1016/j.antiviral.2009.10.004[Crossref], [PubMed], [Web of Science ®], [Google Scholar]
[30] Fitch WM. Locating gaps in amino acid sequences to optimize the homology between two proteins. Biochem Genet. 1969;3:99-108. doi: 10.1007/BF00520346[Crossref], [PubMed], [Web of Science ®], [Google Scholar]
[31] Sankoff D, Cedergren RJ, Lapalme G. Frequency of insertion-deletion, transversion, and transition in the evolution of 5S ribosomal RNA. J Mol Evolut. 1976;7:133-149. doi: 10.1007/BF01732471[Crossref], [PubMed], [Web of Science ®], [Google Scholar]
[32] Altschul SF. Generalized affine gap costs for protein sequence alignment. Proteins. 1998;32:88-96. doi: 10.1002/(SICI)1097-0134(19980701)32:1<88::AID-PROT10>3.0.CO;2-J[Crossref], [PubMed], [Web of Science ®], [Google Scholar]
[33] Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci. 1992;89:10915-10919. doi: 10.1073/pnas.89.22.10915[Crossref], [PubMed], [Web of Science ®], [Google Scholar]
[34] Dayhoff MO, Eck RV. Atlas of protein sequence and structure, 1967-68. Silver Spring, MD: National Biomedical Research Foundation; 1968. [Google Scholar]
[35] Griffiths AJ, Miller JH, Suzuki DT, Lewontin RC, Gelbart WM. An introduction to genetic analysis. 7th ed.New York: W.H. Freeman; 2000. [Google Scholar]
[36] Michel D, Pavi I, Zimmermann A, Haupt E, Wunderlich K, Heuschmid M, Mertens T. The UL97 gene product of human cytomegalovirus is an early-late protein with a nuclear localization but is not a nucleoside kinase. J Virol. 1996;70:6340-6346. [PubMed], [Web of Science ®], [Google Scholar]
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.