×

Refining the \(r\)-index. (English) Zbl 1435.68075

Summary: The second author et al.’s \(r\)-index [in: Proceedings of the 29th annual ACM-SIAM symposium on discrete algorithms, SODA’18. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM); New York, NY: Association for Computing Machinery (ACM). 1459–1477 (2018; Zbl 1403.68051)] promises to speed up DNA alignment and variation calling by allowing us to index entire genomic databases, provided certain obstacles can be overcome. In this paper we first strengthen and simplify A. Policriti and N. Prezza’s Toehold Lemma [“Computing LZ77 in run-compressed space”, in: Proceedings of the 2016 data compression conference, DCC’16. Los Alamitos, CA: IEEE Computer Society. 23–32 (2016; doi:10.1109/DCC.2016.30); Algorithmica 80, No. 7, 1986–2011 (2018; Zbl 1392.68186)], which inspired the \(r\)-index and plays an important role in its implementation. We then show how to update the \(r\)-index efficiently after adding a new genome to the database, which is likely to be vital in practice. As a by-product of this result, we obtain an online version of Policriti and Prezza’s algorithm for constructing the LZ77 parse from a run-length compressed Burrows-Wheeler Transform. Our experiments demonstrate the practicality of all three of these results. Finally, we show how to augment the \(r\)-index such that, given a new genome and fast random access to the database, we can quickly compute the matching statistics and maximal exact matches of the new genome with respect to the database.

MSC:

68P15 Database theory
68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)
92D20 Protein sequences, DNA sequences

References:

[1] Bannai, H.; Gagie, T.; I, T., Online LZ77 parsing and matching statistics with RLBWTs, (Proceedings of the 29th Annual Symposium on Combinatorial Pattern Matching. Proceedings of the 29th Annual Symposium on Combinatorial Pattern Matching, CPM (2018)) · Zbl 1497.68169
[2] Danecek, P.; Auton, A.; Abecasis, G.; Albers, C. A.; Banks, E.; DePristo, M. A.; Handsaker, R. E.; Lunter, G.; Marth, G. T.; Sherry, S. T., The variant call format and VCFtools, Bioinformatics, 27, 15, 2156-2158 (2011)
[3] Langmead, B.; Salzberg, S. L., Fast gapped-read alignment with Bowtie 2, Nat. Methods, 9, 4, 357 (2012)
[4] Li, H.; Durbin, R., Fast and accurate long-read alignment with Burrows-Wheeler transform, Bioinformatics, 26, 5, 589-595 (2010)
[5] Ferragina, P.; Manzini, G., Indexing compressed text, J. ACM, 52, 4, 552-581 (2005) · Zbl 1323.68261
[6] Backurs, A.; Indyk, P., Edit distance cannot be computed in strongly subquadratic time (unless SETH is false), (Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing. Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC (2015), ACM), 51-58 · Zbl 1321.68548
[7] Cohen-Addad, V.; Feuilloley, L.; Starikovskaya, T. A., Lower bounds for text indexing with mismatches and differences, (Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms. Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA (2019)), 1146-1164 · Zbl 1431.68169
[8] Consortium, C. P.-G., Computational pan-genomics: status, promises and challenges, Brief. Bioinform., Article bbw089 pp. (2016)
[9] Gagie, T.; Puglisi, S. J., Searching and indexing genomic databases via kernelization, Front. Bioeng. Biotechnol., 3, 12 (2015)
[10] Valenzuela, D.; Norri, T.; Välimäki, N.; Pitkänen, E.; Mäkinen, V., Towards pan-genome read alignment to improve variation calling, BMC Genomics, 19, 2, 87 (2018)
[11] Gagie, T.; Navarro, G.; Prezza, N., On the approximation ratio of Lempel-Ziv parsing, (Latin American Symposium on Theoretical Informatics (2018), Springer), 490-503 · Zbl 1485.68129
[12] Policriti, A.; Prezza, N., From LZ77 to the run-length encoded Burrows-Wheeler transform, and back, (Proceedings of the 28th Symposium on Combinatorial Pattern Matching. Proceedings of the 28th Symposium on Combinatorial Pattern Matching, CPM (2017)), 17:1-17:10 · Zbl 1434.68155
[13] Gagie, T.; Navarro, G.; Prezza, N., Optimal-time text indexing in BWT-runs bounded space, (Proceedings of the 19th Symposium on Discrete Algorithms. Proceedings of the 19th Symposium on Discrete Algorithms, SODA (2018)), 1459-1477 · Zbl 1403.68051
[14] Boucher, C.; Gagie, T.; Kuhnle, A.; Manzini, G., Prefix-free parsing for building big BWTs, (Proceedings of the 18th International Workshop on Algorithms in Bioinformatics. Proceedings of the 18th International Workshop on Algorithms in Bioinformatics, WABI (2018)), 2:1-2:16 · Zbl 1494.92083
[15] Boucher, C.; Gagie, T.; Kuhnle, A.; Langmead, B.; Manzini, G.; Mun, T., Prefix-free parsing for building big BWTs, Algorithms Mol. Biol., 14, 1, 13:1-13:15 (2019)
[16] Kuhnle, A.; Mun, T.; Boucher, C.; Gagie, T.; Langmead, B.; Manzini, G., Efficient construction of a complete index for pan-genomics read alignment, (Proc. 23rd Annual International Conference on Research in Computational Molecular Biology. Proc. 23rd Annual International Conference on Research in Computational Molecular Biology, RECOMB, 2019 (2019)), 158-173 · Zbl 1412.92202
[17] Mantaci, S.; Restivo, A.; Rosone, G.; Sciortino, M., An extension of the Burrows-Wheeler transform, Theor. Comput. Sci., 387, 3, 298-312 (2007) · Zbl 1144.68024
[18] Li, H., Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, arXiv preprint
[19] Manber, U.; Myers, E. W., Suffix arrays: a new method for on-line string searches, SIAM J. Comput., 22, 5, 935-948 (1993) · Zbl 0784.68027
[20] Burrows, M.; Wheeler, D. J., A Block Sorting Lossless Compression Algorithm (Dec. 1994), Tech. Rep. 124
[21] Ohno, T.; Sakai, K.; Takabatake, Y.; I, T.; Sakamoto, H., A faster implementation of online RLBWT and its application to LZ77 parsing, J. Discret. Algorithms, 52, 18-28 (2018) · Zbl 1410.68417
[22] Online RLBWT
[23] Taher Mun’s Fork of r-index:
[24] Mori, Yuta, divsufsort: A lightweight suffix array construction algorithm
[25] Okanohara, D.; Sadakane, K., A linear-time Burrows-Wheeler transform using induced sorting, (Proc. 16th International Symposium on String Processing and Information Retrieval. Proc. 16th International Symposium on String Processing and Information Retrieval, SPIRE 2009 (2009)), 90-101
[26] dbwt. Direct BWT construction based on induced sorting
[27] Kempa, D., Optimal construction of compressed indexes for highly repetitive texts, (Proc. 30th Annual ACM-SIAM Symposium on Discrete Algorithms. Proc. 30th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019 (2019)), 1344-1357 · Zbl 1431.68172
[28] Prezza, Nicola, r-index: the run-length BWT index · Zbl 1392.68186
[29] Belazzougui, D.; Cunial, F.; Gagie, T.; Prezza, N.; Raffinot, M., Composite repetition-aware data structures, (Proc. 26th Annual Symposium on Combinatorial Pattern Matching. Proc. 26th Annual Symposium on Combinatorial Pattern Matching, CPM 2015 (2015)), 26-39 · Zbl 1432.68082
[30] Ziv, J.; Lempel, A., A universal algorithm for sequential data compression, IEEE Trans. Inf. Theory, 23, 3, 337-343 (1977) · Zbl 0379.94010
[31] Ohlebusch, E.; Gog, S., Lempel-Ziv factorization revisited, (Proc. 22nd Annual Symposium on Combinatorial Pattern Matching. Proc. 22nd Annual Symposium on Combinatorial Pattern Matching, CPM 2011 (2011)), 15-26 · Zbl 1339.68335
[32] Kempa, D.; Puglisi, S. J., Lempel-Ziv factorization: simple, fast, practical, (ALENEX (2013)), 103-112 · Zbl 1430.68462
[33] Goto, K.; Bannai, H., Simpler and faster Lempel Ziv factorization, (Proc. Data Compression Conference. Proc. Data Compression Conference, DCC 2013 (2013)), 133-142
[34] Goto, K.; Bannai, H., Space efficient linear time Lempel-Ziv factorization for small alphabets, (Proc. Data Compression Conference. Proc. Data Compression Conference, DCC 2014 (2014)), 163-172
[35] Kärkkäinen, J.; Kempa, D.; Puglisi, S. J., Lightweight Lempel-Ziv parsing, (Proceedings of the 13th Symposium on Experimental Algorithms. Proceedings of the 13th Symposium on Experimental Algorithms, SEA (2013)), 139-150
[36] Yamamoto, J.; I, T.; Bannai, H.; Inenaga, S.; Takeda, M., Faster compact on-line Lempel-Ziv factorization, (STACS (2014)), 675-686 · Zbl 1359.68341
[37] Policriti, A.; Prezza, N., Fast online Lempel-Ziv factorization in compressed space, (Proceedings of the 22nd Symposium on String Processing and Information Retrieval. Proceedings of the 22nd Symposium on String Processing and Information Retrieval, SPIRE (2015)), 13-20
[38] Fischer, J.; Gagie, T.; Gawrychowski, P.; Kociumaka, T., Approximating LZ77 via small-space multiple-pattern matching, (Proc. 23rd Annual European Symposium on Algorithms. Proc. 23rd Annual European Symposium on Algorithms, ESA 2015 (2015)), 533-544 · Zbl 1466.68090
[39] Kosolobov, D., Faster lightweight Lempel-Ziv parsing, (Proc. 40th International Symposium on Mathematical Foundations of Computer Science. Proc. 40th International Symposium on Mathematical Foundations of Computer Science, MFCS 2015 (2015)), 432-444 · Zbl 1465.68316
[40] Belazzougui, D.; Puglisi, S. J., Range predecessor and Lempel-Ziv parsing, (Proc. 27th Annual ACM-SIAM Symposium on Discrete Algorithms. Proc. 27th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016 (2016)), 2053-2071 · Zbl 1410.68116
[41] Policriti, A.; Prezza, N., LZ77 computation based on the run-length encoded BWT, Algorithmica, 80, 7, 1986-2011 (2018) · Zbl 1392.68186
[42] Fischer, J.; I, T.; Köppl, D.; Sadakane, K., Lempel-Ziv factorization powered by space efficient suffix trees, Algorithmica, 80, 7, 2048-2081 (2018) · Zbl 1392.68184
[43] Nishimoto, T.; Tabei, Y., Conversion from RLBWT to LZ77, (Proc. 30th Annual Symposium on Combinatorial Pattern Matching. Proc. 30th Annual Symposium on Combinatorial Pattern Matching, CPM 2019 (2019)), 9:1-9:12 · Zbl 07559177
[44] Kreft, S.; Navarro, G., LZ77-like compression with fast random access, (Proc. Data Compression Conference. Proc. Data Compression Conference, DCC 2010 (2010)), 239-248
[45] Kempa, D.; Kosolobov, D., LZ-end parsing in compressed space, (Proc. Data Compression Conference. Proc. Data Compression Conference, DCC 2017 (2017)), 350-359
[46] LZscan
[47] DYNAMIC: dynamic succinct/compressed data structures library
[48] Prezza, N., A framework of dynamic data structures for string processing, (Proc. 16th International Symposium on Experimental Algorithms. Proc. 16th International Symposium on Experimental Algorithms, SEA (2017)), 11:1-11:15 · Zbl 1432.68610
[49] Prezza, N., Compressed Computation for Text Indexing (2016), University of Udine, Ph.D. thesis
[50] Get-git-revisions: get all revisions of a git repository
[51] Li, H., Tabix: fast retrieval of sequence features from generic TAB-delimited files, Bioinformatics, 27, 5, 718-719 (2011)
[52] Kuruppu, S.; Puglisi, S. J.; Zobel, J., Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval, (Proceedings of the 17th Symposium on String Processing and Information Retrieval. Proceedings of the 17th Symposium on String Processing and Information Retrieval, SPIRE (2010)), 201-206 · Zbl 1397.68073
[53] Cox, A. J.; Farruggia, A.; Gagie, T.; Puglisi, S. J.; Sirén, J., RLZAP: relative Lempel-Ziv with adaptive pointers, (Proceedings of the 23rd Symposium on String Processing and Information Retrieval. Proceedings of the 23rd Symposium on String Processing and Information Retrieval, SPIRE (2016)), 1-14 · Zbl 1397.68068
[54] Belazzougui, D.; Navarro, G., Optimal lower and upper bounds for representing sequences, ACM Trans. Algorithms, 11, 4, 31:1-31:21 (2015) · Zbl 1398.68103
[55] Gagie, T.; Navarro, G.; Prezza, N., Fully-functional suffix trees and optimal text searching in BWT-runs bounded space, CoRR · Zbl 1491.68067
[56] Kärkkäinen, J.; Manzini, G.; Puglisi, S. J., Permuted longest-common-prefix array, (Proc. 20th Annual Symposium on Combinatorial Pattern Matching. Proc. 20th Annual Symposium on Combinatorial Pattern Matching, CPM (2009), Springer), 181-192 · Zbl 1247.68336
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.