×

On the number of elements to reorder when updating a suffix array. (English) Zbl 1237.68268

Summary: Recently new algorithms appeared for updating the Burrows-Wheeler transform or the suffix array, when the text they index is modified. These algorithms proceed by reordering entries and the number of such reordered entries may be as high as the length of the text. However, in practice, these algorithms are faster for updating the Burrows-Wheeler transform or the suffix array than the fastest reconstruction algorithms.
In this article we focus on the number of elements to be reordered for real-life texts. We show that this number is related to LCP values and that, on average, Lave entries are reordered, where Lave denotes the average LCP value, defined as the average length of the longest common prefix between two consecutive sorted suffixes. Since we know little about the LCP distribution for real-life texts, we conduct experiments on a corpus that consists of DNA sequences and natural language texts. The results show that apart from texts containing large repetitions, the average LCP value is close to the one expected on a random text.

MSC:

68W32 Algorithms on strings
Full Text: DOI

References:

[1] M. Burrows, D.J. Wheeler, A block-sorting lossless data compression algorithm, Tech. Rep. 124, DEC, Palo Alto, CA, 1994.; M. Burrows, D.J. Wheeler, A block-sorting lossless data compression algorithm, Tech. Rep. 124, DEC, Palo Alto, CA, 1994.
[2] Cleary, J. G.; Witten, I., Data compression using adaptive coding and partial string matching, IEEE Trans. Commun., 32, 4, 396-402 (1984)
[3] Cleary, J. G.; Teahan, W. J.; Witten, I., Unbounded length contexts for PPM, Comput. J., 40, 2/3, 67-76 (1997)
[4] J. Fayolle, M. Ward, Analysis of the average depth in a suffix tree under a Markov model, in: International Conference on the Analysis of Algorithms, 2005.; J. Fayolle, M. Ward, Analysis of the average depth in a suffix tree under a Markov model, in: International Conference on the Analysis of Algorithms, 2005. · Zbl 1104.68029
[5] P. Ferragina, G. Manzini, Opportunistic data structures with applications, in: Proc. of Foundations of Computer Science (FOCS), 2000.; P. Ferragina, G. Manzini, Opportunistic data structures with applications, in: Proc. of Foundations of Computer Science (FOCS), 2000.
[6] Ferragina, P.; Manzini, G.; Mäkinen, V.; Navarro, G., Compressed representation of sequences and full-text indexes, ACM Trans. Algorithms, 3 (2007), article 20 · Zbl 1321.68263
[7] Ferragina, P.; Manzini, G.; Muthukrishnan, S., The Burrows-Wheeler transform, Theoret. Comput. Sci., 387, 3, 197-360 (2007), (special issue)
[8] Gallé, M.; Peterlongo, P.; Coste, F., In-place update of suffix array while recoding words, (Holub, J.; Žďárek, J., Proceedings of the Prague Stringology Conference 2008 (2008), Czech Technical University in Prague: Czech Technical University in Prague Czech Republic) · Zbl 1186.68134
[9] Gonnet, G. H.; Baeza-Yates, R. A.; Snider, T., New indices for text: Pat trees and pat arrays, Information Retrieval: Data Structures & Algorithms, 66-82 (1992)
[10] Haubold, B.; Wiehe, T., How repetitive are genomes?, BMC Bioinformatics, 7 (2006), 541+
[11] J. Kärkkäinen, P. Sanders, Simple linear work suffix array construction, in: Proc. of International Colloquium on Automata, Languages, and Programming (ICALP), 2003.; J. Kärkkäinen, P. Sanders, Simple linear work suffix array construction, in: Proc. of International Colloquium on Automata, Languages, and Programming (ICALP), 2003. · Zbl 1039.68042
[12] Kasai, T.; Lee, G.; Arimura, H.; Arikawa, S.; Park, K., Linear-time longest-common-prefix computation in suffix arrays and its applications, (Proc. of Combinatorial Pattern Matching (CPM). Proc. of Combinatorial Pattern Matching (CPM), Lecture Notes in Comput. Sci., vol. 2089 (2001)) · Zbl 0990.68639
[13] P. Ko, S. Aluru, Space efficient linear time construction of suffix arrays, in: Proc. of Combinatorial Pattern Matching (CPM), 2003.; P. Ko, S. Aluru, Space efficient linear time construction of suffix arrays, in: Proc. of Combinatorial Pattern Matching (CPM), 2003. · Zbl 1279.68069
[14] Lam, T. W.; Sung, W. K.; Tam, S. L.; Wong, C. K.; Yiu, S. M., Compressed indexing and local alignment of DNA, Bioinformatics, 24, 6, 791-797 (2008)
[15] Langmead, B.; Trapnell, C.; Pop, M.; Salzberg, S. L., Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Genome Biology, 10, 3, R25 (2009)
[16] Liolios, K.; Mavrommatis, K.; Tavernarakis, N.; Kyrpides, N. C., The genome on line database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata, Nuc. Acid Research, 36, 475-479 (2008), URL
[17] U. Manber, G. Myers, Suffix arrays: a new method for on-line string searches, in: Proc. of Symposium on Discrete Algorithms (SODA), 1990.; U. Manber, G. Myers, Suffix arrays: a new method for on-line string searches, in: Proc. of Symposium on Discrete Algorithms (SODA), 1990. · Zbl 0800.68364
[18] G. Manzini, Two space saving tricks for linear time LCP array computation, in: Proc. 9th Scandinavian Workshop on Algorithm Theory, vol. 3111, 2004.; G. Manzini, Two space saving tricks for linear time LCP array computation, in: Proc. 9th Scandinavian Workshop on Algorithm Theory, vol. 3111, 2004. · Zbl 1095.68755
[19] Manzini, G.; Ferragina, P., Engineering a lightweight suffix array construction algorithm, Algorithmica, 40, 1, 33-50 (2004) · Zbl 1082.68867
[20] Puglisi, S. J.; Smyth, W. F.; Turpin, A., A taxonomy of suffix array construction algorithms, ACM Comp. Surv., 39, 2, 1-31 (2007)
[21] Salson, M.; Lecroq, T.; Léonard, M.; Mouchard, L., A four-stage algorithm for updating a Burrows-Wheeler Transform, Theoret. Comput. Sci., 410, 43, 4350-4359 (2009) · Zbl 1187.68685
[22] Salson, M.; Lecroq, T.; Léonard, M.; Mouchard, L., Dynamic extended suffix array, J. Discrete Algorithms, 8, 241-257 (2010) · Zbl 1201.68044
[23] J. Seward, On the performance of BWT sorting algorithms, in: Proc. of Data Compression Conference (DCC), 2000.; J. Seward, On the performance of BWT sorting algorithms, in: Proc. of Data Compression Conference (DCC), 2000.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.