×

Generalized substring compression. (English) Zbl 1295.68112

For the substring-compression problem, we are asked to preprocess a string \(S [1..n]\) such that later, given \(i\) and \(j\) with \(1 \leq i \leq j \leq n\), we can quickly return the output of some pre-specified compression algorithm on \(S [i..j]\); in this paper the algorithm is LZ77 [J. Ziv and A. Lempel, IEEE Trans. Inf. Theory 23, 337–343 (1977; Zbl 0379.94010)]. For the generalized substring-compression problem, we are asked to preprocess \(S\) such that later, given \(i\), \(j\), \(\alpha\) and \(\beta\) with \(1 \leq \alpha \leq \beta \leq n\), we can quickly return the parse of \(S [i..j]\) that LZ77 would generate when already given \(S [\alpha..\beta]\) as a context. That is, we return the suffix of the output of LZ77 on \(S [\alpha..\beta] \$ S [i..j]\), where $ is a special character not occurring in \(S\), that is generated while LZ77 is processing \(S [i..j]\).
G. Cormode and S. Muthukrishnan [“Substring compression problems”, in: Proceedings of the sixteenth annual ACM-SIAM symposium on discrete algorithms, SODA 2005. New York, NY: Association for Computing Machinery (2005)] introduced and studied these problems. Via a reduction to range reporting, they gave an \(O (n \log^\epsilon n)\)-space data structure for substring compression with \(O (C (i, j) \log n \log \log n)\) query time, where \(C (i, j)\) is the number of phrases in the LZ77 parse of \(S [i..j]\). They also gave a data structure for generalized substring compression, but it was faulty. In this paper, the authors improve Cormode and Muthukrishnan’s result for substring compression, via a reduction to the range-successor problem [H.-P. Lenhof and M. Smid, RAIRO, Inform. Théor. Appl. 28, No. 1, 25–49 (1994; Zbl 0998.68520)], and give the first correct data structure for generalized substring compression, via a reduction to range emptiness. They achieve various time-space tradeoffs by choosing the data structures for range successor and range emptiness appropriately. For example, they obtain linear-space data structures with \(O (C (i, j) \log^\epsilon n)\) and \(O \left( C_{\alpha, \beta} (i, j) \log \left( \frac{j - i}{C_{\alpha, \beta} (i, j)} \right) \log^\epsilon n \right)\) query times, respectively, for substring compression and generalized substring compression.
Suppose we have already encoded \(S [i..k - 1]\) and now we want to find the next phrase in the LZ77 parse of \(S [i..j]\). For substring compression, we should find the length of the longest common prefix (LCP) of \(S [k..j]\) and any of \(S [i..n], \dots, S [k - 1..n]\). To be able to do this quickly, we represent the suffixes as points on a grid: if \(S [y..n]\) is lexicographically \(x\)th among the suffixes of \(S\), then we represent it as the point \((x, y)\). Finding the length of the LCP of \(S [k..j]\) and any of \(S [i..n], \dots, S [k - 1..n]\) that are lexicographically less (or greater) than \(S [k..j]\), is equivalent to finding the rightmost (or leftmost) point whose \(y\)-coordinate is between \(i\) and \(k - 1\) and whose \(x\)-coordinate is less (or greater) than that of the point with \(y\)-coordinate \(k\).
For generalized substring compression, we should find the length of the LCP of \(S [k..j]\) and any of \(S [\alpha..\beta], S [\alpha + 1..\beta], \dots, S [\beta], S [i..n], \dots, S [k - 1..n]\). Let \(x_{\min}\) and \(x_{\max}\) be the lexicographic ranks of the lexicographically smallest and largest suffixes of \(S\) that share prefixes of length at least \(\ell \leq j - k + 1\) with \(S [k..j]\); these can be computed quickly using, e.g., an augmented suffix tree. Determining whether the LCP of \(S [k..j]\) and \(S [\alpha..\beta], S [\alpha + 1..\beta], \dots, S [\beta]\) is at least \(\ell\) is equivalent to determining whether there is a point in the rectangle \([x_{\min}, x_{\max}] \times [\alpha, \beta]\). Thus, generalized substring compression can be solved using a search with a range-emptiness query at each step.

MSC:

68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)
68P05 Data structures
68P10 Searching and sorting
68W32 Algorithms on strings

Software:

BEETL; LZ77
Full Text: DOI

References:

[1] Alstrup, S.; Brodal, G. S.; Rauhe, T., New data structures for orthogonal range searching, (IEEE Symposium on Foundations of Computer Science (2000)), 198-207
[2] Amir, A.; Keselman, D.; Landau, G. M.; Lewenstein, M.; Lewenstein, N.; Rodeh, M., Text indexing and dictionary matching with one error, J. Algorithms, 37, 309-325 (2000) · Zbl 0966.68062
[3] Bender, M. A.; Farach-Colton, M., The level ancestor problem simplified, Theor. Comput. Sci., 321, 5-12 (2004) · Zbl 1068.68047
[4] Bille, P.; Gørtz, I. L., Substring range reporting, (Proc. of Symposium on Combinatorial Pattern Matching (CPM) (2011)), 299-308 · Zbl 1339.68049
[5] Chan, T. M.; Larsen, K. G.; Patrascu, M., Orthogonal range searching on the RAM, revisited, (Symposium on Computational Geometry (2011)), 1-10 · Zbl 1283.68139
[6] Charikar, M.; Lehman, E.; Liu, D.; Panigrahy, R.; Prabhakaran, M.; Rasala, A.; Sahai, A.; Shelat, A., Approximating the smallest grammar: Kolmogorov complexity in natural models, (Reif, J. H., STOC (2002), ACM), 792-801 · Zbl 1192.68397
[7] Chen, X.; Kwong, S.; Li, M., A compression algorithm for DNA sequences and its applications in genome comparison, (RECOMB (2000)), 107
[8] Cormode, G.; Muthukrishnan, S., Substring compression problems, (SODA’05: Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (2005), Society for Industrial and Applied Mathematics: Society for Industrial and Applied Mathematics Philadelphia, PA, USA), 321-330 · Zbl 1297.68278
[9] Cox, A. J.; Jakobi, T.; Rosone, G.; Schulz-Trieglaff, O., Comparing DNA sequence collections by direct comparison of compressed text indexes, (Raphael, B. J.; Tang, J., WABI. WABI, Lect. Notes Comput. Sci., vol. 7534 (2012), Springer), 214-224 · Zbl 1370.68340
[10] Crochemore, M.; Ilie, L., Computing longest previous factor in linear time and applications, Inf. Process. Lett., 106, 75-80 (2008) · Zbl 1186.68591
[11] Crochemore, M.; Ilie, L.; Smyth, W. F., A simple algorithm for computing the Lempel-Ziv factorization, (DCC (2008), IEEE Computer Society), 482-488
[12] Crochemore, M.; Iliopoulos, C. S.; Kubica, M.; Rahman, M. S.; Walen, T., Improved algorithms for the range next value problem and applications, (STACS 2008, 25th Annual Symposium on Theoretical Aspects of Computer Science. STACS 2008, 25th Annual Symposium on Theoretical Aspects of Computer Science, LIPIcs, vol. 1 (2008), Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik: Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik Germany), 205-216 · Zbl 1259.68226
[13] Farach, M., Optimal suffix tree construction with large alphabets, (FOCS’97: Proceedings of the 38th Annual Symposium on Foundations of Computer Science. FOCS’97: Proceedings of the 38th Annual Symposium on Foundations of Computer Science, (FOCS’97) (1997), IEEE Computer Society: IEEE Computer Society Washington, DC, USA), 137
[14] Ferragina, P., Dynamic text indexing under string updates, J. Algorithms, 22, 296-328 (1997) · Zbl 0876.68038
[15] Ferragina, P.; Muthukrishnan, S.; de Berg, M., Multi-method dispatching: A geometric approach with applications to string matching problems, (STOC (1999)), 483-491 · Zbl 1345.68103
[16] Goto, K.; Bannai, H., Simpler and faster Lempel-Ziv factorization, CoRR (2012)
[17] Harel, D.; Tarjan, R. E., Fast algorithms for finding nearest common ancestors, SIAM J. Comput., 13, 338-355 (1984) · Zbl 0535.68022
[18] Hermelin, D.; Landau, S.; Landau, G.; Weimann, O., A unified algorithm for accelerating edit-distance via text-compression, (Proc. of the 26th International Symposium on Theoretical Aspects of Computer Science (STACS) (2009)), 529-540 · Zbl 1236.68308
[19] Hon, W. K.; Shah, R.; Thankachan, S. V.; Vitter, J. S., On position restricted substring searching in succinct space, J. Discrete Algorithms, 17, 109-114 (2012) · Zbl 1267.68102
[20] Kärkkäinen, J.; Kempa, D.; Puglisi, S. J., Linear time Lempel-Ziv factorization: Simple, fast, small, (Fischer, J.; Sanders, P., CPM. CPM, Lect. Notes Comput. Sci., vol. 7922 (2013), Springer), 189-200 · Zbl 1381.68317
[21] Keller, O.; Kopelowitz, T.; Landau, S.; Lewenstein, M., Generalized substring compression, (Combinatorial Pattern Matching, 20th Annual Symposium. Combinatorial Pattern Matching, 20th Annual Symposium, CPM 2009. Combinatorial Pattern Matching, 20th Annual Symposium. Combinatorial Pattern Matching, 20th Annual Symposium, CPM 2009, Lect. Notes Comput. Sci., vol. 5577 (2009), Springer), 26-38
[22] Keller, O.; Kopelowitz, T.; Lewenstein, M., Range non-overlapping indexing and successive list indexing, (WADS’07: Proceedings of the 10th International Workshop on Algorithms and Data Structures. WADS’07: Proceedings of the 10th International Workshop on Algorithms and Data Structures, Lect. Notes Comput. Sci., vol. 4619 (2007), Springer-Verlag), 626-631 · Zbl 1209.68160
[23] Kopelowitz, T.; Lewenstein, M.; Porat, E., Persistency in suffix trees with applications to string interval problems, (Proc. of Symposium on String Processing and Information Retrieval (SPIRE) (2011)), 67-80
[24] Kuruppu, S.; Puglisi, S. J.; Zobel, J., Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval, (Chávez, E.; Lonardi, S., SPIRE. SPIRE, Lect. Notes Comput. Sci., vol. 6393 (2010), Springer), 201-206 · Zbl 1397.68073
[25] Lenhof, H. P.; Smid, M., Using persistent data structures for adding range restrictions to searching problems, RAIRO Theor. Inform. Appl., 28, 25-49 (1994) · Zbl 0998.68520
[26] Lewenstein, M., Orthogonal range searching for text indexing, CoRR (2013) · Zbl 1394.68099
[27] Mäkinen, V.; Navarro, G., Position-restricted substring searching, (LATIN 2006: Theoretical Informatics, 7th Latin American Symposium. LATIN 2006: Theoretical Informatics, 7th Latin American Symposium, Lect. Notes Comput. Sci., vol. 3887 (2006), Springer), 703-714 · Zbl 1145.68392
[28] McCreight, E. M., A space-economical suffix tree construction algorithm, J. ACM, 23, 262-272 (1976) · Zbl 0329.68042
[30] Nekrich, Y.; Navarro, G., Sorted range reporting, (Fomin, F. V.; Kaski, P., SWAT. SWAT, Lect. Notes Comput. Sci., vol. 7357 (2012), Springer), 271-282 · Zbl 1347.68343
[31] Rivals, E.; Delgrange, O.; Delahaye, J. P.; Dauchet, M.; Delorme, M. O.; Hénaut, A.; Ollivier, E., Detection of significant patterns by compression algorithms: the case of approximate tandem repeats in DNA sequences, Comput. Appl. Biosci., 13, 131-136 (1997)
[32] Rytter, W., Application of Lempel-Ziv factorization to the approximation of grammar-based compression, Theor. Comput. Sci., 302, 211-222 (2003) · Zbl 1051.68088
[33] Ukkonen, E., On-line construction of suffix trees, Algorithmica, 14, 249-260 (1995) · Zbl 0831.68027
[34] Weiner, P., Linear pattern matching algorithms, (14th Annual Symposium on Switching and Automata Theory (1973), IEEE), 1-11
[35] Ziv, J.; Lempel, A., A universal algorithm for sequential data compression, IEEE Trans. Inf. Theory, 23, 337-343 (1977) · Zbl 0379.94010
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.