×

Linear-time text compression by longest-first substitution. (English) Zbl 1445.68365

Summary: We consider grammar-based text compression with longest first substitution (LFS), where non-overlapping occurrences of a longest repeating factor of the input text are replaced by a new non-terminal symbol. We present the first linear-time algorithm for LFS. Our algorithm employs a new data structure called sparse lazy suffix trees. We also deal with a more sophisticated version of LFS, called LFS2, that allows better compression. The first linear-time algorithm for LFS2 is also presented.

MSC:

68W32 Algorithms on strings
68P05 Data structures
68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)
68Q42 Grammars and rewriting systems

References:

[1] Kida, T.; Matsumoto, T.; Shibata, Y.; Takeda, M.; Shinohara, A.; Arikawa, S.; Collage system: a unifying framework for compressed pattern matching; Theoretical Computer Science: 2003; Volume 298 ,253-272. · Zbl 1038.68045
[2] M¨akinen, V.; Ukkonen, E.; Navarro, G.; Approximate Matching of Run-Length Compressed Strings; Algorithmica: 2003; Volume 35 ,347-369. · Zbl 1045.68059
[3] Lifshits, Y.; Processing Compressed Texts: A Tractability Border; Proc. 18th Annual Symposium on Combinatorial Pattern Matching (CPM’07): 2007; Volume Vol. 4580 ,228-240. · Zbl 1138.68419
[4] Matsubara, W.; Inenaga, S.; Ishino, A.; Shinohara, A.; Nakamura, T.; Hashimoto, K.; Efficient Algorithms to Compute Compressed Longest Common Substrings and Compressed Palindromes; Theoretical Computer Science: 2009; Volume 410 ,900-913. · Zbl 1162.68038
[5] Hermelin, D.; Landau, G. M.; Landau, S.; Weimann, O.; A Unified Algorithm for Accelerating Edit-Distance Computation via Text-Compression; Proc. 26th International Symposium on Theoretical Aspects of Computer Science (STACS’09): 2009; ,529-540. · Zbl 1236.68308
[6] Matsubara, W.; Inenaga, S.; Shinohara, A.; Testing Square-Freeness of Strings Compressed by Balanced Straight Line Program; Proc. 15th Computing: The Australasian Theory Symposium (CATS’09): 2009; Volume Vol. 94 ,19-28. · Zbl 1286.68531
[7] Nevill-Manning, C. G.; Witten, I. H.; Identifying hierarchical structure in sequences: a linear-time algorithm; J. Artificial Intelligence Research: 1997; Volume 7 ,67-82. · Zbl 0894.68072
[8] Nevill-Manning, C. G.; Witten, I. H.; Online and offline heuristics for inferring hierarchies of repetitions in sequences; Proc. IEEE: 2000; Volume 88 ,1745-1755.
[9] Giancarlo, R.; Scaturro, D.; Utro, F.; Textual data compression in computational biology: a synopsis; Bioinformatics: 2009; Volume 25 ,1575-1586. · Zbl 1298.68087
[10] Kieffer, J. C.; Yang, E.-H.; Grammar-based codes: A new class of universal lossless source codes; IEEE Transactions on Information Theory: 2000; Volume 46 ,737-754. · Zbl 1001.94019
[11] Storer, J.; NP-completeness Results Concerning Data Compression; 1977; .
[12] Ziv, J.; Lempel, A.; Compression of individual sequences via variable-rate coding; IEEE Trans. Information Theory: 1978; Volume 24 ,530-536. · Zbl 0392.94004
[13] Welch, T. A.; A Technique for High-Performance Data Compression; IEEE Computer: 1984; Volume 17 ,8-19.
[14] Kieffer, J. C.; Yang, E.-H.; Nelson, G. J.; Cosman, P. C.; Universal lossless compression via multilevel pattern matching; IEEE Transactions on Information Theory: 2000; Volume 46 ,1227-1245. · Zbl 1003.94017
[15] Sakamoto, H.; A fully linear-time approximation algorithm for grammar-based compression; Journal of Discrete Algorithms: 2005; Volume 3 ,416-430. · Zbl 1101.68986
[16] Rytter, W.; Application of Lempel-Ziv factorization to the approximation of grammar-based compression; Theoretical Computer Science: 2003; Volume 302 ,211-222. · Zbl 1051.68088
[17] Sakamoto, H.; Maruyama, S.; Kida, T.; Shimozono, S.; A Space-Saving Approximation Algorithm for Grammar-Based Compression; IEICE Trans. on Information and Systems: 2009; Volume E92-D ,158-165.
[18] Maruyama, S.; Tanaka, Y.; Sakamoto, H.; Takeda, M.; Context-Sensitive Grammar Transform: Compression and Pattern Matching; Proc. 15th International Symposium on String Processing and Information Retrieval (SPIRE’08): 2008; Volume Vol. 5280 ,27-38.
[19] Wolff, J. G.; An algorithm for the segmentation for an artificial language analogue; British Journal of Psychology: 1975; Volume 66 ,79-90.
[20] Larsson, N. J.; Moffat, A.; Offline Dictionary-Based Compression; Proc. Data Compression Conference ’99 (DCC’99): 1999; ,296.
[21] Apostolico, A.; Lonardi, S.; Off-Line Compression by Greedy Textual Substitution; Proc. IEEE: 2000; Volume 88 ,1733-1744.
[22] Apostolico, A.; Lonardi, S.; Compression of Biological Sequences by Greedy Off-Line Textual Substitution; Proc. Data Compression Conference ’00 (DCC’00): 2000; ,143-152.
[23] Ziv, J.; Lempel, A.; A Universal Algorithm for Sequential Data Compression; IEEE Transactions on Information Theory: 1977; Volume IT-23 ,337-349. · Zbl 0379.94010
[24] Burrows, M.; Wheeler, D.; A block sorting lossless data compression algorithm; 1994; .
[25] Nakamura, R.; Bannai, H.; Inenaga, S.; Takeda, M.; Simple Linear-Time Off-Line Text Compression by Longest-First Substitution; Proc. Data Compression Conference ’07 (DCC’07): 2007; ,123-132.
[26] Inenaga, S.; Funamoto, T.; Takeda, M.; Shinohara, A.; Linear-time off-line text compression by longest-first substitution; Proc. 10th International Symposium on String Processing and Information Retrieval (SPIRE’03): 2003; Volume Vol. 2857 ,137-152. · Zbl 1254.68120
[27] Bentley, J.; McIlroy, D.; Data compression using long common strings; Proc. Data Compression Conference ’99 (DCC’99): 1999; ,287-295.
[28] Lanctot, J. K.; Li, M.; Yang, E.-H.; Estimating DNA sequence entropy; Proc. 11th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’00): 2000; ,409-418. · Zbl 0963.92033
[29] Ukkonen, E.; On-line Construction of Suffix Trees; Algorithmica: 1995; Volume 14 ,249-260. · Zbl 0831.68027
[30] K¨arkk¨ainen, J.; Ukkonen, E.; Sparse Suffix Trees; Proc. 2nd Annual International Computing and Combinatorics Conference (COCOON’96): 1996; Volume Vol. 1090 ,219-230. · Zbl 1529.68088
[31] Apostolico, A.; Preparata, F. P.; Data structures and algorithms for the string statistics problem; Algorithmica: 1996; Volume 15 ,481-494. · Zbl 0846.68023
[32] Brødal, G. S.; Lyngsø, R. B.; O¨stlin, A.; Pedersen, C. N. S.; Solving the String Stastistics Problem in Time O(n log n); Proc. 29th International Colloquium on Automata,Languages, and Programming (ICALP’02): 2002; Volume Vol. 2380 ,728-739. · Zbl 1057.68599
[33] Lanctot, J. K.; Some String Problems in Computational Biology; PhD thesis: 2004; .
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.