×

Unified compression-based acceleration of edit-distance computation. (English) Zbl 1259.68048

Summary: The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution for this problem computes the edit-distance between a pair of strings of total length \(O(N)\) in \(O(N ^{2})\) time. To this date, this quadratic upper-bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound in case the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings.
As it turns out, practically all known \(o(N ^{2})\) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression. In this paper we set out to answer this question by using straight line programs. These provide a generic platform for representing many popular compression schemes including the LZ-family, run-length encoding, byte-pair encoding, and dictionary methods.
For two strings of total length \(N\) having straight-line program representations of total size \(n\), we present an algorithm running in \(O(nN\lg(N/n))\) time for computing the edit-distance of these two strings under any rational scoring function, and an \(O(n ^{2/3} N ^{4/3})\) time algorithm for arbitrary scoring functions. Our new result, while providing a speed up for compressible strings, does not surpass the quadratic time bound even in the worst case scenario.

MSC:

68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)
68W05 Nonnumerical algorithms
90C39 Dynamic programming

References:

[1] Aggarwal, A., Klawe, M.M., Moran, S., Shor, P., Wilber, R.: Geometric applications of a matrix-searching algorithm. Algorithmica 2, 195–208 (1987) · Zbl 0642.68078 · doi:10.1007/BF01840359
[2] Amir, A., Benson, G., Farach, M.: Let sleeping files lie: pattern matching in Z-compressed files. J. Comput. System Sci. 52(2), 299–307 (1996) · Zbl 1152.68436 · doi:10.1006/jcss.1996.0023
[3] Amir, A., Landau, G.M., Sokol, D.: Inplace 2D matching in compressed images. In: Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 853–862 (2003) · Zbl 1092.68696
[4] Apostolico, A., Landau, G.M., Skiena, S.: Matching for run length encoded strings. J. Complexity 15(1), 4–16 (1999) · Zbl 0921.68041 · doi:10.1006/jcom.1998.0493
[5] Apostolico, A., Atallah, M.J., Larmore, L.L., McFaddin, S.: Efficient parallel algorithms for string editing and related problems. SIAM J. Comput. 19(5), 968–988 (1990) · Zbl 0711.68055 · doi:10.1137/0219066
[6] Arbell, O., Landau, G.M., Mitchell, J.: Edit distance of run-length encoded strings. Inform. Process. Lett. 83(6), 307–314 (2001) · Zbl 1043.68059 · doi:10.1016/S0020-0190(02)00215-6
[7] Bille, P., Farach-Colton, M.: Fast and compact regular expression matching. Theoret. Comput. Sci. 302(1–3), 211–222 (2003) · Zbl 1051.68088 · doi:10.1016/S0304-3975(02)00777-6
[8] Bunke, H., Csirik, J.: An improved algorithm for computing the edit distance of run length coded strings. Inform. Process. Lett. 54, 93–96 (1995) · Zbl 1004.68598 · doi:10.1016/0020-0190(95)00005-W
[9] Cégielski, P., Guessarian, I., Lifshits, Y., Matiyasevich, Y.: Window subsequence problems for compressed texts. In: Proceedings of the First International Computer Science Symposium in Russia (CSR), pp. 127–136 (2006) · Zbl 1178.68694
[10] Crochemore, M., Landau, G.M., Ziv-Ukelson, M.: A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM J. Comput. 32, 1654–1673 (2003) · Zbl 1253.74047 · doi:10.1137/S0097539702402007
[11] Gage, P.: A new algorithm for data compression. C Users J. 12, 23–38 (1994)
[12] Gasieniec, L., Karpinski, M., Plandowski, W., Rytter, W.: Efficient algorithms for Lempel-Ziv encoding. In: Proceedings of the 4th Scandinavian Workshop on Algorithm Theory (SWAT), pp. 392–403 (1996)
[13] Gusfield, D.: Algorithms on Strings, Trees, and Sequences. Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997) · Zbl 0934.68103
[14] Hermelin, D., Landau, S., Landau, G.M., Weimann, O.: A unified algorithm for accelerating edit-distance via text-compression. In: Proceedings of the 26th International Symposium on Theoretical Aspects of Computer Science (STACS), pp. 529–540 (2009) · Zbl 1236.68308
[15] Karkkainen, J., Ukkonen, E.: Lempel-Ziv parsing and sublinear-size index structures for string matching. In: Proceedings of the 3rd South American Workshop on String Processing (WSP), pp. 141–155 (1996)
[16] Karkkainen, J., Navarro, G., Ukkonen, E.: Approximate string matching over Ziv-Lempel compressed text. In: Proceedings of the 11th Symposium on Combinatorial Pattern Matching (CPM), pp. 195–209 (2000) · Zbl 0964.68574
[17] Karpinski, M., Rytter, W., Shinohara, A.: Pattern-matching for strings with short descriptions. In: Proceedings of the 6th Symposium on Combinatorial Pattern Matching (CPM), pp. 205–214 (1995) · Zbl 0874.68087
[18] Kida, T., Matsumoto, T., Shibata, Y., Takeda, M., Shinohara, A., Arikawa, S.: Collage system: a unifying framework for compressed pattern matching. Theoret. Comput. Sci. 298, 253–272 (2003) · Zbl 1038.68045 · doi:10.1016/S0304-3975(02)00426-7
[19] Lehman, E., Shelat, A.: Approximation algorithms for grammar-based compression. In: Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 205–212 (2002) · Zbl 1093.68593
[20] Lempel, A., Ziv, J.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theory 24(5), 530–536 (1978) · Zbl 0392.94004 · doi:10.1109/TIT.1978.1055934
[21] Lempel, A., Ziv, J.: A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory 409(3), 486–496 (2008) · Zbl 0379.94010
[22] Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. 10(8), 707–710 (1966)
[23] Lifshits, Y.: Processing compressed texts: a tractability border. In: Proceedings of the 18th Symposium on Combinatorial Pattern Matching (CPM), pp. 228–240 (2007) · Zbl 1138.68419
[24] Lifshits, Y., Lohrey, M.: Querying and embedding compressed texts. In: Proceedings of the 31st International Symposium on Mathematical Foundations of Computer Science (MFCS), pp. 681–692 (2006) · Zbl 1132.68379
[25] Makinen, V., Navarro, G., Ukkonen, E.: Approximate matching of run-length compressed strings. In: Proceedings of the 12th Symposium on Combinatorial Pattern Matching (CPM), pp. 1–13 (1999)
[26] Manber, U.: A text compression scheme that allows fast searching directly in the compressed file. In: Proceedings of the 5th Symposium on Combinatorial Pattern Matching (CPM), pp. 31–49 (1994)
[27] Masek, W.J., Paterson, M.S.: A faster algorithm computing string edit distances. J. Comput. System Sci. 20, 18–31 (1980) · Zbl 0436.68044 · doi:10.1016/0022-0000(80)90002-1
[28] Miyazaki, M., Shinohara, A., Takeda, M.: An improved pattern matching algorithm for strings in terms of straight-line programs. In: Proceedings of the 8th Symposium on Combinatorial Pattern Matching (CPM), pp. 1–11 (1997)
[29] Mozes, S., Weimann, O., Ziv-Ukelson, M.: Speeding up HMM decoding and training by exploiting sequence repetitions. In: Proceedings of the 18th Symposium on Combinatorial Pattern Matching (CPM), pp. 4–15 (2007) · Zbl 1138.68420
[30] Navarro, G., Kida, T., Takeda, M., Shinohara, A., Arikawa, S.: Faster approximate string matching over compressed text. In: Proceedings of the 11th Data Compression Conference (DCC), pp. 459–468 (2001)
[31] Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoret. Comput. Sci. 302(1–3), 211–222 (2003) · Zbl 1051.68088 · doi:10.1016/S0304-3975(02)00777-6
[32] Schmidt, J.P.: All highest scoring paths in weighted grid graphs and their application to finding all approximate repeats in strings. SIAM J. Comput. 27(4), 972–992 (1998) · Zbl 0907.68076 · doi:10.1137/S0097539795288489
[33] Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., Arikawa, S.: Speeding up pattern matching by text compression. In: Proceedings of the 4th Italian Conference on Algorithms and Complexity (CIAC), pp. 306–315 (2000) · Zbl 0971.68632
[34] Tiskin, A.: Faster subsequence recognition in compressed strings. Journal of Mathematical Sciences, to appear · Zbl 1187.68507
[35] Tiskin, A.: Fast distance multiplication of unit-monge matrices. In: Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1287–1296 (2010) · Zbl 1288.68303
[36] Wagner, R., Fischer, M.: The string-to-string correction problem. J. ACM 21(1), 168–173 (1974) · Zbl 0278.68032 · doi:10.1145/321796.321811
[37] Ziv, J., Lempel, A.: On the complexity of finite sequences. IEEE Trans. Inform. Theory 22(1), 75–81 (1976) · Zbl 0337.94013 · doi:10.1109/TIT.1976.1055501
[38] Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory 23(3), 337–343 (1977) · Zbl 0379.94010 · doi:10.1109/TIT.1977.1055714
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.