×

The Burrows-Wheeler transform between data compression and combinatorics on words. (English) Zbl 1370.68088

Bonizzoni, Paola (ed.) et al., The nature of computation. Logic, algorithms, applications. 9th conference on computability in Europe, CiE 2013, Milan, Italy, July 1–5, 2013. Proceedings. Berlin: Springer (ISBN 978-3-642-39052-4/pbk). Lecture Notes in Computer Science 7921, 353-364 (2013).
Summary: The Burrows-Wheeler transform (BWT) is a tool of fundamental importance in Data Compression and, recently, has found many applications well beyond its original purpose. The main goal of this paper is to highlight the mathematical and combinatorial properties on which the outstanding versatility of the BWT is based, i.e., its reversibility and the clustering effect on the output. Such properties have aroused curiosity and fervent interest in the scientific world both for theoretical aspects and for practical effects. In particular, in this paper we are interested both to survey the theoretical research issues which, by taking their cue from Data Compression, have been developed in the context of Combinatorics on Words, and to focus on those combinatorial results useful to explore the applicative potential of the Burrows-Wheeler Transform.
For the entire collection see [Zbl 1268.68007].

MSC:

68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)
68R15 Combinatorics on words
Full Text: DOI

References:

[1] Adjeroh, D., Bell, T., Mukherjee, A.: The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching, 1st edn. Springer Publishing Company, Incorporated (2008)
[2] Bauer, M. J.; Cox, A. J.; Rosone, G., Lightweight algorithms for constructing and inverting the BWT of string collections, Theoret. Comput. Sci., 483, 134-148 (2013) · Zbl 1292.68176 · doi:10.1016/j.tcs.2012.02.002
[3] Bonomo, S.; Mantaci, S.; Restivo, A.; Rosone, G.; Sciortino, M.; Béal, M.-P.; Carton, O., Suffixes, Conjugates and Lyndon words, DLT 2013, 131-142 (2013), Heidelberg: Springer, Heidelberg · Zbl 1381.68230
[4] Burrows, M., Wheeler, D.J.: A block sorting data compression algorithm. Technical report, DIGITAL System Research Center (1994)
[5] Cai, H.; Kulkarni, S. R.; Verdú, S., Universal entropy estimation via block sorting, IEEE Transactions on Information Theory, 50, 7, 1551-1561 (2004) · Zbl 1284.94035 · doi:10.1109/TIT.2004.830771
[6] Cox, A. J.; Bauer, M. J.; Jakobi, T.; Rosone, G., Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform, Bioinformatics, 28, 11, 1415-1419 (2012) · doi:10.1093/bioinformatics/bts173
[7] Cox, A. J.; Jakobi, T.; Rosone, G.; Schulz-Trieglaff, O. B.; Raphael, B.; Tang, J., Comparing DNA sequence collections by direct comparison of compressed text indexes, Algorithms in Bioinformatics, 214-224 (2012), Heidelberg: Springer, Heidelberg · Zbl 1370.68340 · doi:10.1007/978-3-642-33122-0_17
[8] Crochemore, M.; Désarménien, J.; Perrin, D., A note on the Burrows-Wheeler transformation, Theoret. Comput. Sci., 332, 567-572 (2005) · Zbl 1070.68126 · doi:10.1016/j.tcs.2004.11.014
[9] de Luca, A.; Mycielski, J.; Rozenberg, G.; Salomaa, A., Combinatorics of standard sturmian words, Structures in Logic and Computer Science (1997), Heidelberg: Springer, Heidelberg · Zbl 0884.68100 · doi:10.1007/3-540-63246-8_15
[10] de Luca, A.; Mignosi, F., Some combinatorial properties of sturmian words, Theoret. Comput. Sci., 136, 2, 361-385 (1994) · Zbl 0874.68245 · doi:10.1016/0304-3975(94)00035-H
[11] Droubay, X.; Justin, J.; Pirillo, G., Episturmian words and some constructions of de Luca and Rauzy, Theoret. Comput. Sci., 255, 1-2, 539-553 (2001) · Zbl 0981.68126 · doi:10.1016/S0304-3975(99)00320-5
[12] Effros, M.; Visweswariah, K.; Kulkarni, S. R.; Verdú, S., Universal lossless source coding with the Burrows Wheeler Transform, IEEE Transactions on Information Theory, 48, 5, 1061-1081 (2002) · Zbl 1061.94018 · doi:10.1109/18.995542
[13] Ferenczi, S., Zamboni, L.Q.: Clustering Words and Interval Exchanges. Journal of Integer Sequences 16(2), Article 13.2.1 (2013) · Zbl 1291.68305
[14] Ferragina, P.; Gagie, T.; Manzini, G., Lightweight Data Indexing and Compression in External Memory, Algorithmica, 63, 3, 707-730 (2012) · Zbl 1241.68062 · doi:10.1007/s00453-011-9535-0
[15] Ferragina, P.; Giancarlo, R.; Manzini, G.; Sciortino, M., Boosting textual compression in optimal linear time, J. ACM, 52, 4, 688-713 (2005) · Zbl 1323.68260 · doi:10.1145/1082036.1082043
[16] Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: FOCS 2000, pp. 390-398. IEEE Computer Society (2000)
[17] Ferragina, P., Manzini, G.: An experimental study of an opportunistic index. In: SODA 2001, pp. 269-278. SIAM (2001) · Zbl 1002.68519
[18] Ferragina, P.; Manzini, G., Indexing compressed text, J. ACM, 52, 552-581 (2005) · Zbl 1323.68261 · doi:10.1145/1082036.1082039
[19] Ferragina, P.; Nitto, I.; Venturini, R., On optimally partitioning a text to improve its compression, Algorithmica, 61, 51-74 (2011) · Zbl 1221.68302 · doi:10.1007/s00453-010-9437-6
[20] Gessel, I. M.; Reutenauer, C., Counting permutations with given cycle structure and descent set, J. Combin. Theory Ser. A, 64, 2, 189-215 (1993) · Zbl 0793.05004 · doi:10.1016/0097-3165(93)90095-P
[21] Giancarlo, R.; Sciortino, M.; Baeza-Yates, R.; Chávez, E.; Crochemore, M., Optimal partitions of strings: A new class of Burrows-Wheeler compression algorithms, Combinatorial Pattern Matching, 129-143 (2003), Heidelberg: Springer, Heidelberg · Zbl 1279.68367 · doi:10.1007/3-540-44888-8_10
[22] Gil, J.Y., Scott, D.A.: A bijective string sorting transform. CoRR (2012); abs/1201.3077
[23] Hon, W.-K.; Ku, T.-H.; Lu, C.-H.; Shah, R.; Thankachan, S. V.; Kärkkäinen, J.; Stoye, J., Efficient Algorithm for Circular Burrows-Wheeler Transform, Combinatorial Pattern Matching, 257-268 (2012), Heidelberg: Springer, Heidelberg · Zbl 1358.68341 · doi:10.1007/978-3-642-31265-6_21
[24] Kaplan, H.; Landau, S.; Verbin, E., A simpler analysis of Burrows-Wheeler-based compression, Theoret. Comput. Sci., 387, 3, 220-235 (2007) · Zbl 1144.68020 · doi:10.1016/j.tcs.2007.07.020
[25] Kaplan, H.; Verbin, E.; Ma, B.; Zhang, K., Most burrows-wheeler based compressors are not optimal, Combinatorial Pattern Matching, 107-118 (2007), Heidelberg: Springer, Heidelberg · Zbl 1138.68416 · doi:10.1007/978-3-540-73437-6_13
[26] Knuth, D.; Morris, J.; Pratt, V., Fast pattern matching in strings, SIAM Journal on Computing, 6, 2, 323-350 (1977) · Zbl 0372.68005 · doi:10.1137/0206024
[27] Kufleitner, M.: On bijective variants of the Burrows-Wheeler transform, pp. 65-79 (2009)
[28] Likhomanov, K. M.; Shur, A. M.; Kulikov, A.; Vereshchagin, N., Two combinatorial criteria for BWT images, Computer Science - Theory and Applications, 385-396 (2011), Heidelberg: Springer, Heidelberg · Zbl 1332.68272 · doi:10.1007/978-3-642-20712-9_30
[29] Lothaire, M.: Algebraic Combinatorics on Words. Cambridge Univ. Press (2002) · Zbl 1001.68093
[30] Mantaci, S.; Restivo, A.; Rosone, G.; Sciortino, M.; Apostolico, A.; Crochemore, M.; Park, K., An Extension of the Burrows Wheeler Transform and Applications to Sequence Comparison and Data Compression, Combinatorial Pattern Matching, 178-189 (2005), Heidelberg: Springer, Heidelberg · Zbl 1130.68314 · doi:10.1007/11496656_16
[31] Mantaci, S.; Restivo, A.; Rosone, G.; Sciortino, M., An extension of the Burrows-Wheeler Transform, Theoret. Comput. Sci., 387, 3, 298-312 (2007) · Zbl 1144.68024 · doi:10.1016/j.tcs.2007.07.014
[32] Mantaci, S.; Restivo, A.; Rosone, G.; Sciortino, M., A new combinatorial approach to sequence comparison, Theory Comput. Syst., 42, 3, 411-429 (2008) · Zbl 1136.68047 · doi:10.1007/s00224-007-9078-6
[33] Mantaci, S.; Restivo, A.; Sciortino, M., Burrows-Wheeler transform and Sturmian words, Information Processing Letters, 86, 241-246 (2003) · Zbl 1162.68511 · doi:10.1016/S0020-0190(02)00512-4
[34] Mantaci, S.; Restivo, A.; Sciortino, M., Distance measures for biological sequences: Some recent approaches, Int. J. Approx. Reasoning, 47, 1, 109-124 (2008) · Zbl 1183.92035 · doi:10.1016/j.ijar.2007.03.011
[35] Manzini, G., An analysis of the Burrows-Wheeler transform, J. ACM, 48, 3, 407-430 (2001) · Zbl 1323.68262 · doi:10.1145/382780.382782
[36] Ng, K.-H., Ho, C.-K., Phon-Amnuaisuk, S.: A hybrid distance measure for clustering expressed sequence tags originating from the same gene family. PLoS ONE 7(10) (2012)
[37] Jenkinson, O.; Zamboni, L. Q., Characterisations of balanced words via orderings, Theoret. Comput. Sci., 310, 1, 247-271 (2004) · Zbl 1071.68090 · doi:10.1016/S0304-3975(03)00397-9
[38] Pak, I.; Redlich, A., Long cycles in abc-permutations, Functional Analysis and Other Mathematics, 2, 87-92 (2008) · Zbl 1175.05141 · doi:10.1007/s11853-008-0017-0
[39] Restivo, A.; Rosone, G., Burrows-Wheeler transform and palindromic richness, Theoret. Comput. Sci., 410, 30-32, 3018-3026 (2009) · Zbl 1173.68055 · doi:10.1016/j.tcs.2009.03.008
[40] Restivo, A.; Rosone, G., Balancing and clustering of words in the Burrows-Wheeler transform, Theoret. Comput. Sci., 412, 27, 3019-3032 (2011) · Zbl 1220.68081 · doi:10.1016/j.tcs.2010.11.040
[41] Simpson, J., Puglisi, S.J.: Words with simple Burrows-Wheeler transforms. Electronic Journal of Combinatorics 15 article R83 (2008) · Zbl 1183.68446
[42] Simpson, J. T.; Durbin, R., Efficient construction of an assembly string graph using the FM-index, Bioinformatics, 26, 12, 367-373 (2010) · doi:10.1093/bioinformatics/btq217
[43] Vinga, S.; Almeida, J., Alignment-free sequence comparison a review, Bioinformatics, 19, 4, 513-523 (2003) · doi:10.1093/bioinformatics/btg005
[44] Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann (1999) · Zbl 0821.68051
[45] Yang, L.; Zhang, X.; Wang, T., The Burrows-Wheeler similarity distribution between biological sequences based on Burrows-Wheeler transform, Journal of Theoretical Biology, 262, 4, 742-749 (2010) · Zbl 1403.92192 · doi:10.1016/j.jtbi.2009.10.033
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.