Abstract
In this paper we introduce a new alignment-free method for comparing sequences which is combinatorial by nature and does not use any compressor nor any information-theoretic notion. Such a method is based on an extension of the Burrows-Wheeler Transform, a transformation widely used in the context of Data Compression. The new extended transformation takes as input a multiset of sequences and produces as output a string obtained by a suitable rearrangement of the characters of all the input sequences. By using such a transformation we give a general method for comparing sequences that takes into account how much the characters coming from the different input sequences are mixed in the output string. Such a method is tested on a real data set for the whole mitochondrial genome phylogeny problem. However, the goal of this paper is to introduce a new and general methodology for automatic categorization of sequences.
Similar content being viewed by others
References
Benedetto, D., Caglioti, E., Loreto, V.: Zipping out relevant information, Comput. Sci. Eng. 80–85 (2003)
Bennett, C.H., Gács, P., Li, M., Vitányi, P., Zurek, W.: Information distance. IEEE Trans. Inf. Theory 44(4), 1407–1423 (1998)
Burrows, M., Wheeler, D.J.: A block sorting data compression algorithm. Technical report, DIGITAL System Research Center (1994)
Cao, Y., Janke, A., Waddell, P.J., Westerman, M., Takenaka, O., Murata, S., Okada, N., Pääbo, S., Hasegawa, M.: Conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders. J. Mol. Evol. 47, 307–322 (1998)
Cilibrasi, R., Vitányi, P.: Clustering by compression. IEEE Trans. Inf. Theory 51(4), 1523–1545 (2005)
Crochemore, M., Désarménien, J., Perrin, D.: A note on the Burrows-Wheeler transformation. Theor. Comput. Sci. 332, 567–572 (2005)
Ergun, F., Muthukrishnan, S., Sahinalp, C.: Comparing sequences with segment rearrangements. In: Proc. of the FSTTCS’03, Bombay, India. Lecture Notes in Computer Science, pp. 183–194. Springer, Berlin (2003)
Fenwick, P.: The Burrows-Wheeler transform for block sorting text compression: principles and improvements. Comput. J. 39(9), 731–740 (1996)
Gessel, I.M., Reutenauer, C.: Counting permutations with given cycle structure and descent set. J. Comb. Theory Ser. A 64(2), 189–215 (1993)
Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In: Proc. ICALP, pp. 943–955, 2003
Li, M., Badger, J.H., Chen, X., Kwong, S., Kearney, P., Zhang, H.: An information based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17, 149–154 (2001)
Li, M., Chen, X., Li, X., Ma, B., Vitányi, P.: The similarity metric. IEEE Trans. Inf. Theory 12(5), 3250–3264 (2004)
Lothaire, M.: Combinatorics on Words. Encyclopedia of Mathematics, vol. 17. Addison–Wesley, Reading (1983). Reprinted in the Cambridge Mathematical Library, Cambridge University Press (1997)
Lothaire, M.: Algebraic Combinatorics on Words. Cambridge University Press, Cambridge (2002)
Mantaci, S., Restivo, A., Sciortino, M.: Burrows-Wheeler transform and Sturmian words. Inf. Proc. Lett. 86, 241–246 (2003)
Mantaci, S., Restivo, A., Rosone, G., Sciortino, M.: An extension of the Burrows Wheeler transform and applications to sequence comparison and data compression. In: Apostolico, A., Crochemore, M., Park, K. (eds.) Combinatorial Pattern Matching, 16th Annual Symposium, CPM 2005, Jeju Island, Korea, June 19–22, 2005, Proceedings. Lecture Notes in Computer Science, vol. 3537, pp. 178–189. Springer, Berlin (2005)
Mantaci, S., Restivo, A., Rosone, G., Sciortino, M.: A new combinatorial approach to sequence comparison. In: Coppo, M., Lodi, E., Pinna, G.M. (eds.) Proceedings of ICTCS 2005, Theoretical Computer Science, 9th Italian Conference, ICTCS 2005, Siena, Italy, October 12–14, 2005. Lecture Notes in Computer Science, vol. 3701, pp. 348–359. Springer, Berlin (2005)
Mantaci, S., Rosone, G., Restivo, A., Sciortino, M.: An extension of the burrows wheeler transform. Theor. Comput. Sci. (to appear)
Manzini, G.: The Burrows-Wheeler transform: theory and practice. In: Proc. of the 24th International Symposium on Mathematical Foundations of Computer Science (MFCS ’99). Lecture Notes in Computer Science, vol. 1672, pp. 34–47. Springer, Berlin (1999)
Otu, H.H., Sayood, K.: A new sequence distance measure for phylogenetic tree construction. Bioinformatics 19(16), 2122–2130 (2003)
Pevzner, P.A.: Statistical distance between texts and filtration methods in sequence comparison. Comput. Appl. Biosci. (8), 121–127 (1992)
Saitou, N., Nei, M.: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4
Vinga, S., Almeida, J.: Alignment-free sequence comparison—a review. Bioinformatics 19(4), 513–523 (2003)
Wilf, H.S., Fine, N.J.: Uniqueness theorem for periodic functions. Proc. Am. Math. Soc. 16, 109–114 (1965)
Author information
Authors and Affiliations
Corresponding author
Additional information
This work was partially supported by the Italian MIUR PRIN project “Automi e Linguaggi Formali: aspetti matematici ed applicativi” and by MIUR FIRB Italy-Israel project “Pattern Matching and Discovery in Discrete Structures, with applications to Bioinformatics”.
Rights and permissions
About this article
Cite this article
Mantaci, S., Restivo, A., Rosone, G. et al. A New Combinatorial Approach to Sequence Comparison. Theory Comput Syst 42, 411–429 (2008). https://doi.org/10.1007/s00224-007-9078-6
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00224-007-9078-6