Abstract
Frequent patterns (motifs) in biological sequences are good candidates to correspond to structurally or functionally important elements. Existing tools for the exhaustive detection of approximated motifs typically output a long list that contains some real motifs (i.e., patterns representing functional elements) along with a large number of random variations of them, called artifacts. Artifacts inflate the output, often making the results redundant and hard for biologists to use. In this paper, we provide a new solution to the problem of separating real motifs from artifacts. We define a notion of motif maximality, called maximality in conservation, which, when applied to the output of existing motif-finding tools, allows us to identify and remove artifacts. Detection relies on the fact that the variations of a motif share a large subset of the occurrences of the real motif, while the real motif is more conserved than any of its artifacts. Experiments show that the tool we implemented according to this definition considerably reduces the output size by removing artifacts, at a negligible time cost.
This work was supported in part by MIUR of Italy under project AlgoDEEP prot. 2008TFBWL4.
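The detection idea stated in the abstract (an artifact shares most of its occurrences with a real motif that is more conserved) can be illustrated with a rough sketch. The abstract does not give the formal definition of maximality in conservation, so everything below is an assumption made for illustration only: conservation is approximated by the mean Hamming distance between a motif and its occurrences, "a large subset" is modelled by a hypothetical `min_overlap` threshold, and the names `Motif`, `find_occurrences`, and `remove_artifacts` are invented here and are not the authors' tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Motif:
    pattern: str
    occurrences: frozenset  # start positions of approximate occurrences


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_occurrences(pattern, text, max_mismatches):
    """Start positions where text matches pattern within max_mismatches."""
    k = len(pattern)
    return frozenset(
        p for p in range(len(text) - k + 1)
        if hamming(pattern, text[p:p + k]) <= max_mismatches
    )


def mean_distance(motif, text):
    """Assumed conservation proxy: lower mean distance = more conserved."""
    if not motif.occurrences:
        return float("inf")
    k = len(motif.pattern)
    total = sum(hamming(motif.pattern, text[p:p + k]) for p in motif.occurrences)
    return total / len(motif.occurrences)


def remove_artifacts(motifs, text, min_overlap=0.8):
    """Drop motifs dominated by a more conserved motif sharing their occurrences.

    min_overlap is a hypothetical threshold, not a parameter from the paper.
    """
    scores = {m: mean_distance(m, text) for m in motifs}
    kept = []
    for m in motifs:
        dominated = any(
            other != m
            and len(m.occurrences & other.occurrences) >= min_overlap * len(m.occurrences)
            and scores[other] < scores[m]
            for other in motifs
        )
        if not dominated:
            kept.append(m)
    return kept


# Toy usage: "ACGA" is a one-letter variation of the exact repeat "ACGT".
text = "ACGTGGACGTCCACGT"
candidates = [
    Motif(p, find_occurrences(p, text, max_mismatches=1))
    for p in ("ACGA", "ACGT")
]
print([m.pattern for m in remove_artifacts(candidates, text)])
```

In this toy input both candidates occur at the same three positions, but only "ACGT" matches them exactly, so the variation "ACGA" is discarded. The quadratic pairwise comparison is kept for readability; it says nothing about how the authors achieve their reported negligible time cost.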
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Federico, M., Pisanti, N. (2011). Removing Artifacts of Approximated Motifs. In: Böhm, C., Khuri, S., Lhotská, L., Pisanti, N. (eds) Information Technology in Bio- and Medical Informatics. ITBAM 2011. Lecture Notes in Computer Science, vol 6865. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23208-4_14
DOI: https://doi.org/10.1007/978-3-642-23208-4_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23207-7
Online ISBN: 978-3-642-23208-4
eBook Packages: Computer Science, Computer Science (R0)