×

An efficient polynomial space and polynomial delay algorithm for enumeration of maximal motifs in a sequence. (English) Zbl 1123.68134

Summary: In this paper, we consider the problem of enumerating all maximal motifs in an input string for the class of repeated motifs with wild cards. A maximal motif is such a representative motif that is not properly contained in any larger motifs with the same location lists. Although the enumeration problem for maximal motifs with wild cards has been studied in [L. Parida, I. Rigoutsos and D. Platt, Lect. Notes Comput. Sci. 2089, 131–142 (2001; Zbl 0990.68536); N. Pisanti, M. Crochemore, R. Grossi and M.-F. Sagot, Lect. Notes Comput. Sci. 2747, 622–631 (2003); J. Pelfrêne, S. Abdeddaïm and J. Alexandre, Lect. Notes Comput. Sci. 2676, 328–347 (2003)] its output-polynomial time computability has been still open. The main result of this paper is a polynomial space polynomial delay algorithm for the maximal motif enumeration problem for the repeated motifs with wild cards. This algorithm enumerates all maximal motifs in an input string of length \(n\) in \(O (n^{3})\) time per motif with \(O (n)\) space, in particular \(O (n^{3})\) delay. The key of the algorithm is depth-first search on a tree-shaped search route over all maximal motifs based on a technique called prefix-preserving closure extension. We also show an exponential lower bound and a succinctness result on the number of maximal motifs, which indicate the limit of a straightforward approach. The results of the computational experiments show that our algorithm can be applicable to huge string data such as genome data in practice, and does not take large additional computational cost compared to usual frequent motif mining algorithms.

MSC:

68W05 Nonnumerical algorithms
68W40 Analysis of algorithms
68T05 Learning and adaptive systems in artificial intelligence

Citations:

Zbl 0990.68536

Software:

CloseGraph; gSpan
Full Text: DOI

References:

[1] Apostolico A, Comin M 2, Parida L (2005) Conservative extraction of over-represented extensible motifs. ISMB (Supplement of Bioinformatics) 21:9–18
[2] Apostolico A, Parida L (2003) Compression and the wheel of fortune. In: Proceedings of the 2003 data compression conference (DCC’03), IEEE
[3] Arimura H, Uno T (2005) An output-polynomial time algorithm for mining frequent closed attribute trees. In: Proceedings of the ILP’05, LNAI 3625, pp 1–19 · Zbl 1134.68464
[4] Arimura H, Shinohara T, Otsuki S (1994) Finding minimal generalizations for unions of pattern languages and its application to inductive inference from positive data. In: STACS’94, LNCS 775, Springer-Verlag, pp 649–660 · Zbl 0941.68683
[5] Boros E, Gurvich V, Khachiyan L, Makino K (2002) The complexity of generating maximal frequent and minimal infrequent sets. In: Proceedings of the STACS ’02, LNCS, pp 133–141 · Zbl 1054.68072
[6] Crochemore M, Rytter W (2002) Jewels of stringology. World Scientific · Zbl 1078.68151
[7] Goldberg LA (1993) Polynomial space polynomial delay algorithms for listing families of graphs. In: Proceedings of the 25th STOC, ACM, pp 218–225 · Zbl 1310.68108
[8] Gusfield D (1997) Algorithms on strings, trees, and sequences. Cambridge · Zbl 0934.68103
[9] Parida L, Rigoutsos I, Floratos A, Platt D, Gao Y (2000) Pattern discovery on character sets and real-valued data: linear bound on irredundant motifs and effcient polynomial time algorithm. In: Proceedings of the 11th SIAM symposium on discrete algorithms (SODA’00), pp 297–308 · Zbl 0956.68134
[10] Parida L, Rigoutsos I, Platt DE (2001) An output-sensitive flexible pattern discovery algorithm. In: Proceedings of the CPM’01, LNCS 2089, pp 131–142 · Zbl 0990.68536
[11] Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: Proceedings of the ICDT’99, pp 398–416 · Zbl 0983.68511
[12] Pelfrêne J, Abdeddaim S, Alexandre J (2003) Extending approximate patterns. In: Proceeding of the CPM’03, LNCS 2676, pp 328–347 · Zbl 1279.68374
[13] Pisanti N, Crochemore M, Grossi R, Sagot M-F (2003) A basis of tiling motifs for generating repeated patterns and its complexity for higher quorum, In: Proceedings of the MFCS’03, LNCS 2747, pp 622–631 · Zbl 1124.68454
[14] Pisanti N, Crochemore M, Grossi R, Sagot M-F (2004) A comparative study of bases for motif inference. String algorithmics. KCL publications · Zbl 1124.68454
[15] Uno T (2003) Two general methods to reduce delay and change of enumeration algorithms, NII Technical Report, NII-2003-004E, April 2003
[16] Uno T, Asai T, Uchida Y, Arimura H (2004) An efficient algorithm for enumerating closed patterns in transaction databases. In: Proceedings of the DS’04, LNAI 3245, pp 16–30 · Zbl 1110.68472
[17] Valiant LG (1979) The complexity of computing the permanent. Theor Comput Sci 8:189–201 · Zbl 0415.68008 · doi:10.1016/0304-3975(79)90044-6
[18] Yan X, Han J (2003) CloseGraph: mining closed frequent graph patterns. In: Proceedings of the SIGKDD’03
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.