
Detection of over-represented motifs corresponding to known TFBSs via motif clustering and matching. (English) Zbl 1189.68116

Summary: Detection of over-represented motifs corresponding to known TFBSs (transcription factor binding sites) is an important problem in biological sequence analysis. In this paper, a novel motif discovery method based on motif clustering and matching is proposed. Against a precompiled library of motifs described as position weight matrices (PWMs), each \(L\)-mer in the data set is matched to a motif based on the \(p\)-value of its match score; the PWMs are then updated and clustered according to their similarity. Motif features are ranked in terms of statistical significance (\(p\)-value). We present an implementation of this approach, named MotifCM, which is capable of discovering multiple distinct motifs present in a single data set. We apply our method to a benchmark comprising 56 data sets and demonstrate that the performance of MotifCM on this benchmark compares well to, and in many cases exceeds, that of existing tools.
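
The matching step described in the summary can be illustrated with a minimal Python sketch. It is not the MotifCM implementation: the function names, the toy two-motif library, the pseudocount, the uniform background, and the brute-force \(p\)-value (enumerating all \(4^L\) sequences, feasible only for short motifs) are all assumptions made for illustration.

```python
import itertools
import math

BASES = "ACGT"


def log_odds(counts, background=0.25, pseudocount=0.5):
    """Turn per-position base counts into a log-odds PWM (assumed uniform background)."""
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def score(pwm, lmer):
    """Match score of an L-mer: sum of the per-position log-odds entries."""
    return sum(column[base] for column, base in zip(pwm, lmer))


def p_value(pwm, observed):
    """P(score >= observed) for a uniformly random L-mer, by full enumeration."""
    length = len(pwm)
    hits = sum(
        1
        for candidate in itertools.product(BASES, repeat=length)
        if score(pwm, candidate) >= observed - 1e-12  # tolerance for float ties
    )
    return hits / 4 ** length


def best_match(library, lmer):
    """Assign an L-mer to the library motif whose match score has the smallest p-value."""
    p_values = {name: p_value(pwm, score(pwm, lmer)) for name, pwm in library.items()}
    name = min(p_values, key=p_values.get)
    return name, p_values[name]


def consensus_counts(consensus, strong=9, weak=1):
    """Hypothetical helper: build toy count columns that favour a consensus string."""
    return [{b: (strong if b == c else weak) for b in BASES} for c in consensus]


# Toy library of two made-up motifs (hypothetical data, not taken from the paper).
LIBRARY = {
    "tata_like": log_odds(consensus_counts("TATA")),
    "gc_like": log_odds(consensus_counts("GCGC")),
}

if __name__ == "__main__":
    print(best_match(LIBRARY, "TATA"))
```

In practice the exact enumeration would be replaced by a dynamic-programming or sampled score distribution, and the background would be estimated from the data; the sketch only shows how a \(p\)-value can serve as the criterion for assigning an \(L\)-mer to a library motif.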

MSC:

68T10 Pattern recognition, speech recognition
92D20 Protein sequences, DNA sequences
05C90 Applications of graph theory
62H30 Classification and discrimination; cluster analysis (statistical aspects)
