×

Finding biclusters by random projections. (English) Zbl 1171.68865

Summary: Given a matrix \(X\) composed of symbols, a bicluster is a submatrix of \(X\) obtained by removing some of the rows and some of the columns of \(X\) in such a way that each row of what is left reads the same string. In this paper, we are concerned with the problem of finding the bicluster with the largest area in a large matrix \(X\). The problem is first proved to be NP-complete. We present a fast and efficient randomized algorithm that discovers the largest bicluster by random projections. A detailed probabilistic analysis of the algorithm and an asymptotic study of the statistical significance of the solutions are given. We report results of extensive simulations on synthetic data.

MSC:

68W20 Randomized algorithms
68T10 Pattern recognition, speech recognition
68W40 Analysis of algorithms
Full Text: DOI

References:

[1] C.C. Aggarwal, C. Procopiuc, J.L. Wolf, P.S. Yu, J.S. Park, Fast algorithms for projected clustering, in: Proc. ACM SIGMOD Internat. Conf. on Management of Data (SIGMOD-99), SIGMOD Record, Vol. 28(2), ACM Press, New York, 1999, pp. 61-72.; C.C. Aggarwal, C. Procopiuc, J.L. Wolf, P.S. Yu, J.S. Park, Fast algorithms for projected clustering, in: Proc. ACM SIGMOD Internat. Conf. on Management of Data (SIGMOD-99), SIGMOD Record, Vol. 28(2), ACM Press, New York, 1999, pp. 61-72.
[2] R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, in: Proc. ACM SIGMOD Internat. Conf. on Management of Data (SIGMOD-98), ACM SIGMOD Record, Vol. 27(2), ACM Press, New York, 1998, pp. 94-105.; R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, in: Proc. ACM SIGMOD Internat. Conf. on Management of Data (SIGMOD-98), ACM SIGMOD Record, Vol. 27(2), ACM Press, New York, 1998, pp. 94-105.
[3] Ben-Dor, A.; Chor, B.; Karp, R.; Yakhini, Z., Discovering local structure in gene expression data: the order-preserving submatrix problem, (Proc. Sixth Internat. Conf. on Computational Molecular Biology (RECOMB 2002) (2002), ACM Press: ACM Press New York), 45-55
[4] Cheng, Y.; Church, G. M., Biclustering of expression data, (Proc. Eighth Internat. Conf. on Intelligent Systems for Molecular (ISMB-00) (2000), AAAI Press: AAAI Press Menlo Park, CA), 93-103
[5] Dawande, M.; Keskinocak, P.; Swaminathan, J. M.; Tayur, S., On bipartite and multipartite clique problems, J. Algorithms, 41, 388-403 (2001) · Zbl 1017.68088
[6] Dhillon, I., Co-clustering documents and words using bipartite spectral graph partitioning, (Proc. Seventh ACM SIGKDD Internat. Conf. on Knowledge Discovery and Data Mining (KDD-01) (2001), ACM Press: ACM Press New York), 269-274
[7] Dhillon, I.; Mallela, S.; Modha, D., Information-theoretic co-clustering, (Proc. Seventh ACM SIGKDD Internat. Conf. on Knowledge Discovery and Data Mining (KDD-03) (2001), ACM Press: ACM Press New York), 89-98
[8] Garey, M. R.; Johnson, D. S., Computers and Intractability: A Guide to the Theory of NP-Completeness (1979), Freeman: Freeman New York, NY · Zbl 0411.68039
[9] M. Grigni, F. Manne, On the complexity of the generalized block distribution, in: Proc. of Irregular’96, The Third Internat. Workshop on Parallel Algorithms for Irregularly Structured Problems, Lecture Notes in Computer Science, Vol. 1117, Springer, Berlin, 1996, pp. 319-326.; M. Grigni, F. Manne, On the complexity of the generalized block distribution, in: Proc. of Irregular’96, The Third Internat. Workshop on Parallel Algorithms for Irregularly Structured Problems, Lecture Notes in Computer Science, Vol. 1117, Springer, Berlin, 1996, pp. 319-326.
[10] M. Gu, H. Zha, C. Ding, X. He, H. Simon, Spectral relaxation models and structure analysis for \(k\)-way graph clustering and bi-clustering, Technical Report CSE-01-007, Department of Computer Science and Engineering, Pennsylvania State University, 2001.; M. Gu, H. Zha, C. Ding, X. He, H. Simon, Spectral relaxation models and structure analysis for \(k\)-way graph clustering and bi-clustering, Technical Report CSE-01-007, Department of Computer Science and Engineering, Pennsylvania State University, 2001.
[11] Hanisch, D.; Zien, A.; Zimmer, R.; Lengauer, T., Co-clustering of biological networks and gene expression data, (Proc. Eighth Internat. Conf. on Intelligent Systems for Molecular (ISMB-02) (2002), AAAI Press: AAAI Press Menlo Park, CA), 145-154
[12] Hartigan, J. A., Direct clustering of a data matrix, J. Amer. Statist. Assoc., 67, 337, 123-129 (1972)
[13] Hastie, T.; Tibshirani, R.; Eisen, M.; Alizadeh, A.; Levy, R.; Staudt, L.; Chan, W.; Botstein, D.; Brown, P., ‘Gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns, Genome Biol., 1, 2, 1-21 (2000)
[14] Hochbaum, D. S., Approximating clique and biclique problems, J. Algorithms, 29, 1, 174-200 (1998) · Zbl 0919.68056
[15] Kluger, Y.; Basri, R.; Chang, J.; Gerstein, M., Spectral biclustering of microarray data: coclustering genes and conditions, Genome Res., 13, 4, 703-716 (2003)
[16] Koyuturk, M.; Szpankowski, W.; Grama, A., Biclustering gene-feature matrices for statistically significant dense patterns, (IEEE Computer Society Bioinformatics Conf.. IEEE Computer Society Bioinformatics Conf., Stanford, CA (2004)), 480-483
[17] Kumar, R.; Raghavan, P.; Rajagopalan, S.; Tomkins, A., Trawling the web for emerging cyber-communities, Comput. Networks, 31, 11-16, 1481-1493 (1999)
[18] Lazzeroni, L.; Owen, A., Plaid models for gene expression data, Statist. Sinica, 12, 1, 61-86 (2002) · Zbl 1004.62084
[19] Liu, J.; Yang, J.; Wang, W., Biclustering in gene expression data by tendency, (IEEE Computational Systems Bioinformatics Conf. (CSB’04) (2004)), 224-235
[20] Melkman, A. A.; Shaham, E., Sleeved coclustering, (Proc. 2004 ACM SIGKDD Internat. Conf. on Knowledge Discovery and Data Mining. Proc. 2004 ACM SIGKDD Internat. Conf. on Knowledge Discovery and Data Mining, Seattle, WA (2004)), 635-640
[21] Mishra, N.; Ron, D.; Swaminathan, R., On finding large conjunctive clusters, (Proc. ACM Conf. on Computational Learning Theory (COLT’03) (2003)), 448-462 · Zbl 1274.68340
[22] T.M. Murali, S. Kasif, Extracting conserved gene expression motifs from gene expression data, in: Proc. Pacific Symp. on Biocomputing (PSB’03), 2003, pp. 77-88.; T.M. Murali, S. Kasif, Extracting conserved gene expression motifs from gene expression data, in: Proc. Pacific Symp. on Biocomputing (PSB’03), 2003, pp. 77-88. · Zbl 1219.92024
[23] D.V. Pasechnik, Bipartite sandwiches, Technical Report, available at \(\langle;\) http://arXiv.org/abs/math.CO/\(9907109 \rangle;\), 1999.; D.V. Pasechnik, Bipartite sandwiches, Technical Report, available at \(\langle;\) http://arXiv.org/abs/math.CO/\(9907109 \rangle;\), 1999. · Zbl 1072.05578
[24] R. Peeters, The maximum-edge biclique problem is NP-complete, Technical Report 789, Faculty of Economics and Business Administration, Tilberg University, 2000.; R. Peeters, The maximum-edge biclique problem is NP-complete, Technical Report 789, Faculty of Economics and Business Administration, Tilberg University, 2000. · Zbl 1026.68068
[25] Procopiuc, M.; Jones, M.; Agarwal, P.; Murali, T. M., A Monte Carlo algorithm for fast projective clustering, (Proc. 2002 Internat. Conf. on Management of Data (SIGMOD’02) (2002)), 418-427
[26] Reinert, G.; Schbath, S.; Waterman, M. S., Probabilistic and statistical properties of words: an overview, J. Comput. Biol., 7, 1-46 (2000)
[27] Sheng, Q.; Moreau, Y.; Moor, B. D., Biclustering microarray data by Gibbs sampling, (Proc. of European Conf. on Computational Biology (ECCB’03) (2003)), 196-205
[28] Szpankowski, W., Average Case Analysis of Algorithms on Sequences (2001), Wiley Interscience: Wiley Interscience New York · Zbl 0969.00028
[29] A. Tanay, R. Sharan, R. Shamir, Discovering statistically significant biclusters in gene expression data, in: Proc. 10th Internat. Conf. on Intelligent Systems for Molecular Biology (ISMB’02), Bioinformatics, Vol. 18, 2002, pp. S136-S144.; A. Tanay, R. Sharan, R. Shamir, Discovering statistically significant biclusters in gene expression data, in: Proc. 10th Internat. Conf. on Intelligent Systems for Molecular Biology (ISMB’02), Bioinformatics, Vol. 18, 2002, pp. S136-S144.
[30] Wang, H.; Wang, W.; Yang, J.; Yu, P. S., Clustering by pattern similarity in large data sets, (Proc. 2002 ACM SIGMOD Internat. Conf. on Management of Data (SIGMOD-02) (2002), ACM Press: ACM Press New York), 394-405
[31] J. Yang, H. Wang, W. Wang, P.S. Yu, Enhanced biclustering on gene expression data, in: IEEE Symp. on Bioinformatics and Bioengineering (BIBE’03), 2003, pp. 321-327.; J. Yang, H. Wang, W. Wang, P.S. Yu, Enhanced biclustering on gene expression data, in: IEEE Symp. on Bioinformatics and Bioengineering (BIBE’03), 2003, pp. 321-327.
[32] Zhang, L.; Zhu, S., A new clustering method for microarray data analysis, (Proc. First IEEE Computer Society Bioinformatics Conf. (CSB’02) (2002), IEEE Press: IEEE Press New York), 268-275
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.