Article

Fully automatic cross-associations

Authors:

Deepayan Chakrabarti,

Spiros Papadimitriou,

Dharmendra S. Modha,

Christos FaloutsosAuthors Info & Claims

KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 79 - 88

https://doi.org/10.1145/1014052.1014064

Published: 22 August 2004 Publication History

Abstract

Large, sparse binary matrices arise in numerous data mining applications, such as the analysis of market baskets, web graphs, social networks, co-citations, as well as information retrieval, collaborative filtering, sparse matrix reordering, etc. Virtually all popular methods for the analysis of such matrices---e.g., k-means clustering, METIS graph partitioning, SVD/PCA and frequent itemset mining---require the user to specify various parameters, such as the number of clusters, number of principal components, number of partitions, and "support." Choosing suitable values for such parameters is a challenging problem.Cross-association is a joint decomposition of a binary matrix into disjoint row and column groups such that the rectangular intersections of groups are homogeneous. Starting from first principles, we furnish a clear, information-theoretic criterion to choose a good cross-association as well as its parameters, namely, the number of row and column groups. We provide scalable algorithms to approach the optimal. Our algorithm is parameter-free, and requires no user intervention. In practice it scales linearly with the problem size, and is thus applicable to very large matrices. Finally, we present experiments on multiple synthetic and real-life datasets, where our method gives high-quality, intuitive results.

References

[1]

D. Pelleg and A. Moore, "X-means: Extending K-means with efficient estimation of the number of clusters," in Proc. 17th ICML, pp. 727--734, 2000.]]

Digital Library

[2]

G. Hamerly and C. Elkan, "Learning the k in k-means," in Proc. 17th NIPS, 2003.]]

[3]

B. Zhang, M. Hsu, and U. Dayal, "K-harmonic means---a spatial clustering algorithm with boosting," in Proc. 1st TSDM, pp. 31--45, 2000.]]

Digital Library

[4]

I. S. Dhillon and D. S. Modha, "Concept decom- positions for large sparse text data using clustering," Mach. Learning, vol. 42, pp. 143--175, 2001.]]

Digital Library

[5]

S. Guha, R. Rastogi, and K. Shim, "CURE: an efficient clustering algorithm for large databases," in Proc. SIGMOD, pp. 73--84, 1998.]]

Digital Library

[6]

T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An efficient data clustering method for very large databases," in Proc. SIGMOD, pp. 103--114, 1996.]]

Digital Library

[7]

G. Karypis, E.-H. Han, and V. Kumar, "Chameleon: Hierarchical clustering using dynamic modeling," IEEE Computer, vol. 32, no. 8, pp. 68--75, 1999.]]

Digital Library

[8]

A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proc. 4th KDD, pp. 58--65, 1998.]]

[9]

J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.]]

Digital Library

[10]

I. S. Dhillon, S. Mallela, and D. S. Modha, "Information-theoretic co-clustering," in Proc. 9th KDD, pp. 89--98, 2003.]]

Digital Library

[11]

M. M. Madiman, M. Harrison, and I. Kontoyiannis, "A minimum description length proposal for lossy data compression," in Proc. IEEE ISIT, 2004.]]

[12]

N. Friedman, O. Mosenzon, N. Slonim, and N. Tishby, "Multivariate information bottleneck," in Proc. 17th UAI, pp. 152--161, 2001.]]

Digital Library

[13]

R. Agrawal and R. Srikant, "Fast algorithms for mining association rules in large databases," in Proc. 20th VLDB, pp. 487--499, 1994.]]

Digital Library

[14]

J. Han, J. Pei, Y. Yin, and R. Mao, "Mining frequent patterns without candidate generation: A frequent-pattern tree approach," Data Min. Knowl. Discov., vol. 8, no. 1, pp. 53--87, 2004.]]

Digital Library

[15]

A. Tuzhilin and G. Adomavicius, "Handling very large numbers of association rules in the analysis of microarray data," in Proc. 8th KDD, 2002.]]

Digital Library

[16]

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," JASI, vol. 41, pp. 391--407, 1990.]]

[17]

T. G. Kolda and D. P. O'Leary, "A semidiscrete matrix decomposition for latent semantic indexing information retrieval," ACM Transactions on Information Systems, vol. 16, no. 4, pp. 322--346, 1998.]]

Digital Library

[18]

T. Hofmann, "Probabilistic latent semantic indexing," in Proc. 22nd SIGIR, pp. 50--57, 1999.]]

Digital Library

[19]

C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala, "Latent semantic indexing: A probabilistic analysis," in Proc. 17th PODS, 1998.]]

Digital Library

[20]

G. Karypis and V. Kumar, "Multilevel algorithms for multi-constraint graph partitioning," in Proc. SC98, pp. 1--13, 1998.]]

Digital Library

[21]

A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Proc. NIPS, pp. 849--856, 2001.]]

Digital Library

[22]

N. Mishra, D. Ron, and R. Swaminathan, "On finding large conjunctive clusters," in Proc. 16th COLT, 2003.]]

[23]

P. K. Reddy and M. Kitsuregawa, "An approach to relate the web communities through bipartite graphs," in Proc. 2nd WISE, pp. 302--310, 2001.]]

Digital Library

[24]

C. Tang and A. Zhang, "Mining multiple phenotype structures underlying gene expression profiles," in Proc. CIKM03, pp. 418--425, 2003.]]

Digital Library

[25]

J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, pp. 465--471, 1978.]]

Digital Library

[26]

J. Rissanen, "Generalized Kraft inequality and arithmetic coding," IBM J. Res. Dev., vol. 20, no. 3, pp. 198--203, 1976.]]

Digital Library

[27]

J. Rissanen and G. G. Langdon Jr., "Arithmetic coding," IBM J. Res. Dev., vol. 23, pp. 149--162, 1979.]]

Digital Library

[28]

I. H. Witten, R. Neal, and J. G. Cleary, "Arithmetic coding for data compression," Comm. ACM, vol. 30, no. 6, pp. 520--540, 1987.]]

Digital Library

[29]

J. Rissanen, "Universal prior for integers and estimation by minimum description length," Annals of Statistics, vol. 11, no. 2, pp. 416--431, 1983.]]

[30]

D. J. Watts, Small Worlds: The Dynamics of Networks between Order and Randomness. Princeton Univ. Press, 1999.]]

Digital Library

[31]

M. Richardson, R. Agrawal, and P. Domingos, "Trust management for the semantic web," in Proc. 2nd ISWC, pp. 351--368, 2003.]]

[32]

A. L. Montgomery and C. Faloutsos, "Identifying web browsing trends and patterns," IEEE Computer, vol. 34, no. 7, pp. 94--95, 2001.]]

Digital Library

Cited By

Yang WGuo YWu JWang ZGuo LLi YLiu S(2024)Interactive Reweighting for Mitigating Label Quality IssuesIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2023.334534030:3(1837-1852)Online publication date: Mar-2024
https://doi.org/10.1109/TVCG.2023.3345340
Feng WWang LHooi BNg SLiu S(2024)Interrelated Dense Pattern Detection in Multilayer NetworksIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.339868336:11(6462-6476)Online publication date: Nov-2024
https://doi.org/10.1109/TKDE.2024.3398683
Feng WLiu SCheng X(2023)Hierarchical Dense Pattern Detection in TensorsACM Transactions on Knowledge Discovery from Data10.1145/357702217:6(1-29)Online publication date: 28-Feb-2023
https://dl.acm.org/doi/10.1145/3577022
Show More Cited By

Index Terms

Fully automatic cross-associations
1. Information systems
  1. Information retrieval
    1. Document representation
2. Mathematics of computing
  1. Information theory

Recommendations

Information-theoretic co-clustering
KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

Two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, web-log and market-basket data analysis. A basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and ...
MDL4BMF: Minimum Description Length for Boolean Matrix Factorization

Matrix factorizations—where a given data matrix is approximated by a product of two or more factor matrices—are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, ...
Fully dynamic quasi-biclique edge covers via Boolean matrix factorizations
DyNetMM '13: Proceedings of the Workshop on Dynamic Networks Management and Mining

An important way of summarizing a bipartite graph is to give a set of (quasi-) bicliques that contain (almost) all of its edges. These quasi-bicliques are somewhat similar to clustering of the nodes, giving sets of similar nodes. Unlike clustering, ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining

August 2004

874 pages

ISBN:1581138881

DOI:10.1145/1014052

General Chairs:
Won Kim
Cyber Database Solutions
,
Ronny Kohavi
Amazon.com
,
Program Chairs:
Johannes Gehrke
Cornell University
,
William DuMouchel
AT&T Labs Research

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 August 2004

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Conference

KDD04

Sponsor:

KDD04: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 22 - 25, 2004

WA, Seattle, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

147
Total Citations
View Citations
1,536
Total Downloads

Downloads (Last 12 months)22
Downloads (Last 6 weeks)4

Reflects downloads up to 23 Oct 2024

Other Metrics

View Author Metrics

Citations

Cited By

Yang WGuo YWu JWang ZGuo LLi YLiu S(2024)Interactive Reweighting for Mitigating Label Quality IssuesIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2023.334534030:3(1837-1852)Online publication date: Mar-2024
https://doi.org/10.1109/TVCG.2023.3345340
Feng WWang LHooi BNg SLiu S(2024)Interrelated Dense Pattern Detection in Multilayer NetworksIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.339868336:11(6462-6476)Online publication date: Nov-2024
https://doi.org/10.1109/TKDE.2024.3398683
Feng WLiu SCheng X(2023)Hierarchical Dense Pattern Detection in TensorsACM Transactions on Knowledge Discovery from Data10.1145/357702217:6(1-29)Online publication date: 28-Feb-2023
https://dl.acm.org/doi/10.1145/3577022
Ma HGuan SWang MSong QWu Y(2023)Fair Group Summarization with Graph Patterns2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00154(1982-1994)Online publication date: Apr-2023
https://doi.org/10.1109/ICDE55515.2023.00154
Li JHu JZhang PYang L(2023)Exposing collaborative spammer groups through the review-response graphMultimedia Tools and Applications10.1007/s11042-023-14650-482:14(21687-21700)Online publication date: 17-Feb-2023
https://doi.org/10.1007/s11042-023-14650-4
Ferragina PManzini GGagie TKöppl DNavarro GStriani MTosoni F(2022)Improving matrix-vector multiplication via lossless grammar-compressed matricesProceedings of the VLDB Endowment10.14778/3547305.354732115:10(2175-2187)Online publication date: 1-Jun-2022
https://dl.acm.org/doi/10.14778/3547305.3547321
Jang JPark CJang CKim GKang U(2022)Finding Key Structures in MMORPG Graph with Hierarchical Graph SummarizationACM Transactions on Knowledge Discovery from Data10.1145/352269116:6(1-21)Online publication date: 30-Jul-2022
https://dl.acm.org/doi/10.1145/3522691
Murayama TMatsubara YSakurai YAl Hasan MXiong L(2022)Mining Reaction and Diffusion Dynamics in Social ActivitiesProceedings of the 31st ACM International Conference on Information & Knowledge Management10.1145/3511808.3557396(1521-1531)Online publication date: 17-Oct-2022
https://dl.acm.org/doi/10.1145/3511808.3557396
Galbrun E(2022)The minimum description length principle for pattern mining: a surveyData Mining and Knowledge Discovery10.1007/s10618-022-00846-z36:5(1679-1727)Online publication date: 4-Jul-2022
https://doi.org/10.1007/s10618-022-00846-z
Zhu YXiong J(2021)Modern big data analytics for “Old-fashioned” semiconductor industry applications2015 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)10.1109/ICCAD.2015.7372649(776-780)Online publication date: 10-Mar-2021
https://dl.acm.org/doi/10.1109/ICCAD.2015.7372649
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents