skip to main content
Article

Fully automatic cross-associations

Published: 22 August 2004 Publication History

Abstract

Large, sparse binary matrices arise in numerous data mining applications, such as the analysis of market baskets, web graphs, social networks, co-citations, as well as information retrieval, collaborative filtering, sparse matrix reordering, etc. Virtually all popular methods for the analysis of such matrices---e.g., k-means clustering, METIS graph partitioning, SVD/PCA and frequent itemset mining---require the user to specify various parameters, such as the number of clusters, number of principal components, number of partitions, and "support." Choosing suitable values for such parameters is a challenging problem.Cross-association is a joint decomposition of a binary matrix into disjoint row and column groups such that the rectangular intersections of groups are homogeneous. Starting from first principles, we furnish a clear, information-theoretic criterion to choose a good cross-association as well as its parameters, namely, the number of row and column groups. We provide scalable algorithms to approach the optimal. Our algorithm is parameter-free, and requires no user intervention. In practice it scales linearly with the problem size, and is thus applicable to very large matrices. Finally, we present experiments on multiple synthetic and real-life datasets, where our method gives high-quality, intuitive results.

References

[1]
D. Pelleg and A. Moore, "X-means: Extending K-means with efficient estimation of the number of clusters," in Proc. 17th ICML, pp. 727--734, 2000.]]
[2]
G. Hamerly and C. Elkan, "Learning the k in k-means," in Proc. 17th NIPS, 2003.]]
[3]
B. Zhang, M. Hsu, and U. Dayal, "K-harmonic means---a spatial clustering algorithm with boosting," in Proc. 1st TSDM, pp. 31--45, 2000.]]
[4]
I. S. Dhillon and D. S. Modha, "Concept decom- positions for large sparse text data using clustering," Mach. Learning, vol. 42, pp. 143--175, 2001.]]
[5]
S. Guha, R. Rastogi, and K. Shim, "CURE: an efficient clustering algorithm for large databases," in Proc. SIGMOD, pp. 73--84, 1998.]]
[6]
T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An efficient data clustering method for very large databases," in Proc. SIGMOD, pp. 103--114, 1996.]]
[7]
G. Karypis, E.-H. Han, and V. Kumar, "Chameleon: Hierarchical clustering using dynamic modeling," IEEE Computer, vol. 32, no. 8, pp. 68--75, 1999.]]
[8]
A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proc. 4th KDD, pp. 58--65, 1998.]]
[9]
J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.]]
[10]
I. S. Dhillon, S. Mallela, and D. S. Modha, "Information-theoretic co-clustering," in Proc. 9th KDD, pp. 89--98, 2003.]]
[11]
M. M. Madiman, M. Harrison, and I. Kontoyiannis, "A minimum description length proposal for lossy data compression," in Proc. IEEE ISIT, 2004.]]
[12]
N. Friedman, O. Mosenzon, N. Slonim, and N. Tishby, "Multivariate information bottleneck," in Proc. 17th UAI, pp. 152--161, 2001.]]
[13]
R. Agrawal and R. Srikant, "Fast algorithms for mining association rules in large databases," in Proc. 20th VLDB, pp. 487--499, 1994.]]
[14]
J. Han, J. Pei, Y. Yin, and R. Mao, "Mining frequent patterns without candidate generation: A frequent-pattern tree approach," Data Min. Knowl. Discov., vol. 8, no. 1, pp. 53--87, 2004.]]
[15]
A. Tuzhilin and G. Adomavicius, "Handling very large numbers of association rules in the analysis of microarray data," in Proc. 8th KDD, 2002.]]
[16]
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," JASI, vol. 41, pp. 391--407, 1990.]]
[17]
T. G. Kolda and D. P. O'Leary, "A semidiscrete matrix decomposition for latent semantic indexing information retrieval," ACM Transactions on Information Systems, vol. 16, no. 4, pp. 322--346, 1998.]]
[18]
T. Hofmann, "Probabilistic latent semantic indexing," in Proc. 22nd SIGIR, pp. 50--57, 1999.]]
[19]
C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala, "Latent semantic indexing: A probabilistic analysis," in Proc. 17th PODS, 1998.]]
[20]
G. Karypis and V. Kumar, "Multilevel algorithms for multi-constraint graph partitioning," in Proc. SC98, pp. 1--13, 1998.]]
[21]
A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Proc. NIPS, pp. 849--856, 2001.]]
[22]
N. Mishra, D. Ron, and R. Swaminathan, "On finding large conjunctive clusters," in Proc. 16th COLT, 2003.]]
[23]
P. K. Reddy and M. Kitsuregawa, "An approach to relate the web communities through bipartite graphs," in Proc. 2nd WISE, pp. 302--310, 2001.]]
[24]
C. Tang and A. Zhang, "Mining multiple phenotype structures underlying gene expression profiles," in Proc. CIKM03, pp. 418--425, 2003.]]
[25]
J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, pp. 465--471, 1978.]]
[26]
J. Rissanen, "Generalized Kraft inequality and arithmetic coding," IBM J. Res. Dev., vol. 20, no. 3, pp. 198--203, 1976.]]
[27]
J. Rissanen and G. G. Langdon Jr., "Arithmetic coding," IBM J. Res. Dev., vol. 23, pp. 149--162, 1979.]]
[28]
I. H. Witten, R. Neal, and J. G. Cleary, "Arithmetic coding for data compression," Comm. ACM, vol. 30, no. 6, pp. 520--540, 1987.]]
[29]
J. Rissanen, "Universal prior for integers and estimation by minimum description length," Annals of Statistics, vol. 11, no. 2, pp. 416--431, 1983.]]
[30]
D. J. Watts, Small Worlds: The Dynamics of Networks between Order and Randomness. Princeton Univ. Press, 1999.]]
[31]
M. Richardson, R. Agrawal, and P. Domingos, "Trust management for the semantic web," in Proc. 2nd ISWC, pp. 351--368, 2003.]]
[32]
A. L. Montgomery and C. Faloutsos, "Identifying web browsing trends and patterns," IEEE Computer, vol. 34, no. 7, pp. 94--95, 2001.]]

Cited By

View all
  • (2024)Interactive Reweighting for Mitigating Label Quality IssuesIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2023.334534030:3(1837-1852)Online publication date: Mar-2024
  • (2024)Interrelated Dense Pattern Detection in Multilayer NetworksIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.339868336:11(6462-6476)Online publication date: Nov-2024
  • (2023)Hierarchical Dense Pattern Detection in TensorsACM Transactions on Knowledge Discovery from Data10.1145/357702217:6(1-29)Online publication date: 28-Feb-2023
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2004
874 pages
ISBN:1581138881
DOI:10.1145/1014052
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 August 2004

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. MDL
  2. cross-association
  3. information theory

Qualifiers

  • Article

Conference

KDD04

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)22
  • Downloads (Last 6 weeks)4
Reflects downloads up to 23 Oct 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Interactive Reweighting for Mitigating Label Quality IssuesIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2023.334534030:3(1837-1852)Online publication date: Mar-2024
  • (2024)Interrelated Dense Pattern Detection in Multilayer NetworksIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.339868336:11(6462-6476)Online publication date: Nov-2024
  • (2023)Hierarchical Dense Pattern Detection in TensorsACM Transactions on Knowledge Discovery from Data10.1145/357702217:6(1-29)Online publication date: 28-Feb-2023
  • (2023)Fair Group Summarization with Graph Patterns2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00154(1982-1994)Online publication date: Apr-2023
  • (2023)Exposing collaborative spammer groups through the review-response graphMultimedia Tools and Applications10.1007/s11042-023-14650-482:14(21687-21700)Online publication date: 17-Feb-2023
  • (2022)Improving matrix-vector multiplication via lossless grammar-compressed matricesProceedings of the VLDB Endowment10.14778/3547305.354732115:10(2175-2187)Online publication date: 1-Jun-2022
  • (2022)Finding Key Structures in MMORPG Graph with Hierarchical Graph SummarizationACM Transactions on Knowledge Discovery from Data10.1145/352269116:6(1-21)Online publication date: 30-Jul-2022
  • (2022)Mining Reaction and Diffusion Dynamics in Social ActivitiesProceedings of the 31st ACM International Conference on Information & Knowledge Management10.1145/3511808.3557396(1521-1531)Online publication date: 17-Oct-2022
  • (2022)The minimum description length principle for pattern mining: a surveyData Mining and Knowledge Discovery10.1007/s10618-022-00846-z36:5(1679-1727)Online publication date: 4-Jul-2022
  • (2021)Modern big data analytics for “Old-fashioned” semiconductor industry applications2015 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)10.1109/ICCAD.2015.7372649(776-780)Online publication date: 10-Mar-2021
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media