skip to main content
research-article

An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets

Published: 01 June 2012 Publication History

Abstract

As advances in technology allow for the collection, storage, and analysis of vast amounts of data, the task of screening and assessing the significance of discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel methodology to identify a meaningful support threshold s* for a dataset, such that the number of itemsets with support at least s* represents a substantial deviation from what would be expected in a random dataset with the same number of transactions and the same individual item frequencies. These itemsets can then be flagged as statistically significant with a small false discovery rate. We present extensive experimental results to substantiate the effectiveness of our methodology.

References

[1]
Aggarwal, C. and Yu, P. 1998. A new framework for itemset generation. In Proceedings of the 17th ACM Symposium on Principles of Database Systems. 18--24.
[2]
Agrawal, R., Imielinski, T., and Swami, A. 1993. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 207--216.
[3]
Arratia, R., Goldstein, L., and Gordon, L. 1990. Poisson approximation and the Chen-Stein method. Stat. Sci. 5, 4, 403--434.
[4]
Benjamini, Y. and Hochberg, Y. 1995. Controlling the false discovery rate. J. Roy. Stat. Soc., Ser. B 57, 289--300.
[5]
Benjamini, Y. and Yekutieli, D. 2001. The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 29, 4, 1165--1188.
[6]
Bolton, R., Hand, D., and Adams, N. 2002. Determining hit rate in pattern search. In Proceedings of Pattern Detection and Discovery. Lecture Notes in Artificial Intelligence, vol. 2447, 36--48.
[7]
Dudoit, S., Shaffer, J. P., and Boldrick, J. C. 2003. Multiple hypothesis testing in microarray experiments. Stat. Sci. 18, 1, 71--103.
[8]
DuMouchel, W. 1999. Bayesian data mining in large frequency tables, with an application to the FDA spontaneous reporting system. Amer. Statistician 53, 177--202.
[9]
DuMouchel, W. and Pregibon, D. 2001. Empirical bayes screening for multi-item associations. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’01) 67--76.
[10]
Gionis, A., Mannila, H., Mielikäinen, T., and Tsaparas, P. 2006. Assessing data mining results via swap randomization. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 167--176.
[11]
Goethals, B., and Zaki, M. J., Eds. 2003. Proceedings of the 1st Workshop on Frequent Itemset Mining Implementations (FIMI03). Vol. 90, CEUR-WS Workshop Online Proceedings.
[12]
Goethals, B., Bayardo, R., and Zaki, M. J., Eds. 2004. Proceedings of the 2nd Workshop on Frequent Itemset Mining Implementations (FIMI04). Vol. 126, CEUR-WS Workshop Online Proceedings.
[13]
Hämäläinen, W. and Nykänen, M. 2008. Efficient discovery of statistically significant association rules. In Proceedings of the 8th IEEE International Conference on Data Mining. 203--212.
[14]
Han, J., and Kamber, M. 2001. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Mateo, CA.
[15]
Han, J., Cheng, H., Xin, D., and Yan, X. 2007. Frequent pattern mining: Current status and future directions. Data Mining Knowl. Disc. 15, 1, 55--86.
[16]
Jaroszewicz, S. and Scheffer, T. 2005. Fast discovery of unexpected patterns in data, relative to a Bayesian. In Proceedings of the 11th International Conference on Knowledge Discovery and Data Mining. 118--127.
[17]
Megiddo, N. and Srikant, R. 1998. Discovering predictive association rules. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. 274--278.
[18]
Mitzenmacher, M., and Upfal, E. 2005. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, Cambridge UK.
[19]
Pasquier, N., Bastide, Y., Taouil, R., and Lakhal, L. 1999. Discovering frequent closed itemsets for association rules. In Proceedings of the 7th International Conference on Database Theory. 398--416.
[20]
Purdom, P. W., Gucht, D. V., and Groth, D. P. 2004. Average case performance of the apriori algorithm. SIAM J. on Comput. 33, 5, 1223--1260.
[21]
Sayrafi, B., Gucht, D. V., and Purdom, P. W. 2005. On the effectiveness and efficiency of computing bounds on the support of item-sets in the frequent item-sets mining problem. In Proceedings of the 1st International Workshop on Open Source Data Mining. 46--55.
[22]
Silberschatz, A. and Tuzhilin, A. 1996. What makes patterns interesting in knowledge discovery systems. IEEE Trans. Knowl. Data Eng. 8, 6, 970--974.
[23]
Silverstein, C., Brin, S., and Motwani, R. 1998. Beyond market baskets: Generalizing association rules to dependence rules. Data Mining Knowl. Disc. 2, 1, 39--68.
[24]
Srikant, R., and Agrawal, R. 1996. Mining quantitative association rules in large relational tables. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1--12.
[25]
Tan, P., Steinbach, M., and Kumar, V. 2006. Introduction to Data Mining. Addison Wesley.
[26]
Xin, D., Han, J., Yan, X., and Cheng, H. 2005. Mining compressed frequent-pattern sets. In Proceedings of the 31st Very Large Data Base Conference. 709--720.
[27]
Zhang, H., Padmanabhan, B., and Tuzhilin, A. 2004. On the discovery of significant statistical quantitative rules. In Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining. 374--383.

Cited By

View all
  • (2023)Ontology-based data interestingness: A state-of-the-art reviewNatural Language Processing Journal10.1016/j.nlp.2023.1000214(100021)Online publication date: Sep-2023
  • (2022)Discovering Significant Patterns under Sequential False Discovery ControlProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3534678.3539398(263-272)Online publication date: 14-Aug-2022
  • (2022)MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern MiningACM Transactions on Knowledge Discovery from Data10.1145/353218716:6(1-29)Online publication date: 30-Jul-2022
  • Show More Cited By

Index Terms

  1. An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image Journal of the ACM
    Journal of the ACM  Volume 59, Issue 3
    June 2012
    180 pages
    ISSN:0004-5411
    EISSN:1557-735X
    DOI:10.1145/2220357
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 June 2012
    Accepted: 01 February 2012
    Revised: 01 February 2012
    Received: 01 November 2009
    Published in JACM Volume 59, Issue 3

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Frequent itemset mining
    2. Poisson approximation
    3. false discovery rate
    4. multi-hypothesis test
    5. statistical significance

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)14
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 23 Oct 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)Ontology-based data interestingness: A state-of-the-art reviewNatural Language Processing Journal10.1016/j.nlp.2023.1000214(100021)Online publication date: Sep-2023
    • (2022)Discovering Significant Patterns under Sequential False Discovery ControlProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3534678.3539398(263-272)Online publication date: 14-Aug-2022
    • (2022)MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern MiningACM Transactions on Knowledge Discovery from Data10.1145/353218716:6(1-29)Online publication date: 30-Jul-2022
    • (2020)MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern MiningProceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining10.1145/3394486.3403267(2165-2174)Online publication date: 23-Aug-2020
    • (2019)Hypothesis Testing and Statistically-sound Pattern MiningProceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining10.1145/3292500.3332286(3215-3216)Online publication date: 25-Jul-2019
    • (2019)Permutation Strategies for Mining Significant Sequential Patterns2019 IEEE International Conference on Data Mining (ICDM)10.1109/ICDM.2019.00169(1330-1335)Online publication date: Nov-2019
    • (2019)Significance-based discriminative sequential pattern miningExpert Systems with Applications10.1016/j.eswa.2018.12.046122(54-64)Online publication date: May-2019
    • (2018)NorthstarProceedings of the VLDB Endowment10.14778/3229863.324049311:12(2150-2164)Online publication date: 1-Aug-2018
    • (2018)Set Similarity Search for Skewed DataProceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3196959.3196985(63-74)Online publication date: 27-May-2018
    • (2018)BSigData Mining and Knowledge Discovery10.1007/s10618-017-0521-232:1(124-161)Online publication date: 1-Jan-2018
    • Show More Cited By

    View Options

    Get Access

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media