×

New upper bounds for tight and fast approximation of Fisher’s exact test in dependency rule mining. (English) Zbl 1468.62072

Summary: In the dependency rule mining, the goal is to discover the most significant statistical dependencies among all possible collapsed \(2 \times 2\) contingency tables. Fisher’s exact test is a robust method to estimate the significance and it enables efficient pruning of the search space. The problem is that evaluating the required \(p\)-value can be very laborious and the worst case time complexity is \(\mathcal{O}(n)\), where \(n\) is the data size. The traditional solution is to approximate the significance with the \(\chi^2\)-measure, which can be estimated in a constant time. However, the \(\chi^2\)-measure can produce unreliable results (discover spurious dependencies but miss the most significant dependencies). Furthermore, it does not support efficient pruning of the search space. As a solution, a family of tight upper bounds for Fisher’s \(p\) is introduced. The new upper bounds are fast to calculate and approximate Fisher’s \(p\)-value accurately. In addition, the new approximations are not sensitive to the data size, distribution, or smallest expected counts like the \(\chi^2\)-based approximation. In practice, the execution time depends on the desired accuracy level. According to experimental evaluation, the simplest upper bounds are already sufficiently accurate for dependency rule mining purposes and they can be estimated in 0.004-0.1% of the time needed for exact calculation. For other purposes (testing very weak dependencies), one may need more accurate approximations, but even they can be calculated in less than 1% of the exact calculation time.

MSC:

62-08 Computational methods for problems pertaining to statistics
62H17 Contingency tables

Software:

gprof; UCI-ml
Full Text: DOI

References:

[1] Agrawal, R.; Imielinski, T.; Swami, A., Mining association rules between sets of items in large databases, (Int. Conf. Management of Data, (1993), ACM), 207-216
[2] Agresti, A., A survey of exact inference for contingency tables, Statist. Sci., 7, 1, 131-153, (1992) · Zbl 0955.62587
[3] Andrés, A. M.; Mato, A. S.; García, J. T.; Quevedo, M. S., Comparing the asymptotic power of exact tests in 2×2 tables, Comput. Statist. Data Anal., 47, 4, 745-756, (2004) · Zbl 1429.62205
[4] Antonie, M.-L.; Zaïane, O., Mining positive and negative association rules: an approach for confined rules, (8th Eur. Conf. Principles and Practice of Knowledge Discovery in Databases, (2004), Springer), 27-38
[5] erfc (software). ECE44 Laboratory, University of Illinois Urbana-Champaign. Retrieved 1.6.2014. http://fabweb.ece.illinois.edu/utilities/erfc/.
[6] FIMI Repository (data collection). Administered by B. Goethals. Retrieved April 2010. http://fimi.ua.ac.be/data/.
[7] Gnu profiler (gprof, software). Copyright 2009 Free Software Foundation, Inc. https://sourceware.org/binutils/docs/gprof/.
[8] Hämäläinen, W., Statapriori: an efficient algorithm for searching statistically significant association rules, Knowl. Inf. Syst., 23, 3, 373-399, (2010)
[9] Hämäläinen, W., Efficient search methods for statistical dependency rules, Fund. Inform., 113, 2, 117-150, (2011) · Zbl 1435.62206
[10] Hämäläinen, W., Kingfisher: an efficient algorithm for searching for both positive and negative dependency rules with statistical significance measures, Knowl. Inf. Syst., 32, 2, 383-414, (2012)
[11] Li, J.; Le, T. D.; Liu, L.; Liu, J.; Jin, Z.; Sun, B., Mining causal association rules, (13th Int. Conf. Data Mining Workshops, (2013), IEEE), 114-123
[12] Lydersen, S.; Fagerland, M.; Laake, P., Recommended tests for association in 2×2 tables, Stat. Med., 28, 7, 1159-1175, (2009)
[13] Mehta, C.; Patel, N., A network algorithm for the exact treatment of the 2 \(\times\) k contingency table, Comm. Statist., 9, 6, 649-664, (1980) · Zbl 0452.62041
[14] Mehta, C.; Patel, N., A network algorithm for performing fisher’s exact test in r \(\times\) c contingency tables, J. Amer. Statist. Assoc., 78, 382, 427-434, (1983) · Zbl 0545.62039
[15] Morishita, S.; Sese, J., Transversing itemset lattices with statistical metric pruning, (19th Symp. Principles of Database Systems, (2000), ACM), 226-236
[16] Nijssen, S.; Guns, T.; Raedt, L. D., Correlated itemset mining in ROC space: a constraint programming approach, (15th Conf. Knowledge Discovery and Data Mining, (2009), ACM), 647-656
[17] Online probability distributions (software). Retrieved 1.6. 2014. http://www.xuru.org/st/PD.asp.
[18] Requena, F.; Ciudad, N. M., A major improvement to the network algorithm for fisher’s exact test in 2 \(\times\) c contingency tables, Comput. Statist. Data Anal., 51, 2, 490-498, (2006) · Zbl 1157.62420
[19] Sugiyama, M.; López, L.; Kasenburg, N.; Borgwardt, K., Significant subgraph mining with multiple testing correction, (Int. Conf. Data Mining, (2015), SIAM), 37-45
[20] UCI Machine Learning Repository (data collection), 2013. Administered by M. Lichman. http://archive.ics.uci.edu/ml.
[21] Verbeek, A.; Kroonenberg, P., A survey of algorithms for exact distributions of test statistics in r \(\times\) c contingency tables with fixed margins, Comput. Statist. Data Anal., 3, 159-185, (1985) · Zbl 0586.62083
[22] Webb, G.; Zhang, S., K-optimal rule discovery, Data Min. Knowl. Discov., 10, 1, 39-79, (2005)
[23] Wu, T., An accurate computation of the hypergeometric distribution function, ACM Trans. Math. Software, 19, 1, 33-43, (1993) · Zbl 0889.65148
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.