×

Sampling frequent and minimal Boolean patterns: theory and application in classification. (English) Zbl 1411.68114

Summary: We tackle the challenging problem of mining the simplest Boolean patterns from categorical datasets. Instead of complete enumeration, which is typically infeasible for this class of patterns, we develop effective sampling methods to extract a representative subset of the minimal Boolean patterns in disjunctive normal form (DNF). We propose a novel theoretical characterization of the minimal DNF expressions, which allows us to prune the pattern search space effectively. Our approach can provide a near-uniform sample of the minimal DNF patterns. We perform an extensive set of experiments to demonstrate the effectiveness of our sampling method. We also show that minimal DNF patterns make effective features for classification.

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62H30 Classification and discrimination; cluster analysis (statistical aspects)
Full Text: DOI

References:

[1] Agrawal, R.; Mannila, H.; Srikant, R.; Toivonen, H.; Verkamo, AI; Fayyad, U. (ed.); etal., Fast discovery of association rules, 307-328 (1996), Menlo Park
[2] Akutsu T, Kuhara S, Maruyama O, Miyano S (1998) Identification of gene regulatory networks by strategic gene disruptions and gene overexpressions. In: ACM-SIAM symposium on discrete algorithms · Zbl 0930.68049
[3] Antonie M-L, Zaiane O (2004) Mining positive and negative association rules: an approach for confined rules. In: European conference on principles and practice of knowledge discovery in databases
[4] Bastide Y, Taouil R, Pasquier N, Stumme G, Lakhal L (2000) Mining frequent patterns with counting inference. SIGKDD Explor 2(2):66-75 · doi:10.1145/380995.381017
[5] Bayardo RJ, Agrawal R (1999) Mining the most interesting rules. In: ACM SIGKDD international conference on knowledge discovery and data mining
[6] Boley M, Gärtner T, Grosskreutz H, Fraunhofer I (2010) Formal concept sampling for counting and threshold-free local pattern mining. In: SIAM data mining conference
[7] Boley M, Grosskreutz H (2009) Approximating the number of frequent sets in dense data. Knowl Inform Syst 21(1):65-89 · doi:10.1007/s10115-009-0212-4
[8] Boley M, Lucchese C, Paurat D, Gärtner T (2011) Direct local pattern sampling by efficient two-step random procedures. In: ACM SIGKDD international conference on knowledge discovery and data mining
[9] Bshouty N (1995) Exact learning boolean functions via the monotone theory. Inform Comput 123(1):146-153 · Zbl 1096.68634 · doi:10.1006/inco.1995.1164
[10] Calders T, Goethals B (2003) Minimal k-free representations of frequent sets. In: European conference on principles and practice of knowledge discovery in databases · Zbl 1020.68566
[11] Calders T, Goethals B (2005) Quick inclusion-exclusion. In: Proceedings ECML-PKDD workshop on knowledge discovery in inductive databases · Zbl 1178.68190
[12] Chang C, Lin C (2011) Libsvm: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):1-39 · doi:10.1145/1961189.1961199
[13] Chaoji V, Hasan MA, Salem S, Besson J, Zaki MJ (2008) ORIGAMI: a novel and effective approach for mining representative orthogonal graph patterns. Stat Anal Data Min 1(2):67-84 · Zbl 07260185 · doi:10.1002/sam.10004
[14] Cowles M, Carlin B (1996) Markov chain monte carlo convergence diagnostics: a comparative review. J Am Stat 91(434):883-904 · Zbl 0869.62066 · doi:10.1080/01621459.1996.10476956
[15] Curk T, Demsar J, Xu Q, Leban G, Petrovic U, Bratko I, Shaulsky G, Zupan B (2005) Microarray data mining with visual programming. Bioinformatics 21(3):396-398 · doi:10.1093/bioinformatics/bth474
[16] Dong G, Jiang C, Pei J, Li J, Wong L (2005) Mining succinct systems of minimal generators of formal concepts. In: International conference database systems for advanced applications
[17] Fayyad U, Irani K (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: Proceedings of the 13th international joint conference on articial intelligence
[18] Frank A, Asuncion A (2010) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences, (http://archive.ics.uci.edu/ml)
[19] Ganter B, Wille R (1999) Formal concept analysis: mathematical foundations. Springer, Berlin · Zbl 0909.06001 · doi:10.1007/978-3-642-59830-2
[20] Goethals B, Zaki MJ (2004) Advances in frequent itemset mining implementations: report on FIMI’03. SIGKDD Explor 6(1):109-117 · doi:10.1145/1007730.1007744
[21] Gunopulos D, Khardon R, Mannila H, Saluja S, Toivonen H, Sharma R (2003) Discovering all most specific sentences. ACM Trans Database Syst 28(2):140-174 · doi:10.1145/777943.777945
[22] Gunopulos D, Mannila H, Saluja S (1997) Discovering all most specific sentences by randomized algorithm. In: 6th international conference on database theory
[23] Hamrouni T, Yahia S Ben, Mephu Nguifo E (2009) Sweeping the disjunctive search space towards mining new exact concise representations of frequent itemsets. Data & Knowl Eng 68(10):1091-1111 · doi:10.1016/j.datak.2009.05.001
[24] Hasan MA, Zaki MJ (2009) Musk: uniform sampling of k maximal patterns. In: 9th SIAM international conference on data mining
[25] Hasan MA, Zaki MJ (2009) Output space sampling for graph patterns. Proc VLDB Endow 2(1):730-741 · doi:10.14778/1687627.1687710
[26] Holte RC, Acker LE, Porter BW (1989) Concept learning and the problem of small disjuncts. In: Proceedings of the eleventh international joint conference on artificial intelligence · Zbl 0709.68057
[27] Jaroszewicz S, Simovici DA (2002) Support approximations using bonferroni-type inequalities. In: 6th European conference on principles of data mining and knowledge discovery · Zbl 1020.68691
[28] Kryszkiewicz M (2001) Concise representation of frequent patterns based on disjunction-free generators. In: IEEE International conference on data mining · Zbl 1048.68809
[29] Kryszkiewicz M (2005) Generalized disjunction-free representation of frequent patterns with negation. J Exp Theor Artif Intell 17(1/2):63-82 · Zbl 1102.68116 · doi:10.1080/09528130512331315882
[30] Li G, Zaki MJ (2012) Sampling minimal frequent boolean (DNF) patterns. In: 18th ACM SIGKDD international conference on knowledge discovery and data mining
[31] Loekito E, Bailey J (2006) Fast mining of high dimensional expressive contrast patterns using zero-suppressed binary decision diagrams. In: ACM SIGKDD international conference on knowledge discovery and data mining
[32] Mannila H, Toivonen H (1996) Multiple uses of frequent sets and condensed representations. In: International conference on knowledge discovery and data mining
[33] Mitchell T (1982) Generalization as search. Artif Intell 18:203-226 · doi:10.1016/0004-3702(82)90040-6
[34] Nanavati A, Chitrapura K, Joshi S, Krishnapuram R (2001) Association rule mining: Mining generalised disjunctive association rules. In: ACM international conference on information and knowledge management
[35] Ramakrishnan N, Kumar D, Mishra B, Potts M, Helm R (Aug. 2004) Turning CARTwheels: an alternating algorithm for mining redescriptions. In: ACM SIGKDD international conference on knowledge discovery and data mining
[36] Rubinstein RY, Kroese DK (2008) Simulation and the Monte Carlo method, 2nd edn. Wiley, New York · Zbl 1147.68831
[37] Savasere A, Omiecinski E, Navathe S (1998) Mining for strong negative associations in a large database of customer transactions. In: IEEE International conference on data engineeging
[38] Shima Y, Mitsuishi S, Hirata K, Harao M (2004) Extracting minimal and closed monotone dnf formulas. In: International conference on discovery science · Zbl 1110.68466
[39] Stumme G, Taouil R, Bastide Y, Pasquier N, Lakhal L (2002) Computing iceberg concept lattices with titanic. Data Knowl Eng 42(2):189-222 · Zbl 0996.68046 · doi:10.1016/S0169-023X(02)00057-5
[40] Veloso A, Meira W, Zaki MJ (2006) Lazy associative classification. In: IEEE International conference on data mining
[41] Vimieiro R, Moscato P (2012) Mining disjunctive minimal generators with titanicor. Expert Syst Appl 39(9):8228-8238 · doi:10.1016/j.eswa.2012.01.141
[42] Vimieiro R, Moscato P (2014) Disclosed: an efficient depth-first, top-down algorithm for mining disjunctive closed itemsets in high-dimensional data. Inform Sci 280:171-187 · Zbl 1355.68235 · doi:10.1016/j.ins.2014.04.044
[43] Vreeken J, Van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Discov 23(1):169-214 · Zbl 1235.68071 · doi:10.1007/s10618-010-0202-x
[44] Wu X, Zhang C, Zhang S (2004) Efficient mining of both positive and negative association rules. ACM Trans Inform Syst 22(3):381-405 · doi:10.1145/1010614.1010616
[45] Yuan X, Buckles BP, Yuan Z, Zhang J (2002) Mining negative association rules. In: 7th international symposium on computers and communications
[46] Zaki M, Ramakrishnan N (2005) Reasoning about sets using redescription mining. In: ACM SIGKDD international conference on knowledge discovery and data mining
[47] Zaki MJ (Aug. 2000) Generating non-redundant association rules. In: 6th ACM SIGKDD international conference on knowledge discovery and data mining
[48] Zaki MJ, Hsiao C-J (2005) Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Trans Knowl Data Eng 17(4):462-478 · doi:10.1109/TKDE.2005.60
[49] Zaki MJ, Ramakrishnan N, Zhao L (2010) Mining frequent boolean expressions: application to gene expression and regulatory modeling. Int J Knowl Discov Bioinform 1(3):68-96 Special issue on mining complex structures in biology · doi:10.4018/jkdb.2010070105
[50] Zhao L, Zaki MJ, Ramakrishnan N (2006) Blosom: a framework for mining arbitrary boolean expressions. In: 12th ACM SIGKDD international conference on knowledge discovery and data mining
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.