×

Introducing the closure structure and the GDPM algorithm for mining and understanding a tabular dataset. (English) Zbl 1529.68262

Summary: Pattern mining is one of the most studied fields in data mining. Being mostly motivated by practitioners, pattern mining algorithms are often based on heuristics and are lacking suitable formalization. In this paper, we are revisiting pattern mining, and especially itemset mining, which allows one to analyze binary datasets in searching for interesting and meaningful itemsets and respective association rules. We introduce a concise representation – the closure structure – based on closed itemsets and their minimum generators (called “passkeys”) for capturing the intrinsic content of a dataset. The closure structure allows one to understand the content of the dataset in terms of closed sets and equivalence classes of itemsets. We discuss theoretical properties of passkeys which are concise representatives of closed itemsets. We propose a formalization of the closure structure and passkeys in terms of Formal Concept Analysis, which is well adapted to studying such elements. Besides theoretical results, we present the GDPM algorithm for enumerating passkeys and discovering the closure structure. GDPM is rather unique as it returns a characterization of a dataset content in terms of complexity levels, highlighting the diversity and the distribution of the itemsets. Finally, some experiments show how the GDPM algorithm and the closure structure can be practically used.

MSC:

68T05 Learning and adaptive systems in artificial intelligence
68T09 Computational aspects of data analysis and big data
68T10 Pattern recognition, speech recognition
68T30 Knowledge representation

Software:

LCM
Full Text: DOI

References:

[1] Albano, Alexandre; Chornomaz, Bogdan, Why concept lattices are large: extremal theory for generators, concepts, and VC-dimension, Int. J. Gen. Syst., 46, 5 (2017)
[2] Andrews, Simon, A partial-closure canonicity test to increase the efficiency of CbO-type algorithms, (Proceedings of the 21st International Conference on Conceptual Structures (2014), Springer), 37-50
[3] Arimura, Hiroki; Uno, Takeaki, Polynomial-delay and polynomial-space algorithms for mining closed sequences, graphs, and pictures in accessible set systems, (Proceedings of the SIAM International Conference on Data Mining (2009), SIAM), 1088-1099
[4] Armstrong, William Ward, Dependency structures of data base relationships, (Proceedings of the 6th IFIP Congress, vol. 74. Proceedings of the 6th IFIP Congress, vol. 74, Geneva, Switzerland (1974)), 580-583 · Zbl 0296.68038
[5] Baixeries, Jaume; Codocedo, Victor; Kaytoue, Mehdi; Napoli, Amedeo, Characterizing approximate-matching dependencies in formal concept analysis with pattern structures, Discrete Appl. Math., 249, 18-27 (2018) · Zbl 1398.68508
[6] Bastide, Yves; Pasquier, Nicolas; Taouil, Rafik; Stumme, Gerd; Lakhal, Lotfi, Mining minimal non-redundant association rules using frequent closed itemsets, (International Conference on Computational Logic (2000), Springer), 972-986 · Zbl 0983.68511
[7] Boley, Mario; Horv́ath, Taḿas; Poigńe, Axel; Wrobel, Stefan, Efficient closed pattern mining in strongly accessible set systems, (Proceedings of the 11th European Conference on Principles and Practice of Data Mining and Knowledge Discovery (2007), Springer), 382-389
[8] Boley, Mario; Horváth, Tamás; Poigné, Axel; Wrobel, Stefan, Efficient closed pattern mining in strongly accessible set systems, (Proceedings of the 5th International Workshop on Mining and Learning with Graphs (2007)) · Zbl 1191.68346
[9] Boley, Mario; Gartner, Thomas; Grosskreutz, Henrik, Formal concept sampling for counting and threshold-free local pattern mining, (Proceedings of the SIAM International Conference on Data Mining (2010), SIAM), 177-188
[10] Boley, Mario; Horváth, Tamás; Poigné, Axel; Wrobel, Stefan, Listing closed sets of strongly accessible set systems with applications to data mining, Theor. Comput. Sci., 411, 3, 691-700 (2010) · Zbl 1191.68346
[11] Boley, Mario; Lucchese, Claudio; Paurat, Daniel; Gärtner, Thomas, Direct local pattern sampling by efficient two-step random procedures, (Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2011), ACM), 582-590
[12] Boley, Mario; Moens, Sandy; Gartner, Thomas, Linear space direct pattern sampling using coupling from the past, (The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2012), ACM), 69-77
[13] Buzmakov, Aleksey; Kuznetsov, Sergei O.; Napoli, Amedeo, Scalable estimates of concept stability, (Proceedings of the 12th International Conference on Formal Concept Analysis (2014), Springer), 157-172 · Zbl 1444.68198
[14] Codocedo, Victor; Lykourentzou, Ioanna; Napoli, Amedeo, A semantic approach to concept lattice-based information retrieval, Ann. Math. Artif. Intell., 72, 1-2, 169-195 (2014) · Zbl 1319.68078
[15] Coenen, Frans, The LUCS-KDD discretised/normalised ARM and CARM data library (2003)
[16] Dushnik, Ben; Miller, Edwin W., Partially ordered sets, Am. J. Math., 63, 3, 600-610 (1941) · Zbl 0025.31002
[17] Ganter, Bernhard; Wille, Rudolf, Formal Concept Analysis-Mathematical Foundations (1999) · Zbl 0909.06001
[18] Guigues, Jean-Louis; Duquenne, Vincent, Familles minimales d’implications informatives résultant d’un tableau de données binaires, Math. Sci. Hum., 95, 5-18 (1986) · Zbl 1504.68217
[19] Gunopulos, Dimitrios; Khardon, Roni; Mannila, Heikki; Saluja, Sanjeev; Toivonen, Hannu; Sharma, Ram Sewak, Discovering all most specific sentences, ACM Trans. Database Syst., 28, 2, 140-174 (2003)
[20] Hamrouni, Tarek; Ben Yahia, Sadok; Nguifo, Engelbert Mephu, Looking for a structural characterization of the sparseness measure of (frequent closed) itemset contexts, Inf. Sci., 222, 343-361 (2013) · Zbl 1293.68260
[21] Han, Jiawei; Pei, Jian; Yin, Yiwen, Mining frequent patterns without candidate generation, SIGMOD Rec., 29, 2, 1-12 (2000)
[22] Hermann, Miki; Sertkaya, Barı̧s, On the complexity of computing generators of closed sets, (Proceedings of the 6th International Conference on Formal Concept Analysis (2008), Springer), 158-168 · Zbl 1131.68537
[23] Hu, Qiong; Imielinski, Tomasz, Alpine: progressive itemset mining with definite guarantees, (Proceedings of the SIAM International Conference on Data Mining (2017), SIAM), 63-71
[24] Janostik, Radek; Konecny, Jan; Krajča, Petr, LCM is well implemented CbO: study of LCM from FCA point of view, (Proceedings of the 15th International Conference on Conceptual Structures (2020), Tallinn University of Technology), 47-58 · Zbl 1441.68246
[25] Kaytoue, Mehdi; Kuznetsov, Sergei O.; Napoli, Amedeo; Duplessis, Sébastien, Mining gene expression data with pattern structures in formal concept analysis, Inf. Sci., 181, 10, 1989-2001 (2011)
[26] Klimushkin, Mikhail; Obiedkov, Sergei; Roth, Camille, Approaches to the selection of relevant concepts in the case of noisy data, (Proceedings of the 8th International Conference on Formal Concept Analysis (2010), Springer), 255-266 · Zbl 1274.68489
[27] Krajca, Petr; Vychodil, Vilem, Comparison of data structures for computing formal concepts, (Proceedings of the 6th International Conference on Modeling Decisions for Artificial Intelligence (2009), Springer), 114-125 · Zbl 1273.68379
[28] Kuznetsov, Sergei O., Stability as an estimate of degree of substantiation of hypotheses derived on the basis of operational similarity, Nauch.-Tekh. Inf. Ser. 2, 12, 21-29 (1990)
[29] Kuznetsov, Sergei O., A fast algorithm for computing all intersections of objects from an arbitrary semilattice, Nauch.-Tekh. Inf. Ser. 2, 1, 17-20 (1993)
[30] Kuznetsov, Sergei O., On stability of a formal concept, Ann. Math. Artif. Intell., 49, 1-4, 101-115 (2007) · Zbl 1129.68086
[31] Kuznetsov, Sergei O.; Obiedkov, Sergei A., Comparing performance of algorithms for generating concept lattices, J. Exp. Theor. Artif. Intell., 14, 2-3, 189-216 (2002) · Zbl 1024.68020
[32] Li, Jinyan; Li, Haiquan; Wong, Limsoon; Pei, Jian; Dong, Guozhu, Minimum description length principle: generators are preferable to closed patterns, (AAAI (2006)), 409-414
[33] Makhalova, Tatiana; Kuznetsov, Sergei O.; Napoli, Amedeo, Gradual discovering of closeness structure in lattices, (Proceedings of the 15th International Conference on Concept Lattices and Their Applications (2020)) · Zbl 1529.68263
[34] Mampaey, Michael; Vreeken, Jilles; Tatti, Nikolaj, Summarizing data succinctly with the most informative itemsets, ACM Trans. Knowl. Discov. Data, 6, 4, 16 (2012)
[35] Pasquier, Nicolas; Bastide, Yves; Taouil, Rafik; Lakhal, Lotfi, Discovering frequent closed itemsets for association rules, (Proceedings of the 7th International Conference on Database Theory (1999), Springer), 398-416
[36] Pasquier, Nicolas; Bastide, Yves; Taouil, Rafik; Lakhal, Lotfi, Efficient mining of association rules using closed itemset lattices, Inf. Sci., 24, 1, 25-46 (1999)
[37] Pasquier, Nicolas; Taouil, Rafik; Bastide, Yves; Stumme, Gerd; Lakhal, Lotfi, Generating a condensed representation for association rules, J. Intell. Inf. Syst., 24, 1, 29-60 (2005) · Zbl 1071.68022
[38] Smets, Koen; Vreeken, Jilles, Slim: directly mining descriptive patterns, (Proceedings of the 12 SIAM International Conference on Data Mining. Proceedings of the 12 SIAM International Conference on Data Mining, Anaheim, California (2012)), 236-247
[39] Stumme, Gerd; Taouil, Rafik; Bastide, Yves; Pasquier, Nicolas; Lakhal, Lotfi, Computing iceberg concept lattices with Titanic, Data Knowl. Eng., 42, 2, 189-222 (2002) · Zbl 0996.68046
[40] Szathmary, Laszlo; Valtchev, Petko; Napoli, Amedeo; Godin, Robert, Efficient vertical mining of frequent closures and generators, (Proceedings of the 8th International Symposium on Intelligent Data Analysis (2009), Springer), 393-404
[41] Uno, Takeaki; Asai, Tatsuya; Uchida, Yuzo; Arimura, Hiroki, An efficient algorithm for enumerating closed patterns in transaction databases, (Proceedings of the 7th International Conference on Discovery Science (2004), Springer), 16-31 · Zbl 1110.68472
[42] Uno, Takeaki; Kiyomi, Masashi; Arimura, Hiroki, LCM ver. 2: efficient mining algorithms for frequent/closed/maximal itemsets, (Proceedings of the IEEE CDM Workshop on Frequent Itemset Mining Implementations, vol. 126 (2004))
[43] Vapnik, Vladimir Naumovich; Chervonenkis, Aleksei Yakovlevich, On uniform convergence of the frequencies of events to their probabilities, Teor. Veroâtn. Primen., 16, 2, 264-279 (1971) · Zbl 0247.60005
[44] Vreeken, Jilles; Tatti, Nikolaj, Interesting patterns, (Aggarwal, Charu C.; Han, Jiawei, Frequent Pattern Mining (2014), Springer), 105-134 · Zbl 1298.68248
[45] Zaki, Mohammed J., Generating non-redundant association rules, (Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2000)), 34-43
[46] Zaks, Shmuel, Lexicographic generation of ordered trees, Theor. Comput. Sci., 10, 1, 63-82 (1980) · Zbl 0422.05026
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.