×

What to expect from a set of itemsets? (English) Zbl 1535.68313

Summary: Dealing with redundancy is one of the main challenges in frequency based data mining and itemset mining in particular. To tackle this issue in the most objective possible way, we introduce the theoretical bases of a new probabilistic concept: Mutual constrained independence (MCI). Thanks to this notion, we describe a MCI model for the frequencies of all itemsets which is the least binding in terms of model hypotheses defined by the knowledge of the frequencies of some of the itemsets. We provide a method for computing MCI models based on algebraic geometry. We establish the link between MCI models and a class of MaxEnt models which has already known to be used in pattern mining. As such, our research presents further insight on the nature of such models and an entirely novel approach for computing them.

MSC:

68T10 Pattern recognition, speech recognition
14P10 Semialgebraic sets and related spaces
62B11 Information geometry (statistical aspects)
68T05 Learning and adaptive systems in artificial intelligence
94A17 Measures of information, entropy

Software:

OEIS

References:

[1] Aggarwal, C. C., An Introduction to Frequent Pattern Mining, 1-17 (2014), Springer International Publishing
[2] Bacchus, F.; Grove, A. J.; Halpern, J. Y.; Koller, D., From statistical knowledge bases to degrees of belief, Artif. Intell., 87, 1-2, 75-143 (1996) · Zbl 1506.68146
[3] Basu, S.; Pollack, R.; Roy, M.-F., Algorithms in Real Algebraic Geometry (2006), Springer · Zbl 1102.14041
[4] Bauer, G. R.; Scheim, A. I., Advancing quantitative intersectionality research methods: Intracategorical and intercategorical approaches to shared and differential constructs, Soc. Sci. Med., 226, 260-262 (2019)
[5] Bochnak, J.; Coste, M.; Roy, M.-F., Real algebraic geometry, vol. 36 (2013), Springer Science & Business Media · Zbl 0633.14016
[6] Calders, T.; Goethals, B., Mining all non-derivable frequent itemsets, (European Conference on Principles of Data Mining and Knowledge Discovery (2002), Springer), 74-86 · Zbl 1020.68566
[7] Calders, T.; Goethals, B., Non-derivable itemset mining, Data Min. Knowl. Disc., 14, 1, 171-206 (2007)
[8] C. Chow and C. Liu, Approximating discrete probability distributions with dependence trees, IEEE Trans. Inf. Theory 14(3) (1968) 462-467. ISSN 0018-9448. doi: 10.1109/TIT.1968.1054142. · Zbl 0165.22305
[9] T.M. Cover and J.A. Thomas, Elements of information theory, John Wiley & Sons, 2012. ISBN 9781118585771. URL:https://books.google.fr/books?id=VWq5GG6ycxMC. · Zbl 1140.94001
[10] Dalleiger, S.; Vreeken, J., The relaxed maximum entropy distribution and its application to pattern discovery, (Proceedings of the IEEE International Conference on Data Mining (ICDM) (2020))
[11] J.N. Darroch and D. Ratcliff, Generalized iterative scaling for log-linear models, Ann. Math. Stat. (1972) 1470-1480. · Zbl 0251.62020
[12] De Bie, T., Maximum entropy models and subjective interestingness: an application to tiles in binary databases, Data Min. Knowl. Disc., 23, 3, 407-446 (2011) · Zbl 1235.68065
[13] T. Delacroix, Meaningful objective frequency-based interesting pattern mining. PhD thesis, 2021.
[14] Delacroix, T.; Boubekki, A.; Lenca, P.; Lallich, S., Constrained independence for detecting interesting patterns, (2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), IEEE (2015)), 1-10
[15] Fournier-Viger, P.; Lin, J. C.-W.; Vo, B.; Chi, T. T.; Zhang, J.; Le, H. B., A survey of itemset mining, Wiley Interdiscip. Rev.: Data Min. Knowl. Discov., 7, 4, Article e1207 pp. (2017)
[16] L. Geng and H.J. Hamilton, Interestingness measures for data mining: A survey, ACM Comput. Surveys 38(3) (2006). doi: 10.1145/1132960.1132963. URL:http://doi.acm.org/10.1145/1132960.1132963.
[17] A. Gionis, H. Mannila, T. Mielikäinen, and P. Tsaparas, Assessing data mining results via swap randomization, ACM Trans. Knowl. Discov. Data 1(3) (2007) 14-es.
[18] Goldszmidt, M.; Morris, P.; Pearl, J., A maximum entropy approach to nonmonotonic reasoning, IEEE Trans. Pattern Anal. Mach. Intell., 15, 3, 220-232 (1993)
[19] Grove, A. J.; Halpern, J. Y.; Koller, D., Random worlds and maximum entropy, J. Artif. Intell. Res., 2, 33-88 (1994) · Zbl 0900.68398
[20] Halpern, J. Y., An analysis of first-order logics of probability, Artif. Intell., 46, 3, 311-350 (1990) · Zbl 0723.03007
[21] Han, J.; Cheng, H.; Xin, D.; Yan, X., Frequent pattern mining: current status and future directions, Data mining and knowledge discovery, 15, 55-86 (2007)
[22] S. Hanhijärvi, M. Ojala, N. Vuokko, K. Puolamäki, N. Tatti, and H. Mannila, Tell me something I don’t know: randomization strategies for iterative data mining, in: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ’09), pages 379-388. ACM, 2009. ISBN 978-1-60558-495-9. doi: 10.1145/1557019.1557065. URL:http://doi.acm.org/10.1145/1557019.1557065.
[23] Hassaine, A.; Salimi-Khorshidi, G.; Canoy, D.; Rahimi, K., Untangling the complexity of multimorbidity with machine learning, Mechanisms of ageing and development, 190, Article 111325 pp. (2020)
[24] Jaroszewicz, S.; Simovici, D. A., Pruning redundant association rules using maximum entropy principle, (Pacific-Asia Conference on Knowledge Discovery and Data Mining (2002), Springer), 135-147 · Zbl 1048.68788
[25] E.T. Jaynes, On the rationale of maximum-entropy methods, Proc. IEEE 70(9) (1982) 939-952. ISSN 0018-9219. doi: 10.1109/PROC.1982.12425.
[26] Jaynes, E. T., Probability theory: The logic of science (2003), Cambridge University Press · Zbl 1045.62001
[27] Johnston, M. C.; Crilly, M.; Black, C.; Prescott, G. J.; Mercer, S. W., Defining and measuring multimorbidity: a systematic review of systematic reviews, Eur. J. Public Health, 29, 1, 182-189 (2019)
[28] Kuznetsov, S. O.; Makhalova, T., On interestingness measures of formal concepts, Inf. Sci., 442, 202-219 (2018) · Zbl 1440.68282
[29] Le Bras, Y.; Lenca, P.; Lallich, S., Formal framework for the study of algorithmic properties of objective interestingness measures, (Data Mining: Foundations and Intelligent Paradigms (2012), Springer), 77-98 · Zbl 1231.68210
[30] Lenca, P.; Meyer, P.; Vaillant, B.; Lallich, S., On selecting interestingness measures for association rules: User oriented description and multiple criteria decision aid, Eur. J. Oper. Res., 184, 2, 610-626 (2008) · Zbl 1168.90513
[31] Luna, J. M.; Fournier-Viger, P.; Ventura, S., Frequent itemset mining: A 25 years review, Wiley Interdiscip. Rev.: Data Min. Knowl. Discovery, 9, 6, Article e1329 pp. (2019)
[32] Mampaey, M.; Tatti, N.; Vreeken, J., Tell me what i need to know: succinctly summarizing data with itemsets, (Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (2011)), 573-581
[33] M. Mampaey, J. Vreeken, and N. Tatti, Summarizing data succinctly with the most informative itemsets, ACM Trans. Knowl. Discovery Data 6(4) (2012) 16. ISSN 1556-4681. doi: 10.1145/2382577.2382580.
[34] McNamee, J. M.; Pan, V., Numerical Methods for Roots of Polynomials-Part II (2013), Newnes · Zbl 1279.65053
[35] R. Meo, Theory of dependence values, ACM Trans. Database Syst. 25(3) (2000) 380-406. ISSN 0362-5915. doi: 10.1145/363951.363956.
[36] National Plant Data Center, The plants database, 2008. URL:https://archive.ics.uci.edu/ml/datasets/Plants.
[37] Naulaerts, S.; Meysman, P.; Bittremieux, W.; Vu, T. N.; Vanden Berghe, W.; Goethals, B.; Laukens, K., A primer to frequent itemset mining for bioinformatics, Briefings Bioinf., 16, 2, 216-231 (2015)
[38] Nilsson, N. J., Probabilistic logic, Artif. Intell., 28, 1, 71-87 (1986) · Zbl 0589.03007
[39] D.N. Pavlov, H. Mannila, P. Smyth, Beyond independence: Probabilistic models for query approximation on binary transaction data, IEEE Trans. Knowl. Data Eng. 15(6) (2003) 1409-1421. ISSN 1041-4347. doi: 10.1109/TKDE.2003.1245281.
[40] Scott, N. A.; Siltanen, J., Intersectionality and quantitative methods: assessing regression from a feminist perspective, Int. J. Soc. Res. Methodol., 20, 4, 373-385 (2017)
[41] C.E. Shannon, A mathematical theory of communication, Bell Syst. Tech. J. 27(3) (1948) 379-423. ISSN 0005-8580. doi: 10.1002/j.1538-7305.1948.tb01338.x. URL:https://ieeexplore.ieee.org/document/6773024. · Zbl 1154.94303
[42] N.J.A. Sloane, The On-Line Encyclopedia of Integer Sequences. published electronically at https://oeis.org, 2019. · Zbl 1044.11108
[43] Sturmfels, B., Solving systems of polynomial equations. Number 97, Am. Math. Soc. (2002) · Zbl 1101.13040
[44] Szathmary, L.; Napoli, A.; Kuznetsov, S. O., Zart: A multifunctional itemset mining algorithm, Research report (2006), URL:https://hal.inria.fr/inria-00001271 · Zbl 1289.68096
[45] N. Tatti. Computational complexity of queries based on itemsets. Information Processing Letters, 98 (5): 183-187, 2006. ISSN 0020-0190. doi: 10.1016/j.ipl.2006.02.003. · Zbl 1187.68255
[46] N. Tatti, Maximum entropy based significance of itemsets, Knowl. Inf. Syst. 17(1) (2008) 57-77. ISSN 0219-3116. doi: 10.1007/s10115-008-0128-4.
[47] Tatti, N.; Mampaey, M., Using background knowledge to rank itemsets, Data Min. Knowl. Disc., 21, 2, 293-309 (2010)
[48] Tew, C.; Giraud-Carrier, C.; Tanner, K.; Burton, S., Behavior-based clustering and analysis of interestingness measures for association rule mining, Data Min. Knowl. Disc., 28, 4, 1004-1045 (2014) · Zbl 1298.68230
[49] Vreeken, J.; Tatti, N., Interesting Patterns, 105-134 (2014), Springer International Publishing · Zbl 1298.68248
[50] Zaki, M. J.; Hsiao, C.-J., Efficient algorithms for mining closed itemsets and their lattice structure, IEEE Trans. Knowl. Data Eng., 17, 4, 462-478 (2005)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.