×

Summarizing transactional databases with overlapped hyperrectangles. (English) Zbl 1235.68068

Summary: Transactional data are ubiquitous. Several methods, including frequent itemset mining and co-clustering, have been proposed to analyze transactional databases. In this work, we propose a new research problem to succinctly summarize transactional databases. Solving this problem requires linking the high level structure of the database to a potentially huge number of frequent itemsets. We formulate this problem as a set covering problem using overlapped hyperrectangles (a concept generally regarded as tile according to some existing papers); we then prove that this problem and its several variations are NP-hard, and we further reveal its relationship with the compact representation of a directed bipartite graph. We develop an approximation algorithm Hyper which can achieve a logarithmic approximation ratio in polynomial time. We propose a pruning strategy that can significantly speed up the processing of our algorithm, and we also propose an efficient algorithm Hyper+ to further summarize the set of hyperrectangles by allowing false positive conditions. Additionally, we show that hyperrectangles generated by our algorithms can be properly visualized. A detailed study using both real and synthetic datasets shows the effectiveness and efficiency of our approaches in summarizing transactional databases.

MSC:

68P15 Database theory
68T05 Learning and adaptive systems in artificial intelligence
68Q17 Computational difficulty of problems (lower bounds, completeness, difficulty of approximation, etc.)

Software:

Knapsack
Full Text: DOI

References:

[1] Afrati FN, Gionis A, Mannila H (2004) Approximating a collection of frequent sets. In: KDD, pp 12–19
[2] Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: VLDB, pp 487–499
[3] Agrawal R, Borgida A, Jagadish HV (1989) Efficient management of transitive relationships in large data, knowledge bases. In: SIGMOD Conference, pp 253–262
[4] Agrawal R, Imielinski T, Swami AN (1993) Mining association rules between sets of items in large databases. In: SIGMOD Conference, pp 207–216
[5] Agrawal A, Mannila H, Srikant R, Toivonen H, Verkamo A (1996) Fast discovery of association rules. Adv Knowl Discov Data Min 307–308
[6] Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: SIGMOD Conference, pp 94–105
[7] Besson J, Robardet C, De Raedt L, Boulicaut J-F (2006) Mining bi-sets in numerical data. In: KDID, pp 11–23
[8] Boulicaut J-F, Bykowski A, Rigotti C (2003) Free-sets: a condensed representation of boolean data for the approximation of frequency queries. Data Min Knowl Discov 7(1): 5–22 · doi:10.1023/A:1021571501451
[9] Burdick D, Calimlim M, Flannick J, Gehrke J, Yiu T (2005) Mafia: a maximal frequent itemset algorithm. IEEE Trans Knowl Data Eng 17(11): 1490–1504 · doi:10.1109/TKDE.2005.183
[10] Calders T, Goethals B (2007) Non-derivable itemset mining. Data Min Knowl Discov 14(1): 171–206 · doi:10.1007/s10618-006-0054-6
[11] Chandola V, Kumar V (2007) Summarization–compressing data into an informative representation. Knowl Inf Syst 12(3): 355–378 · doi:10.1007/s10115-006-0039-1
[12] Chvátal V. (1979) A greedy heuristic for the set-covering problem. Math Oper Res 4: 233–235 · Zbl 0443.90066 · doi:10.1287/moor.4.3.233
[13] Faloutsos C, Megalooikonomou V (2007) On data mining, compression, kolmogorov complexity. Data Min Knowl Discov 15(1): 3–20 · doi:10.1007/s10618-006-0057-3
[14] Gao Byron J, Ester M (2006) Turning clusters into patterns: rectangle-based discriminative data description. In: ICDM, pp 200–211
[15] Gao Byron J, Ester M, Cai JY, Schulte O, Xiong H (2007) The minimum consistent subset cover problem, its applications in data mining. In: KDD, pp 310–319
[16] Geerts F, Goethals B, Mielikäinen T (2004) Tiling databases. In: Discovery science, pp 278–289 · Zbl 1110.68373
[17] Gionis A, Mannila H, Seppänen JK (2004) Geometric, combinatorial tiles in 0-1 data. In: PKDD, pp 173–184
[18] Han J, Kamber M (2006) Data mining: concepts, techniques, second edition. Morgan Kaufmann, San Francisco · Zbl 1445.68004
[19] Harper LH (1964) Optimal assignments of numbers to vertices. J Soc Ind Appl Math 12(1): 131–135 · Zbl 0222.94004 · doi:10.1137/0112012
[20] Hartigan JA (1972) Direct clustering of a data matrix. J Am Stat Assoc 67(337): 123–129 · doi:10.1080/01621459.1972.10481214
[21] Jin R, Xiang Y, Fuhry D, Dragan FF (2008) Overlapping matrix pattern visualization: a hypergraph approach. In: ICDM, pp 313–322 (© IEEE, 2008. doi: 10.1109/ICDM.2008.102 )
[22] Jin R, Xiang Y, Liu L (2009) Cartesian contour: a concise representation for a collection of frequent sets. In: KDD, pp 417–426
[23] Johnson D, Krishnan S, Chhugani J, Kumar S, Venkatasubramanian S (2004) Compressing large boolean matrices using reordering techniques. In: VLDB, pp 13–23
[24] Kellerer H, Pferschy U, Pisinger D (2004) Knapsack problems. Springer Verlag, New York · Zbl 1103.90003
[25] Lakshmanan LVS, Ng RT, Wang CX, Zhou X, Johnson TJ (2002) The generalized mdl approach for summarization. In: VLDB, pp 766–777
[26] Li T (2005) A general model for clustering binary data. In: KDD, pp 188–197
[27] Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinf 1(1): 24–45 · doi:10.1109/TCBB.2004.2
[28] Minoux M (1977) Accelerated greedy algorithms for maximizing submodular set functions. In: the 8th IFIP Conference on Optimization Techniques
[29] Mirkin B (1996) Mathematical classification and clustering. Kluwer Academic Publishers, Boston · Zbl 0874.90198
[30] Peeters R (2003) The maximum edge biclique problem is np-complete. Discret Appl Math 131(3): 651–654 · Zbl 1026.68068 · doi:10.1016/S0166-218X(03)00333-0
[31] Pei J, Dong G, Zou W, Han J (2004) Mining condensed frequent-pattern bases. Knowl Inf Syst 6(5): 570–594 · doi:10.1007/s10115-003-0133-6
[32] Richardson TJ, Urbanke RL (2008) Modern coding theory. Cambridge University Press, Cambridge · Zbl 1188.94001
[33] Safro I, Ron D, Brandt A (2006) Graph minimum linear arrangement by multilevel weighted edge contractions. J Algorithms 60(1): 24–41 · Zbl 1096.68687 · doi:10.1016/j.jalgor.2004.10.004
[34] Siebes A, Vreeken J, van Leeuwen M (2006) Item sets that compress. In: SDM · Zbl 1235.68071
[35] Steinbach M, Tan P-N, Kumar V (2004) Support envelopes: a technique for exploring the structure of association patterns. In: KDD, pp 296–305
[36] van Leeuwen M, Vreeken J, Siebes A (2006) Compression picks item sets that matter. In: PKDD, pp 585–592
[37] Vreeken J, van Leeuwen M, Siebes A (2007) Characterising the difference. In: KDD, pp 765–774
[38] Wang J, Karypis G (2006) On efficiently summarizing categorical databases. Knowl Inf Syst 9(1): 19–37 · doi:10.1007/s10115-005-0216-7
[39] Xiang Y, Jin R, Fuhry D, Dragan FF (2008) Succinct summarization of transactional databases: an overlapped hyperrectangle scheme. In: KDD, pp 758–766 (© ACM, 2008. http://doi.acm.org/10.1145/1401890.1401981 )
[40] Xin D, Han J, Yan X, Cheng H (2005) Mining compressed frequent-pattern sets. In: VLDB, pp 709–720
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.