Targeted and contextual redescription set exploration. (English) Zbl 1475.68287

Summary: An important problem in redescription mining is the very large number of redescriptions produced, which makes analysis time-consuming and generally difficult. We present targeted and contextual redescription set exploration, realized through the tool InterSet. The main purpose of the tool is to derive additional knowledge from the redescription set, allowing users to explore the parts of the set that are of interest and to examine redescriptions individually or in a broader context, with the aim of increasing overall understandability. InterSet allows relating and grouping redescriptions, observing distributions of various redescription properties, and selecting appropriate subsets for further, detailed study. This supports gaining knowledge about the underlying data, helps in forming, understanding, and supporting research hypotheses, and assists in understanding one or more redescriptions of interest. The tool provides three different, fully connected interaction modes based on: (1) similarity of entity occurrence in redescription support sets, (2) attribute co-occurrence in redescriptions, and (3) redescription quality measures. Additionally, it allows exploring relations between different redescriptions by creating a graph visualization that includes the top-\(k\) shortest paths containing selected redescriptions. At the level of an individual redescription, it allows studying the value distributions of the described entities for a given set of attributes.
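As a rough illustration of interaction modes (1) and (2), the following minimal sketch computes pairwise similarity of support sets and attribute co-occurrence counts for a toy redescription set. It is not InterSet's implementation: the summary does not state which similarity measure the tool uses, so the Jaccard index is an assumption, and the data, the dictionary layout, and the function names `jaccard`, `support_similarity`, and `attribute_cooccurrence` are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy redescription set (not taken from the paper): each
# redescription has a support set of entity ids and the set of
# attributes that appear in its queries.
redescriptions = {
    "R1": {"support": {1, 2, 3, 4}, "attributes": {"a", "b"}},
    "R2": {"support": {3, 4, 5}, "attributes": {"b", "c"}},
    "R3": {"support": {6, 7}, "attributes": {"a", "c"}},
}


def jaccard(x, y):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def support_similarity(reds):
    """Mode (1), assumed measure: Jaccard index of the support sets
    for every unordered pair of redescriptions."""
    return {
        (p, q): jaccard(reds[p]["support"], reds[q]["support"])
        for p, q in combinations(sorted(reds), 2)
    }


def attribute_cooccurrence(reds):
    """Mode (2): count in how many redescriptions each unordered
    pair of attributes appears together."""
    counts = Counter()
    for red in reds.values():
        for pair in combinations(sorted(red["attributes"]), 2):
            counts[pair] += 1
    return counts


similarity = support_similarity(redescriptions)
cooccurrence = attribute_cooccurrence(redescriptions)
```

A similarity table of this kind could, for instance, serve as edge weights for the graph in which shortest paths between selected redescriptions are searched, although the summary does not say how InterSet builds that graph.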

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T30 Knowledge representation
