-
Fast Redescription Mining Using Locality-Sensitive Hashing
Authors:
Maiju Karjalainen,
Esther Galbrun,
Pauli Miettinen
Abstract:
Redescription mining is a data analysis technique that has found applications in diverse fields. The most used redescription mining approaches involve two phases: finding matching pairs among data attributes and extending the pairs. This process is relatively efficient when the number of attributes remains limited and when the attributes are Boolean, but becomes almost intractable when the data co…
▽ More
Redescription mining is a data analysis technique that has found applications in diverse fields. The most used redescription mining approaches involve two phases: finding matching pairs among data attributes and extending the pairs. This process is relatively efficient when the number of attributes remains limited and when the attributes are Boolean, but becomes almost intractable when the data consist of many numerical attributes. In this paper, we present new algorithms that perform the matching and extension orders of magnitude faster than the existing approaches. Our algorithms are based on locality-sensitive hashing with a tailored approach to handle the discretisation of numerical attributes as used in redescription mining.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
Visualizing Overlapping Biclusterings and Boolean Matrix Factorizations
Authors:
Thibault Marette,
Pauli Miettinen,
Stefan Neumann
Abstract:
Finding (bi-)clusters in bipartite graphs is a popular data analysis approach. Analysts typically want to visualize the clusters, which is simple as long as the clusters are disjoint. However, many modern algorithms find overlapping clusters, making visualization more complicated. In this paper, we study the problem of visualizing \emph{a given clustering} of overlapping clusters in bipartite grap…
▽ More
Finding (bi-)clusters in bipartite graphs is a popular data analysis approach. Analysts typically want to visualize the clusters, which is simple as long as the clusters are disjoint. However, many modern algorithms find overlapping clusters, making visualization more complicated. In this paper, we study the problem of visualizing \emph{a given clustering} of overlapping clusters in bipartite graphs and the related problem of visualizing Boolean Matrix Factorizations. We conceptualize three different objectives that any good visualization should satisfy: (1) proximity of cluster elements, (2) large consecutive areas of elements from the same cluster, and (3) large uninterrupted areas in the visualization, regardless of the cluster membership. We provide objective functions that capture these goals and algorithms that optimize these objective functions. Interestingly, in experiments on real-world datasets, we find that the best trade-off between these competing goals is achieved by a novel heuristic, which locally aims to place rows and columns with similar cluster membership next to each other.
△ Less
Submitted 14 July, 2023;
originally announced July 2023.
-
Differentially Private Tree-Based Redescription Mining
Authors:
Matej Mihelčić,
Pauli Miettinen
Abstract:
Differential privacy provides a strong form of privacy and allows preserving most of the original characteristics of the dataset. Utilizing these benefits requires one to design specific differentially private data analysis algorithms. In this work, we present three tree-based algorithms for mining redescriptions while preserving differential privacy. Redescription mining is an exploratory data an…
▽ More
Differential privacy provides a strong form of privacy and allows preserving most of the original characteristics of the dataset. Utilizing these benefits requires one to design specific differentially private data analysis algorithms. In this work, we present three tree-based algorithms for mining redescriptions while preserving differential privacy. Redescription mining is an exploratory data analysis method for finding connections between two views over the same entities, such as phenotypes and genotypes of medical patients, for example. It has applications in many fields, including some, like health care informatics, where privacy-preserving access to data is desired. Our algorithms are the first differentially private redescription mining algorithms, and we show via experiments that, despite the inherent noise in differential privacy, it can return trustworthy results even in smaller datasets where noise typically has a stronger effect.
△ Less
Submitted 28 February, 2023; v1 submitted 13 December, 2022;
originally announced December 2022.
-
Finding Rule-Interpretable Non-Negative Data Representation
Authors:
Matej Mihelčić,
Pauli Miettinen
Abstract:
Non-negative Matrix Factorization (NMF) is an intensively used technique for obtaining parts-based, lower dimensional and non-negative representation of non-negative data. It is a popular method in different research fields. Scientists performing research in the fields of biology, medicine and pharmacy often prefer NMF over other dimensionality reduction approaches (such as PCA) because the non-ne…
▽ More
Non-negative Matrix Factorization (NMF) is an intensively used technique for obtaining parts-based, lower dimensional and non-negative representation of non-negative data. It is a popular method in different research fields. Scientists performing research in the fields of biology, medicine and pharmacy often prefer NMF over other dimensionality reduction approaches (such as PCA) because the non-negativity of the approach naturally fits the characteristics of the domain problem and its result is easier to analyze and understand. Despite these advantages, it still can be hard to get exact characterization and interpretation of the NMF's resulting latent factors due to their numerical nature. On the other hand, rule-based approaches are often considered more interpretable but lack the parts-based interpretation. In this work, we present a version of the NMF approach that merges rule-based descriptions with advantages of part-based representation offered by the NMF approach. Given the numerical input data with non-negative entries and a set of rules with high entity coverage, the approach creates the lower-dimensional non-negative representation of the input data in such a way that its factors are described by the appropriate subset of the input rules. In addition to revealing important attributes for latent factors, it allows analyzing relations between these attributes and provides the exact numerical intervals or categorical values they take. The proposed approach provides numerous advantages in tasks such as focused embedding or performing supervised multi-label NMF.
△ Less
Submitted 3 June, 2022;
originally announced June 2022.
-
Biclustering and Boolean Matrix Factorization in Data Streams
Authors:
Stefan Neumann,
Pauli Miettinen
Abstract:
We study the clustering of bipartite graphs and Boolean matrix factorization in data streams. We consider a streaming setting in which the vertices from the left side of the graph arrive one by one together with all of their incident edges. We provide an algorithm that, after one pass over the stream, recovers the set of clusters on the right side of the graph using sublinear space; to the best of…
▽ More
We study the clustering of bipartite graphs and Boolean matrix factorization in data streams. We consider a streaming setting in which the vertices from the left side of the graph arrive one by one together with all of their incident edges. We provide an algorithm that, after one pass over the stream, recovers the set of clusters on the right side of the graph using sublinear space; to the best of our knowledge, this is the first algorithm with this property. We also show that after a second pass over the stream, the left clusters of the bipartite graph can be recovered and we show how to extend our algorithm to solve the Boolean matrix factorization problem (by exploiting the correspondence of Boolean matrices and bipartite graphs). We evaluate an implementation of the algorithm on synthetic data and on real-world data. On real-world datasets the algorithm is orders of magnitudes faster than a static baseline algorithm while providing quality results within a factor 2 of the baseline algorithm. Our algorithm scales linearly in the number of edges in the graph. Finally, we analyze the algorithm theoretically and provide sufficient conditions under which the algorithm recovers a set of planted clusters under a standard random graph model.
△ Less
Submitted 5 December, 2020;
originally announced December 2020.
-
Recent Developments in Boolean Matrix Factorization
Authors:
Pauli Miettinen,
Stefan Neumann
Abstract:
The goal of Boolean Matrix Factorization (BMF) is to approximate a given binary matrix as the product of two low-rank binary factor matrices, where the product of the factor matrices is computed under the Boolean algebra. While the problem is computationally hard, it is also attractive because the binary nature of the factor matrices makes them highly interpretable. In the last decade, BMF has rec…
▽ More
The goal of Boolean Matrix Factorization (BMF) is to approximate a given binary matrix as the product of two low-rank binary factor matrices, where the product of the factor matrices is computed under the Boolean algebra. While the problem is computationally hard, it is also attractive because the binary nature of the factor matrices makes them highly interpretable. In the last decade, BMF has received a considerable amount of attention in the data mining and formal concept analysis communities and, more recently, the machine learning and the theory communities also started studying BMF. In this survey, we give a concise summary of the efforts of all of these communities and raise some open questions which in our opinion require further investigation.
△ Less
Submitted 5 December, 2020;
originally announced December 2020.
-
Boolean matrix factorization meets consecutive ones property
Authors:
Nikolaj Tatti,
Pauli Miettinen
Abstract:
Boolean matrix factorization is a natural and a popular technique for summarizing binary matrices. In this paper, we study a problem of Boolean matrix factorization where we additionally require that the factor matrices have consecutive ones property (OBMF). A major application of this optimization problem comes from graph visualization: standard techniques for visualizing graphs are circular or l…
▽ More
Boolean matrix factorization is a natural and a popular technique for summarizing binary matrices. In this paper, we study a problem of Boolean matrix factorization where we additionally require that the factor matrices have consecutive ones property (OBMF). A major application of this optimization problem comes from graph visualization: standard techniques for visualizing graphs are circular or linear layout, where nodes are ordered in circle or on a line. A common problem with visualizing graphs is clutter due to too many edges. The standard approach to deal with this is to bundle edges together and represent them as ribbon. We also show that we can use OBMF for edge bundling combined with circular or linear layout techniques.
We demonstrate that not only this problem is NP-hard but we cannot have a polynomial-time algorithm that yields a multiplicative approximation guarantee (unless P = NP). On the positive side, we develop a greedy algorithm where at each step we look for the best 1-rank factorization. Since even obtaining 1-rank factorization is NP-hard, we propose an iterative algorithm where we fix one side and and find the other, reverse the roles, and repeat. We show that this step can be done in linear time using pq-trees. We also extend the problem to cyclic ones property and symmetric factorizations. Our experiments show that our algorithms find high-quality factorizations and scale well.
△ Less
Submitted 17 January, 2019;
originally announced January 2019.
-
Hybrid ASP-based Approach to Pattern Mining
Authors:
Sergey Paramonov,
Daria Stepanova,
Pauli Miettinen
Abstract:
Detecting small sets of relevant patterns from a given dataset is a central challenge in data mining. The relevance of a pattern is based on user-provided criteria; typically, all patterns that satisfy certain criteria are considered relevant. Rule-based languages like Answer Set Programming (ASP) seem well-suited for specifying such criteria in a form of constraints. Although progress has been ma…
▽ More
Detecting small sets of relevant patterns from a given dataset is a central challenge in data mining. The relevance of a pattern is based on user-provided criteria; typically, all patterns that satisfy certain criteria are considered relevant. Rule-based languages like Answer Set Programming (ASP) seem well-suited for specifying such criteria in a form of constraints. Although progress has been made, on the one hand, on solving individual mining problems and, on the other hand, developing generic mining systems, the existing methods either focus on scalability or on generality. In this paper we make steps towards combining local (frequency, size, cost) and global (various condensed representations like maximal, closed, skyline) constraints in a generic and efficient way. We present a hybrid approach for itemset, sequence and graph mining which exploits dedicated highly optimized mining systems to detect frequent patterns and then filters the results using declarative ASP. To further demonstrate the generic nature of our hybrid framework we apply it to a problem of approximately tiling a database. Experiments on real-world datasets show the effectiveness of the proposed method and computational gains for itemset, sequence and graph mining, as well as approximate tiling.
Under consideration in Theory and Practice of Logic Programming (TPLP).
△ Less
Submitted 22 August, 2018;
originally announced August 2018.
-
Latitude: A Model for Mixed Linear-Tropical Matrix Factorization
Authors:
Sanjar Karaev,
James Hook,
Pauli Miettinen
Abstract:
Nonnegative matrix factorization (NMF) is one of the most frequently-used matrix factorization models in data analysis. A significant reason to the popularity of NMF is its interpretability and the `parts of whole' interpretation of its components. Recently, max-times, or subtropical, matrix factorization (SMF) has been introduced as an alternative model with equally interpretable `winner takes it…
▽ More
Nonnegative matrix factorization (NMF) is one of the most frequently-used matrix factorization models in data analysis. A significant reason to the popularity of NMF is its interpretability and the `parts of whole' interpretation of its components. Recently, max-times, or subtropical, matrix factorization (SMF) has been introduced as an alternative model with equally interpretable `winner takes it all' interpretation. In this paper we propose a new mixed linear--tropical model, and a new algorithm, called Latitude, that combines NMF and SMF, being able to smoothly alternate between the two. In our model, the data is modeled using the latent factors and latent parameters that control whether the factors are interpreted as NMF or SMF features, or their mixtures. We present an algorithm for our novel matrix factorization. Our experiments show that our algorithm improves over both baselines, and can yield interpretable results that reveal more of the latent structure than either NMF or SMF alone.
△ Less
Submitted 18 January, 2018;
originally announced January 2018.
-
Reductions for Frequency-Based Data Mining Problems
Authors:
Stefan Neumann,
Pauli Miettinen
Abstract:
Studying the computational complexity of problems is one of the - if not the - fundamental questions in computer science. Yet, surprisingly little is known about the computational complexity of many central problems in data mining. In this paper we study frequency-based problems and propose a new type of reduction that allows us to compare the complexities of the maximal frequent pattern mining pr…
▽ More
Studying the computational complexity of problems is one of the - if not the - fundamental questions in computer science. Yet, surprisingly little is known about the computational complexity of many central problems in data mining. In this paper we study frequency-based problems and propose a new type of reduction that allows us to compare the complexities of the maximal frequent pattern mining problems in different domains (e.g. graphs or sequences). Our results extend those of Kimelfeld and Kolaitis [ACM TODS, 2014] to a broader range of data mining problems. Our results show that, by allowing constraints in the pattern space, the complexities of many maximal frequent pattern mining problems collapse. These problems include maximal frequent subgraphs in labelled graphs, maximal frequent itemsets, and maximal frequent subsequences with no repetitions. In addition to theoretical interest, our results might yield more efficient algorithms for the studied problems.
△ Less
Submitted 4 September, 2017;
originally announced September 2017.
-
Algorithms for Approximate Subtropical Matrix Factorization
Authors:
Sanjar Karaev,
Pauli Miettinen
Abstract:
Matrix factorization methods are important tools in data mining and analysis. They can be used for many tasks, ranging from dimensionality reduction to visualization. In this paper we concentrate on the use of matrix factorizations for finding patterns from the data. Rather than using the standard algebra -- and the summation of the rank-1 components to build the approximation of the original matr…
▽ More
Matrix factorization methods are important tools in data mining and analysis. They can be used for many tasks, ranging from dimensionality reduction to visualization. In this paper we concentrate on the use of matrix factorizations for finding patterns from the data. Rather than using the standard algebra -- and the summation of the rank-1 components to build the approximation of the original matrix -- we use the subtropical algebra, which is an algebra over the nonnegative real values with the summation replaced by the maximum operator. Subtropical matrix factorizations allow "winner-takes-it-all" interpretations of the rank-1 components, revealing different structure than the normal (nonnegative) factorizations. We study the complexity and sparsity of the factorizations, and present a framework for finding low-rank subtropical factorizations. We present two specific algorithms, called Capricorn and Cancer, that are part of our framework. They can be used with data that has been corrupted with different types of noise, and with different error metrics, including the sum-of-absolute differences, Frobenius norm, and Jensen--Shannon divergence. Our experiments show that the algorithms perform well on data that has subtropical structure, and that they can find factorizations that are both sparse and easy to interpret.
△ Less
Submitted 19 July, 2017;
originally announced July 2017.
-
What You Will Gain By Rounding: Theory and Algorithms for Rounding Rank
Authors:
Stefan Neumann,
Rainer Gemulla,
Pauli Miettinen
Abstract:
When factorizing binary matrices, we often have to make a choice between using expensive combinatorial methods that retain the discrete nature of the data and using continuous methods that can be more efficient but destroy the discrete structure. Alternatively, we can first compute a continuous factorization and subsequently apply a rounding procedure to obtain a discrete representation. But what…
▽ More
When factorizing binary matrices, we often have to make a choice between using expensive combinatorial methods that retain the discrete nature of the data and using continuous methods that can be more efficient but destroy the discrete structure. Alternatively, we can first compute a continuous factorization and subsequently apply a rounding procedure to obtain a discrete representation. But what will we gain by rounding? Will this yield lower reconstruction errors? Is it easy to find a low-rank matrix that rounds to a given binary matrix? Does it matter which threshold we use for rounding? Does it matter if we allow for only non-negative factorizations? In this paper, we approach these and further questions by presenting and studying the concept of rounding rank. We show that rounding rank is related to linear classification, dimensionality reduction, and nested matrices. We also report on an extensive experimental study that compares different algorithms for finding good factorizations under the rounding rank model.
△ Less
Submitted 6 October, 2016; v1 submitted 16 September, 2016;
originally announced September 2016.
-
Hyperbolae Are No Hyperbole: Modelling Communities That Are Not Cliques
Authors:
Saskia Metzler,
Stephan Günnemann,
Pauli Miettinen
Abstract:
Cliques are frequently used to model communities: a community is a set of nodes where each pair is equally likely to be connected. But studying real-world communities reveals that they have more structure than that. In particular, the nodes can be ordered in such a way that (almost) all edges in the community lie below a hyperbola. In this paper we present three new models for communities that cap…
▽ More
Cliques are frequently used to model communities: a community is a set of nodes where each pair is equally likely to be connected. But studying real-world communities reveals that they have more structure than that. In particular, the nodes can be ordered in such a way that (almost) all edges in the community lie below a hyperbola. In this paper we present three new models for communities that capture this phenomenon. Our models explain the structure of the communities differently, but we also prove that they are identical in their expressive power. Our models fit to real-world data much better than traditional block models or previously-proposed hyperbolic models, both of which are a special case of our model. Our models also allow for intuitive interpretation of the parameters, enabling us to summarize the shapes of the communities in graphs effectively.
△ Less
Submitted 4 October, 2016; v1 submitted 15 February, 2016;
originally announced February 2016.
-
On Defining SPARQL with Boolean Tensor Algebra
Authors:
Saskia Metzler,
Pauli Miettinen
Abstract:
The Resource Description Framework (RDF) represents information as subject-predicate-object triples. These triples are commonly interpreted as a directed labelled graph. We propose an alternative approach, interpreting the data as a 3-way Boolean tensor. We show how SPARQL queries - the standard queries for RDF - can be expressed as elementary operations in Boolean algebra, giving us a complete re…
▽ More
The Resource Description Framework (RDF) represents information as subject-predicate-object triples. These triples are commonly interpreted as a directed labelled graph. We propose an alternative approach, interpreting the data as a 3-way Boolean tensor. We show how SPARQL queries - the standard queries for RDF - can be expressed as elementary operations in Boolean algebra, giving us a complete re-interpretation of RDF and SPARQL. We show how the Boolean tensor interpretation allows for new optimizations and analyses of the complexity of SPARQL queries. For example, estimating the size of the results for different join queries becomes much simpler.
△ Less
Submitted 1 March, 2015;
originally announced March 2015.
-
Clustering Boolean Tensors
Authors:
Saskia Metzler,
Pauli Miettinen
Abstract:
Tensor factorizations are computationally hard problems, and in particular, are often significantly harder than their matrix counterparts. In case of Boolean tensor factorizations -- where the input tensor and all the factors are required to be binary and we use Boolean algebra -- much of that hardness comes from the possibility of overlapping components. Yet, in many applications we are perfectly…
▽ More
Tensor factorizations are computationally hard problems, and in particular, are often significantly harder than their matrix counterparts. In case of Boolean tensor factorizations -- where the input tensor and all the factors are required to be binary and we use Boolean algebra -- much of that hardness comes from the possibility of overlapping components. Yet, in many applications we are perfectly happy to partition at least one of the modes. In this paper we investigate what consequences does this partitioning have on the computational complexity of the Boolean tensor factorizations and present a new algorithm for the resulting clustering problem. This algorithm can alternatively be seen as a particularly regularized clustering algorithm that can handle extremely high-dimensional observations. We analyse our algorithms with the goal of maximizing the similarity and argue that this is more meaningful than minimizing the dissimilarity. As a by-product we obtain a PTAS and an efficient 0.828-approximation algorithm for rank-1 binary factorizations. Our algorithm for Boolean tensor clustering achieves high scalability, high similarity, and good generalization to unseen data with both synthetic and real-world data sets.
△ Less
Submitted 4 January, 2015;
originally announced January 2015.
-
Scalable Boolean Tensor Factorizations using Random Walks
Authors:
Dóra Erdős,
Pauli Miettinen
Abstract:
Tensors are becoming increasingly common in data mining, and consequently, tensor factorizations are becoming more and more important tools for data miners. When the data is binary, it is natural to ask if we can factorize it into binary factors while simultaneously making sure that the reconstructed tensor is still binary. Such factorizations, called Boolean tensor factorizations, can provide imp…
▽ More
Tensors are becoming increasingly common in data mining, and consequently, tensor factorizations are becoming more and more important tools for data miners. When the data is binary, it is natural to ask if we can factorize it into binary factors while simultaneously making sure that the reconstructed tensor is still binary. Such factorizations, called Boolean tensor factorizations, can provide improved interpretability and find Boolean structure that is hard to express using normal factorizations. Unfortunately the algorithms for computing Boolean tensor factorizations do not usually scale well. In this paper we present a novel algorithm for finding Boolean CP and Tucker decompositions of large and sparse binary tensors. In our experimental evaluation we show that our algorithm can handle large tensors and accurately reconstructs the latent Boolean structure.
△ Less
Submitted 17 October, 2013;
originally announced October 2013.