
QirK: Question Answering via Intermediate Representation on Knowledge Graphs

Authors: Jan Luca Scheerer, Anton Lykov, Moe Kayali, Ilias Fountalis, Dan Olteanu, Nikolaos Vasiloglou, Dan Suciu

Abstract: We demonstrate QirK, a system for answering natural language questions on Knowledge Graphs (KG). QirK can answer structurally complex questions that are still beyond the reach of emerging Large Language Models (LLMs). It does so using a unique combination of database technology, LLMs, and semantic search over vector embeddings. The glue for these components is an intermediate representation (IR). The input question is mapped to IR using LLMs, which is then repaired into a valid relational database query with the aid of a semantic search on vector embeddings. This allows a practical synthesis of LLM capabilities and KG reliability. A short video demonstrating QirK is available at https://youtu.be/6c81BLmOZ0U.

Submitted 14 August, 2024; originally announced August 2024.

arXiv:2404.17679 [pdf, other]

Recent Increments in Incremental View Maintenance

Authors: Dan Olteanu

Abstract: We overview recent progress on the longstanding problem of incremental view maintenance (IVM), with a focus on the fine-grained complexity and optimality of IVM for classes of conjunctive queries. This theoretical progress guided the development of IVM engines that reported practical benefits in academic papers and industrial settings. When taken in isolation, each of the reported advancements is but a small increment. Yet when taken together, they may well pave the way to a deeper understanding of the IVM problem. This paper accompanies the invited Gems of PODS 2024 talk with the same title. Some of the works highlighted in this paper are based on prior or on-going collaborations with: Ahmet Kara, Milos Nikolic, and Haozhe Zhang in the F-IVM project; and Mahmoud Abo Khamis, Niko Göbel, Hung Ngo, and Dan Suciu at RelationalAI.

Submitted 26 April, 2024; originally announced April 2024.

Comments: 18 pages, 7 figures, Gems of PODS 2024

arXiv:2404.16224 [pdf, ps, other]

Tractable Conjunctive Queries over Static and Dynamic Relations

Authors: Ahmet Kara, Zheng Luo, Milos Nikolic, Dan Olteanu, Haozhe Zhang

Abstract: We investigate the evaluation of conjunctive queries over static and dynamic relations. While static relations are given as input and do not change, dynamic relations are subject to inserts and deletes. We characterise syntactically three classes of queries that admit constant update time and constant enumeration delay. We call such queries tractable. Depending on the class, the preprocessing time is linear, polynomial, or exponential (under data complexity, so the query size is constant). To decide whether a query is tractable, it does not suffice to analyse separately the sub-query over the static relations and the sub-query over the dynamic relations. Instead, we need to take the interaction between the static and the dynamic relations into account. Even when the sub-query over the dynamic relations is not tractable, the overall query can become tractable if the dynamic relations are sufficiently constrained by the static ones.

Submitted 24 April, 2024; originally announced April 2024.

ACM Class: H.2.4

arXiv:2312.09331 [pdf, ps, other]

Insert-Only versus Insert-Delete in Dynamic Query Evaluation

Authors: Mahmoud Abo Khamis, Ahmet Kara, Dan Olteanu, Dan Suciu

Abstract: We study the dynamic query evaluation problem: Given a full conjunctive query Q and a sequence of updates to the input database, we construct a data structure that supports constant-delay enumeration of the tuples in the query output after each update. We show that a sequence of N insert-only updates to an initially empty database can be executed in total time O(N^w(Q)), where w(Q) is the fractional hypertree width of Q. This matches the complexity of the static query evaluation problem for Q and a database of size N. One corollary is that the amortized time per single-tuple insert is constant for acyclic full conjunctive queries. In contrast, we show that a sequence of N inserts and deletes can be executed in total time O(N^w(Q')), where Q' is obtained from Q by extending every relational atom with extra variables that represent the "lifespans" of tuples in the database. We show that this reduction is optimal in the sense that the static evaluation runtime of Q' provides a lower bound on the total update time for the output of Q. Our approach achieves amortized optimal update times for the hierarchical and Loomis-Whitney join queries.

Submitted 13 September, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

arXiv:2308.05588 [pdf, other]

Banzhaf Values for Facts in Query Answering

Authors: Omer Abramovich, Daniel Deutch, Nave Frost, Ahmet Kara, Dan Olteanu

Abstract: Quantifying the contribution of database facts to query answers has been studied as a means of explanation. The Banzhaf value, originally developed in Game Theory, is a natural measure of fact contribution, yet its efficient computation for select-project-join-union queries is challenging. In this paper, we introduce three algorithms to compute the Banzhaf value of database facts: an exact algorithm, an anytime deterministic approximation algorithm with relative error guarantees, and an algorithm for ranking and top-$k$. They have three key building blocks: compilation of query lineage into an equivalent function that allows efficient Banzhaf value computation; dynamic programming computation of the Banzhaf values of variables in a Boolean function using the Banzhaf values for constituent functions; and a mechanism to compute efficiently lower and upper bounds on Banzhaf values for any positive DNF function. We complement the algorithms with a dichotomy for the Banzhaf-based ranking problem: given two facts, deciding whether the Banzhaf value of one is greater than that of the other is tractable for hierarchical queries and intractable for non-hierarchical queries. We show experimentally that our algorithms significantly outperform exact and approximate algorithms from prior work, in most cases by up to two orders of magnitude. Our algorithms can also cover challenging problem instances that are beyond reach for prior work.

Submitted 10 August, 2023; originally announced August 2023.
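
To make the notion of fact contribution in the entry above concrete, here is a minimal brute-force sketch of the (unnormalized) Banzhaf value of a variable in a Boolean function. It is not one of the paper's three algorithms: it enumerates all subsets and therefore takes exponential time. The function names and the toy lineage `(a AND b) OR c` are assumptions made for illustration only.

```python
from itertools import combinations


def banzhaf(f, variables, x):
    """Unnormalized Banzhaf value of variable x in Boolean function f.

    f takes the set of variables assigned True and returns a bool.
    The value counts the subsets S of the other variables for which
    adding x flips f from False to True (exponential-time brute force).
    """
    others = [v for v in variables if v != x]
    total = 0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = set(subset)
            total += int(f(s | {x})) - int(f(s))
    return total


def lineage(true_facts):
    """Hypothetical query lineage: (a AND b) OR c."""
    return ("a" in true_facts and "b" in true_facts) or "c" in true_facts


facts = ["a", "b", "c"]
scores = {fact: banzhaf(lineage, facts, fact) for fact in facts}
```

On this toy lineage, facts `a` and `b` each score 1 and fact `c` scores 3, so `c` ranks highest. This matches the intuition that `c` alone suffices to make the query true.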

arXiv:2307.16540 [pdf, ps, other]

ADOPT: Adaptively Optimizing Attribute Orders for Worst-Case Optimal Join Algorithms via Reinforcement Learning

Authors: Junxiong Wang, Immanuel Trummer, Ahmet Kara, Dan Olteanu

Abstract: The performance of worst-case optimal join algorithms depends on the order in which the join attributes are processed. Selecting good orders before query execution is hard, due to the large space of possible orders and unreliable execution cost estimates in case of data skew or data correlation. We propose ADOPT, a query engine that combines adaptive query processing with a worst-case optimal join algorithm, which uses an order on the join attributes instead of a join order on relations. ADOPT divides query execution into episodes in which different attribute orders are tried. Based on run time feedback on attribute order performance, ADOPT converges quickly to near-optimal orders. It avoids redundant work across different orders via a novel data structure, keeping track of parts of the join input that have been successfully processed. It selects attribute orders to try via reinforcement learning, balancing the need for exploring new orders with the desire to exploit promising orders. In experiments with various data sets and queries, it outperforms baselines, including commercial and open-source systems using worst-case optimal join algorithms, whenever queries become complex and therefore difficult to optimize.

Submitted 31 July, 2023; originally announced July 2023.

ACM Class: H.3

arXiv:2306.14211 [pdf, ps, other]

From Shapley Value to Model Counting and Back

Authors: Ahmet Kara, Dan Olteanu, Dan Suciu

Abstract: In this paper we investigate the problem of quantifying the contribution of each variable to the satisfying assignments of a Boolean function based on the Shapley value. Our main result is a polynomial-time equivalence between computing Shapley values and model counting for any class of Boolean functions that are closed under substitutions of variables with disjunctions of fresh variables. This result settles an open problem raised in prior work, which sought to connect the Shapley value computation to probabilistic query evaluation. We show two applications of our result. First, the Shapley values can be computed in polynomial time over deterministic and decomposable circuits, since they are closed under OR-substitutions. Second, there is a polynomial-time equivalence between computing the Shapley value for the tuples contributing to the answer of a Boolean conjunctive query and counting the models in the lineage of the query. This equivalence allows us to immediately recover the dichotomy for Shapley value computation in case of self-join-free Boolean conjunctive queries; in particular, the hardness for non-hierarchical queries can now be shown using a simple reduction from the #P-hard problem of model counting for lineage in positive bipartite disjunctive normal form.

Submitted 25 June, 2023; originally announced June 2023.

Comments: 22 pages

ACM Class: F.4.1; F.2; H.2

arXiv:2306.14075 [pdf, ps, other]

Join Size Bounds using Lp-Norms on Degree Sequences

Authors: Mahmoud Abo Khamis, Vasileios Nakos, Dan Olteanu, Dan Suciu

Abstract: Estimating the output size of a query is a fundamental yet longstanding problem in database query processing. Traditional cardinality estimators used by database systems can routinely underestimate the true output size by orders of magnitude, which leads to a significant system performance penalty. Recently, upper bounds have been proposed that are based on information inequalities and incorporate sizes and max-degrees from input relations, yet their main benefit is limited to cyclic queries, because they degenerate to rather trivial formulas on acyclic queries. We introduce a significant extension of the upper bounds, by incorporating $\ell_p$-norms of the degree sequences of join attributes. Our bounds are significantly lower than previously known bounds, even when applied to acyclic queries. These bounds are also based on information theory, they come with a matching query evaluation algorithm, are computable in exponential time in the query size, and are provably tight when all degrees are "simple".

Submitted 5 June, 2024; v1 submitted 24 June, 2023; originally announced June 2023.
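
As a small illustration of the statistics mentioned in the entry above, the following sketch computes the degree sequence of a join attribute and its $\ell_p$-norms. It then checks one elementary bound for the two-relation join R(X, Y) ⋈ S(Y, Z): by the Cauchy-Schwarz inequality, the output size is at most the product of the $\ell_2$-norms of the two degree sequences on Y. This is only a simple instance of a norm-based bound, not the paper's general framework or algorithm. All function names and the toy relations are assumptions.

```python
from collections import Counter


def degree_sequence(relation, column):
    """Sorted degrees: how often each value occurs in the given column."""
    counts = Counter(row[column] for row in relation)
    return sorted(counts.values(), reverse=True)


def lp_norm(sequence, p):
    """l_p norm of a degree sequence; p = float('inf') gives the max degree."""
    if p == float("inf"):
        return max(sequence, default=0)
    return sum(d ** p for d in sequence) ** (1.0 / p)


def join_size(r, s):
    """Exact output size of R(X, Y) joined with S(Y, Z) on Y."""
    s_degree = Counter(y for y, _ in s)
    return sum(s_degree[y] for _, y in r)


# Toy relations: R(X, Y) and S(Y, Z).
r = [(1, "y1"), (2, "y1"), (3, "y2")]
s = [("y1", 10), ("y1", 11), ("y2", 12)]

deg_r = degree_sequence(r, 1)  # degrees of Y in R
deg_s = degree_sequence(s, 0)  # degrees of Y in S
l2_bound = lp_norm(deg_r, 2) * lp_norm(deg_s, 2)
actual = join_size(r, s)
```

For sets, the $\ell_1$-norm of a degree sequence is the relation size and the $\ell_\infty$-norm is the maximum degree, so these norms generalize the statistics used by the earlier bounds.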

arXiv:2306.09610 [pdf, other]

CHORUS: Foundation Models for Unified Data Discovery and Exploration

Authors: Moe Kayali, Anton Lykov, Ilias Fountalis, Nikolaos Vasiloglou, Dan Olteanu, Dan Suciu

Abstract: We apply foundation models to data discovery and exploration tasks. Foundation models include large language models (LLMs) that show promising performance on a range of diverse tasks unrelated to their training. We show that these models are highly applicable to the data discovery and data exploration domain. When carefully used, they have superior capability on three representative tasks: table-class detection, column-type annotation and join-column prediction. On all three tasks, we show that a foundation-model-based approach outperforms the task-specific models and so the state of the art. Further, our approach often surpasses human-expert task performance. We investigate the fundamental characteristics of this approach including generalizability to several foundation models and the impact of non-determinism on the outputs. All in all, this suggests a future direction in which disparate data management tasks can be unified under foundation models.

Submitted 5 April, 2024; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: To appear in VLDB 2024

arXiv:2303.08583 [pdf, other]

doi 10.1007/s00778-023-00817-w

F-IVM: Analytics over Relational Databases under Updates

Authors: Ahmet Kara, Milos Nikolic, Dan Olteanu, Haozhe Zhang

Abstract: This article describes F-IVM, a unified approach for maintaining analytics over changing relational data. We exemplify its versatility in four disciplines: processing queries with group-by aggregates and joins; learning linear regression models using the covariance matrix of the input features; building Chow-Liu trees using pairwise mutual information of the input features; and matrix chain multiplication. F-IVM has three main ingredients: higher-order incremental view maintenance; factorized computation; and ring abstraction. F-IVM reduces the maintenance of a task to that of a hierarchy of simple views. Such views are functions mapping keys, which are tuples of input values, to payloads, which are elements from a ring. F-IVM also supports efficient factorized computation over keys, payloads, and updates. Finally, F-IVM treats uniformly seemingly disparate tasks. In the key space, all tasks require joins and variable marginalization. In the payload space, tasks differ in the definition of the sum and product ring operations. We implemented F-IVM on top of DBToaster and show that it can outperform classical first-order and fully recursive higher-order incremental view maintenance by orders of magnitude while using less memory.

Submitted 29 January, 2024; v1 submitted 15 March, 2023; originally announced March 2023.
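
The idea of views that map keys to ring payloads can be sketched on a toy case. The code below maintains the count of the join R(A, B) ⋈ S(B, C) under single-tuple inserts and deletes, using the ring of integers as the payload domain, so that a delete is an update with a negative multiplicity. It is a hand-written illustration under these assumptions and not the F-IVM implementation. The class and method names are hypothetical.

```python
from collections import defaultdict


class JoinCountView:
    """Maintains COUNT(*) of R(A, B) JOIN S(B, C) under updates.

    Two auxiliary views map each B-value (the key) to the number of
    matching tuples (an integer payload). The top-level count is the
    sum over B of the product of the two payloads.
    """

    def __init__(self):
        self.r_by_b = defaultdict(int)  # view: B -> multiplicity in R
        self.s_by_b = defaultdict(int)  # view: B -> multiplicity in S
        self.total = 0                  # maintained query result

    def update_r(self, a, b, multiplicity):
        """Apply an insert (+1) or delete (-1) of tuple (a, b) to R."""
        # The delta of the result is the update payload times S's payload.
        self.total += multiplicity * self.s_by_b[b]
        self.r_by_b[b] += multiplicity

    def update_s(self, b, c, multiplicity):
        """Apply an insert (+1) or delete (-1) of tuple (b, c) to S."""
        self.total += self.r_by_b[b] * multiplicity
        self.s_by_b[b] += multiplicity


view = JoinCountView()
view.update_r(1, 10, +1)
view.update_r(2, 10, +1)
view.update_s(10, 7, +1)
view.update_s(10, 8, +1)
view.update_r(1, 10, -1)  # delete (1, 10) from R
```

Each update touches a constant number of dictionary entries, whereas recomputing the join would scan both relations. Swapping the integer payloads for another ring with its own sum and product is the kind of variation the abstract alludes to, though the details there are beyond this sketch.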

arXiv:2206.09032 [pdf, other]

Conjunctive Queries with Free Access Patterns under Updates

Authors: Ahmet Kara, Milos Nikolic, Dan Olteanu, Haozhe Zhang

Abstract: We study the problem of answering conjunctive queries with free access patterns (CQAPs) under updates. A free access pattern is a partition of the free variables of the query into input and output. The query returns tuples over the output variables given a tuple of values over the input variables. We introduce a fully dynamic evaluation approach that works for all CQAPs and is optimal for two classes of CQAPs. This approach recovers prior work on the dynamic evaluation of conjunctive queries without access patterns. We first give a syntactic characterisation of all CQAPs that admit constant time per single-tuple update and whose output tuples can be enumerated with constant delay given a tuple of values over the input variables. We further chart the complexity trade-off between the preprocessing time, update time and enumeration delay for a class of CQAPs. For some of these CQAPs, our approach achieves optimal, albeit non-constant, update time and delay. This optimality is predicated on the Online Matrix-Vector Multiplication conjecture. We finally adapt our approach to the dynamic evaluation of tractable CQAPs over probabilistic databases under updates.

Submitted 3 September, 2024; v1 submitted 17 June, 2022; originally announced June 2022.

Comments: Extended and polished version. Revised Sections 12 and 13 on the evaluation of conjunctive queries with free access patterns over probabilistic databases

ACM Class: H.2.4

arXiv:2204.00525 [pdf, other]

Givens Rotations for QR Decomposition, SVD and PCA over Database Joins

Authors: Dan Olteanu, Nils Vortmeier, Đorđe Živanović

Abstract: This article introduces Figaro, an algorithm for computing the upper-triangular matrix in the QR decomposition of the matrix defined by the natural join over relational data. Figaro's main novelty is that it pushes the QR decomposition past the join. This leads to several desirable properties. For acyclic joins, it takes time linear in the database size and independent of the join size. Its execution is equivalent to the application of a sequence of Givens rotations proportional to the join size. Its number of rounding errors relative to the classical QR decomposition algorithms is on par with the database size relative to the join output size. The QR decomposition lies at the core of many linear algebra computations including the singular value decomposition (SVD) and the principal component analysis (PCA). We show how Figaro can be used to compute the orthogonal matrix in the QR decomposition, the SVD and the PCA of the join output without the need to materialize the join output. A suite of experiments validates that Figaro can outperform both in runtime performance and numerical accuracy the LAPACK library Intel MKL by a factor proportional to the gap between the sizes of the join output and input.

Submitted 16 October, 2023; v1 submitted 1 April, 2022; originally announced April 2022.
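
For readers unfamiliar with the building block named in the entry above, here is a minimal sketch of the classical Givens-rotation procedure that computes the upper-triangular factor R of a dense matrix. This is the textbook method applied to a materialized matrix, not Figaro, which according to the abstract avoids materializing the join output. The function name and the sample matrix are assumptions.

```python
import math


def givens_r_factor(matrix):
    """Upper-triangular factor R of a QR decomposition via Givens rotations.

    Works on a copy of the input (a list of rows of floats). Each rotation
    combines two adjacent rows so that one entry below the diagonal becomes zero.
    """
    a = [row[:] for row in matrix]
    rows, cols = len(a), len(a[0])
    for j in range(min(rows, cols)):
        # Zero out column j below the diagonal, from the bottom row upwards.
        for i in range(rows - 1, j, -1):
            if a[i][j] == 0.0:
                continue  # nothing to eliminate
            r = math.hypot(a[i - 1][j], a[i][j])
            c, s = a[i - 1][j] / r, a[i][j] / r
            for k in range(cols):
                upper, lower = a[i - 1][k], a[i][k]
                a[i - 1][k] = c * upper + s * lower
                a[i][k] = -s * upper + c * lower
    return a


sample = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
r_factor = givens_r_factor(sample)
```

Because every rotation is orthogonal, the result satisfies RᵀR = AᵀA up to rounding, which is a convenient sanity check for an implementation of this kind.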

arXiv:2107.13923 [pdf, ps, other]

Machine Learning over Static and Dynamic Relational Data

Authors: Ahmet Kara, Milos Nikolic, Dan Olteanu, Haozhe Zhang

Abstract: This tutorial overviews principles behind recent works on training and maintaining machine learning models over relational data, with an emphasis on the exploitation of the relational data structure to improve the runtime performance of the learning task. The tutorial has the following parts: 1) Database research for data science 2) Three main ideas to achieve performance improvements 2.1) Turn the ML problem into a DB problem 2.2) Exploit structure of the data and problem 2.3) Exploit engineering tools of a DB researcher 3) Avenues for future research

Submitted 29 July, 2021; originally announced July 2021.

Comments: arXiv admin note: text overlap with arXiv:2008.07864

arXiv:2106.13342 [pdf, other]

The Complexity of Boolean Conjunctive Queries with Intersection Joins

Authors: Mahmoud Abo Khamis, George Chichirim, Antonia Kormpa, Dan Olteanu

Abstract: Intersection joins over interval data are relevant in spatial and temporal data settings. A set of intervals joins if their intersection is non-empty. In case of point intervals, the intersection join becomes the standard equality join. We establish the complexity of Boolean conjunctive queries with intersection joins by a many-one equivalence to disjunctions of Boolean conjunctive queries with equality joins. The complexity of any query with intersection joins is that of the hardest query with equality joins in the disjunction exhibited by our equivalence. This is captured by a new width measure called the IJ-width. We also introduce a new syntactic notion of acyclicity called iota-acyclicity to characterise the class of Boolean queries with intersection joins that admit linear time computation modulo a poly-logarithmic factor in the data size. Iota-acyclicity is for intersection joins what alpha-acyclicity is for equality joins. It strictly sits between gamma-acyclicity and Berge-acyclicity. The intersection join queries that are not iota-acyclic are at least as hard as the Boolean triangle query with equality joins, which is widely considered not computable in linear time.

Submitted 14 April, 2022; v1 submitted 24 June, 2021; originally announced June 2021.

arXiv:2103.06376 [pdf, other]

Functional Collection Programming with Semi-Ring Dictionaries

Authors: Amir Shaikhha, Mathieu Huot, Jaclyn Smith, Dan Olteanu

Abstract: This paper introduces semi-ring dictionaries, a powerful class of compositional and purely functional collections that subsume other collection types such as sets, multisets, arrays, vectors, and matrices. We developed SDQL, a statically typed language that can express relational algebra with aggregations, linear algebra, and functional collections over data such as relations and matrices using semi-ring dictionaries. Furthermore, thanks to the algebraic structure behind these dictionaries, SDQL unifies a wide range of optimizations commonly used in databases (DB) and linear algebra (LA). As a result, SDQL enables efficient processing of hybrid DB and LA workloads, by putting together optimizations that are otherwise confined to either DB systems or LA frameworks. We show experimentally that a handful of DB and LA workloads can take advantage of the SDQL language and optimizations. SDQL can be competitive with or outperforms a host of systems that are state of the art in their own domain: in-memory DB systems Typer and Tectorwise for (flat, not nested) relational data; SciPy for LA workloads; sparse tensor compiler taco; the Trance nested relational engine; and the in-database machine learning engines LMFAO and Morpheus for hybrid DB/LA workloads over relational data.

Submitted 22 March, 2022; v1 submitted 10 March, 2021; originally announced March 2021.

arXiv:2008.08657 [pdf, other]

LMFAO: An Engine for Batches of Group-By Aggregates

Authors: Maximilian Schleich, Dan Olteanu

Abstract: LMFAO is an in-memory optimization and execution engine for large batches of group-by aggregates over joins. Such database workloads capture the data-intensive computation of a variety of data science applications. We demonstrate LMFAO for three popular models: ridge linear regression with batch gradient descent, decision trees with CART, and clustering with Rk-means.

Submitted 19 August, 2020; originally announced August 2020.

Comments: 4 pages, 4 figures

arXiv:2008.07864 [pdf, other]

The Relational Data Borg is Learning

Authors: Dan Olteanu

Abstract: This paper overviews an approach that addresses machine learning over relational data as a database problem. This is justified by two observations. First, the input to the learning task is commonly the result of a feature extraction query over the relational data. Second, the learning task requires the computation of group-by aggregates. This approach has already been investigated for a number of supervised and unsupervised learning tasks, including: ridge linear regression, factorisation machines, support vector machines, decision trees, principal component analysis, and k-means; and also for linear algebra over data matrices. The main message of this work is that the runtime performance of machine learning can be dramatically boosted by a toolbox of techniques that exploit the knowledge of the underlying data. This includes theoretical development on the algebraic, combinatorial, and statistical structure of relational data processing and systems development on code specialisation, low-level computation sharing, and parallelisation. These techniques aim at lowering both the complexity and the constant factors of the learning time. This work is the outcome of extensive collaboration of the author with colleagues from RelationalAI, in particular Mahmoud Abo Khamis, Molham Aref, Hung Ngo, and XuanLong Nguyen, and from the FDB research project, in particular Ahmet Kara, Milos Nikolic, Maximilian Schleich, Amir Shaikhha, Jakub Zavodny, and Haozhe Zhang.
The author would also like to thank the members of the FDB project for the figures and examples used in this paper. The author is grateful for support from industry: Amazon Web Services, Google, Infor, LogicBlox, Microsoft Azure, RelationalAI; and from the funding agencies EPSRC and ERC. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 682588.

Submitted 18 August, 2020; originally announced August 2020.

Comments: 14 pages, 11 figures, VLDB 2020 keynote

arXiv:2006.00694 [pdf, other]

F-IVM: Learning over Fast-Evolving Relational Data

Authors: Milos Nikolic, Haozhe Zhang, Ahmet Kara, Dan Olteanu

Abstract: F-IVM is a system for real-time analytics such as machine learning applications over training datasets defined by queries over fast-evolving relational databases. We will demonstrate F-IVM for three such applications: model selection, Chow-Liu trees, and ridge linear regression.

Submitted 31 May, 2020; originally announced June 2020.

Comments: SIGMOD DEMO 2020, 5 pages

arXiv:2004.03716 [pdf, ps, other]

Maintaining Triangle Queries under Updates

Authors: Ahmet Kara, Milos Nikolic, Hung Q. Ngo, Dan Olteanu, Haozhe Zhang

Abstract: We consider the problem of incrementally maintaining the triangle queries with arbitrary free variables under single-tuple updates to the input relations. We introduce an approach called IVM$^ε$ that exhibits a trade-off between the update time, the space, and the delay for the enumeration of the query result, such that the update time ranges from the square root to linear in the database size while the delay ranges from constant to linear time. IVM$^ε$ achieves Pareto worst-case optimality in the update-delay space conditioned on the Online Matrix-Vector Multiplication conjecture. It is strongly Pareto optimal for the triangle queries with zero or three free variables and weakly Pareto optimal for the triangle queries with one or two free variables.

Submitted 7 April, 2020; originally announced April 2020.

Comments: 47 pages, 18 figures

ACM Class: H.2.4

arXiv:2001.03541 [pdf, other]

Multi-layer Optimizations for End-to-End Data Analytics

Authors: Amir Shaikhha, Maximilian Schleich, Alexandru Ghita, Dan Olteanu

Abstract: We consider the problem of training machine learning models over multi-relational data. The mainstream approach is to first construct the training dataset using a feature extraction query over input database and then use a statistical software package of choice to train the model. In this paper we introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach. IFAQ treats the feature extraction query and the learning task as one program given in the IFAQ's domain-specific language, which captures a subset of Python commonly used in Jupyter notebooks for rapid prototyping of machine learning applications. The program is subject to several layers of IFAQ optimizations, such as algebraic transformations, loop transformations, schema specialization, data layout optimizations, and finally compilation into efficient low-level C++ code specialized for the given workload and data. We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and TensorFlow by several orders of magnitude for linear regression and regression tree models over several relational datasets.

Submitted 10 January, 2020; originally announced January 2020.

arXiv:1912.11098 [pdf, other]

Towards Deterministic Decomposable Circuits for Safe Queries

Authors: Mikaël Monet, Dan Olteanu

Abstract: There exist two approaches for exact probabilistic inference of UCQs on tuple-independent databases. In the extensional approach, query evaluation is performed within a DBMS by exploiting the structure of the query. In the intensional approach, one first builds a representation of the lineage of the query on the database, then computes the probability of the lineage. In this paper we propose a new technique to construct lineage representations as deterministic decomposable circuits in PTIME. The technique can apply to a class of UCQs that has been conjectured to separate the complexity of the two approaches. We test our technique experimentally, and show that it succeeds on all the queries of this class up to a certain size parameter, i.e., over 20 million queries.

Submitted 23 December, 2019; originally announced December 2019.

Comments: 10 pages. Appeared in the workshop AMW'18

arXiv:1911.06577 [pdf, ps, other]

Learning Models over Relational Data: A Brief Tutorial

Authors: Maximilian Schleich, Dan Olteanu, Mahmoud Abo-Khamis, Hung Q. Ngo, XuanLong Nguyen

Abstract: This tutorial overviews the state of the art in learning models over relational databases and makes the case for a first-principles approach that exploits recent developments in database research. The input to learning classification and regression models is a training dataset defined by feature extraction queries over relational databases. The mainstream approach to learning over relational data is to materialize the training dataset, export it out of the database, and then learn over it using a statistical package. This approach can be expensive as it requires the materialization of the training dataset. An alternative approach is to cast the machine learning problem as a database problem by transforming the data-intensive component of the learning task into a batch of aggregates over the feature extraction query and by computing this batch directly over the input database. The tutorial highlights a variety of techniques developed by the database theory and systems communities to improve the performance of the learning task. They rely on structural properties of the relational data and of the feature extraction query, including algebraic (semi-ring), combinatorial (hypertree width), statistical (sampling), or geometric (distance) structure. They also rely on factorized computation, code specialization, query compilation, and parallelization.

Submitted 15 November, 2019; originally announced November 2019.

Comments: 10 pages, 1 figure

ACM Class: H.2.4; I.2.6
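
The idea in the tutorial abstract above, computing an aggregate over the feature extraction query directly on the input relations rather than on the materialized join, can be illustrated with a minimal sketch. The two-relation schema R(a, x) and S(a, y), the aggregate SUM(x*y), and both function names are assumptions chosen for illustration; they are not taken from the tutorial.

```python
from collections import defaultdict


def sum_xy_materialized(R, S):
    """Baseline: enumerate the join of R(a, x) and S(a, y), then sum x*y."""
    return sum(x * y for (a, x) in R for (b, y) in S if a == b)


def sum_xy_pushed_down(R, S):
    """Same aggregate, computed without enumerating the join.

    Uses distributivity of * over +: for each join key a,
    sum over the join of x*y equals (sum of x in R) * (sum of y in S).
    """
    sum_x = defaultdict(float)
    for a, x in R:
        sum_x[a] += x
    sum_y = defaultdict(float)
    for a, y in S:
        sum_y[a] += y
    # Only keys present in both relations contribute to the join.
    return sum(sx * sum_y[a] for a, sx in sum_x.items() if a in sum_y)


# Hypothetical toy data: (join key, feature value)
R = [(1, 2.0), (1, 3.0), (2, 5.0)]
S = [(1, 10.0), (1, 1.0), (3, 7.0)]
result_baseline = sum_xy_materialized(R, S)
result_pushed = sum_xy_pushed_down(R, S)
```

The pushed-down version makes one pass over each relation, whereas the baseline touches every pair of tuples; the systems listed on this page generalize this kind of rewriting to many relations and to batches of aggregates.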

arXiv:1910.04939 [pdf, ps, other]

Rk-means: Fast Clustering for Relational Data

Authors: Ryan Curtin, Ben Moseley, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, Maximilian Schleich

Abstract: Conventional machine learning algorithms cannot be applied until a data matrix is available to process. When the data matrix needs to be obtained from a relational database via a feature extraction query, the computation cost can be prohibitive, as the data matrix may be (much) larger than the total input relation size. This paper introduces Rk-means, or relational k-means algorithm, for clustering relational data tuples without having to access the full data matrix. As such, we avoid having to run the expensive feature extraction query and storing its output. Our algorithm leverages the underlying structures in relational data. It involves construction of a small grid coreset of the data matrix for subsequent cluster construction. This gives a constant approximation for the k-means objective, while having asymptotic runtime improvements over standard approaches of first running the database query and then clustering. Empirical results show orders-of-magnitude speedup, and Rk-means can run faster on the database than even just computing the data matrix.

Submitted 10 October, 2019; originally announced October 2019.

arXiv:1907.01988 [pdf, other]

doi 10.46298/lmcs-19(3:11)2023

Trade-offs in Static and Dynamic Evaluation of Hierarchical Queries

Authors: Ahmet Kara, Milos Nikolic, Dan Olteanu, Haozhe Zhang

Abstract: We investigate trade-offs in static and dynamic evaluation of hierarchical queries with arbitrary free variables. In the static setting, the trade-off is between the time to partially compute the query result and the delay needed to enumerate its tuples. In the dynamic setting, we additionally consider the time needed to update the query result under single-tuple inserts or deletes to the database. Our approach observes the degree of values in the database and uses different computation and maintenance strategies for high-degree (heavy) and low-degree (light) values. For the latter it partially computes the result, while for the former it computes enough information to allow for on-the-fly enumeration. We define the preprocessing time, the update time, and the enumeration delay as functions of the light/heavy threshold. By appropriately choosing this threshold, our approach recovers a number of prior results when restricted to hierarchical queries. We show that for a restricted class of hierarchical queries, our approach achieves worst-case optimal update time and enumeration delay conditioned on the Online Matrix-Vector Multiplication Conjecture.

Submitted 8 August, 2023; v1 submitted 3 July, 2019; originally announced July 2019.

Journal ref: Logical Methods in Computer Science, Volume 19, Issue 3 (August 9, 2023) lmcs:10035

arXiv:1906.08687 [pdf, other]

A Layered Aggregate Engine for Analytics Workloads

Authors: Maximilian Schleich, Dan Olteanu, Mahmoud Abo Khamis, Hung Q. Ngo, XuanLong Nguyen

Abstract: This paper introduces LMFAO (Layered Multiple Functional Aggregate Optimization), an in-memory optimization and execution engine for batches of aggregates over the input database. The primary motivation for this work stems from the observation that for a variety of analytics over databases, their data-intensive tasks can be decomposed into group-by aggregates over the join of the input database relations. We exemplify the versatility and competitiveness of LMFAO for a handful of widely used analytics: learning ridge linear regression, classification trees, regression trees, and the structure of Bayesian networks using Chow-Liu trees; and data cubes used for exploration in data warehousing. LMFAO consists of several layers of logical and code optimizations that systematically exploit sharing of computation, parallelism, and code specialization. We conducted two types of performance benchmarks. In experiments with four datasets, LMFAO outperforms by several orders of magnitude on one hand, a commercial database system and MonetDB for computing batches of aggregates, and on the other hand, TensorFlow, Scikit, R, and AC/DC for learning a variety of models over databases.

Submitted 20 June, 2019; originally announced June 2019.

Comments: 18 pages, 7 figures, 4 tables

ACM Class: H.2.4; I.2.6

arXiv:1902.00585 [pdf, ps, other]

Incremental Techniques for Large-Scale Dynamic Query Processing

Authors: Iman Elghandour, Ahmet Kara, Dan Olteanu, Stijn Vansummeren

Abstract: Many applications from various disciplines are now required to analyze fast evolving big data in real time. Various approaches for incremental processing of queries have been proposed over the years. Traditional approaches rely on updating the results of a query when updates are streamed rather than re-computing these queries, and therefore, higher execution performance is expected. However, they do not perform well for large databases that are updated at high frequencies. Therefore, new algorithms and approaches have been proposed in the literature to address these challenges by, for instance, reducing the complexity of processing updates. Moreover, many of these algorithms are now leveraging distributed streaming platforms such as Spark Streaming and Flink. In this tutorial, we briefly discuss legacy approaches for incremental query processing, and then give an overview of the new challenges introduced due to processing big data streams. We then discuss in detail the recently proposed algorithms that address some of these challenges. We emphasize the characteristics and algorithmic analysis of various proposed approaches and conclude by discussing future research directions.

Submitted 1 February, 2019; originally announced February 2019.

arXiv:1812.09526 [pdf, ps, other]

Functional Aggregate Queries with Additive Inequalities

Authors: Mahmoud Abo Khamis, Ryan R. Curtin, Benjamin Moseley, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, Maximilian Schleich

Abstract: Motivated by fundamental applications in databases and relational machine learning, we formulate and study the problem of answering functional aggregate queries (FAQ) in which some of the input factors are defined by a collection of additive inequalities between variables. We refer to these queries as FAQ-AI for short. To answer FAQ-AI in the Boolean semiring, we define relaxed tree decompositions and relaxed submodular and fractional hypertree width parameters. We show that an extension of the InsideOut algorithm using Chazelle's geometric data structure for solving the semigroup range search problem can answer Boolean FAQ-AI in time given by these new width parameters. This new algorithm achieves lower complexity than known solutions for FAQ-AI. It also recovers some known results in database query answering. Our second contribution is a relaxation of the set of polymatroids that gives rise to the counting version of the submodular width, denoted by #subw. This new width is sandwiched between the submodular and the fractional hypertree widths. Any FAQ and FAQ-AI over one semiring can be answered in time proportional to #subw and respectively to the relaxed version of #subw. We present three applications of our FAQ-AI framework to relational machine learning: k-means clustering, training linear support vector machines, and training models using non-polynomial loss. These optimization problems can be solved over a database asymptotically faster than computing the join of the database relations.

Submitted 15 September, 2020; v1 submitted 22 December, 2018; originally announced December 2018.

arXiv:1804.02780 [pdf, ps, other]

Counting Triangles under Updates in Worst-Case Optimal Time

Authors: Ahmet Kara, Hung Q. Ngo, Milos Nikolic, Dan Olteanu, Haozhe Zhang

Abstract: We consider the problem of incrementally maintaining the triangle count query under single-tuple updates to the input relations. We introduce an approach that exhibits a space-time tradeoff such that the space-time product is quadratic in the size of the input database and the update time can be as low as the square root of this size. This lowest update time is worst-case optimal conditioned on the Online Matrix-Vector Multiplication conjecture. The classical and factorized incremental view maintenance approaches are recovered as special cases of our approach within the space-time tradeoff. In particular, they require linear-time update maintenance, which is suboptimal. Our approach also recovers the worst-case optimal time complexity for computing the triangle count in the non-incremental setting.

Submitted 25 March, 2019; v1 submitted 8 April, 2018; originally announced April 2018.

Comments: simplified notation; incremental maintenance of full triangle query, 4-path count query, count queries with three relations added; improved the space complexity of the dynamic algorithm maintaining the triangle count query

ACM Class: H.2.4
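
For orientation, the classical delta-based maintenance that the abstract above describes as a linear-time special case can be sketched as follows. This is only the baseline, not the paper's square-root-time algorithm; the class name `TriangleCounter`, the method `update`, and the dictionary-based indexes are assumptions made for illustration. The maintained query is the count Q = Σ R(a,b)·S(b,c)·T(c,a).

```python
from collections import defaultdict


class TriangleCounter:
    """Classical first-order IVM for Q = sum over a,b,c of R(a,b)*S(b,c)*T(c,a).

    Each single-tuple update costs time linear in the database size in the
    worst case, since it intersects two adjacency lists.
    """

    def __init__(self):
        # rel[name][x][y] = multiplicity of tuple (x, y) in relation `name`
        self.rel = {n: defaultdict(lambda: defaultdict(int)) for n in "RST"}
        # inv[name][y][x] = same multiplicities, indexed by second column
        self.inv = {n: defaultdict(lambda: defaultdict(int)) for n in "RST"}
        self.count = 0

    def update(self, name, x, y, m=1):
        """Insert (m > 0) or delete (m < 0) tuple (x, y) in relation `name`."""
        # The relations form the cycle R -> S -> T -> R.
        nxt = {"R": "S", "S": "T", "T": "R"}[name]
        prv = {"R": "T", "S": "R", "T": "S"}[name]
        out = self.rel[nxt].get(y, {})  # values z with nxt(y, z)
        inc = self.inv[prv].get(x, {})  # values z with prv(z, x)
        small, large = (out, inc) if len(out) <= len(inc) else (inc, out)
        # Delta of Q for this update: triangles closed by the tuple (x, y).
        delta = sum(cnt * large.get(z, 0) for z, cnt in small.items())
        self.count += m * delta
        self.rel[name][x][y] += m
        self.inv[name][y][x] += m


# Hypothetical update stream
tc = TriangleCounter()
tc.update("R", 1, 2)
tc.update("S", 2, 3)
tc.update("T", 3, 1)
count_after_first_triangle = tc.count
tc.update("R", 1, 4)
tc.update("S", 4, 3)
count_after_second_triangle = tc.count
tc.update("T", 3, 1, m=-1)
count_after_delete = tc.count
```

Deleting T(3, 1) removes both triangles at once because they share that tuple, which shows why a single update may need to touch many partial matches.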

arXiv:1803.07480 [pdf, other]

AC/DC: In-Database Learning Thunderstruck

Authors: Mahmoud Abo Khamis, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, Maximilian Schleich

Abstract: We report on the design and implementation of the AC/DC gradient descent solver for a class of optimization problems over normalized databases. AC/DC decomposes an optimization problem into a set of aggregates over the join of the database relations. It then uses the answers to these aggregates to iteratively improve the solution to the problem until it converges. The challenges faced by AC/DC are the large database size, the mixture of continuous and categorical features, and the large number of aggregates to compute. AC/DC addresses these challenges by employing a sparse data representation, factorized computation, problem reparameterization under functional dependencies, and a data structure that supports shared computation of aggregates. To train polynomial regression models and factorization machines of up to 154K features over the natural join of all relations from a real-world dataset of up to 86M tuples, AC/DC needs up to 30 minutes on one core of a commodity machine. This is up to three orders of magnitude faster than its competitors R, MadLib, libFM, and TensorFlow whenever they finish and thus do not exceed memory limitation, 24-hour timeout, or internal design limitations.

Submitted 15 June, 2018; v1 submitted 20 March, 2018; originally announced March 2018.

Comments: 10 pages, 3 figures

ACM Class: H.2.4; I.2.6

arXiv:1712.07445 [pdf, ps, other]

Boolean Tensor Decomposition for Conjunctive Queries with Negation

Authors: Mahmoud Abo Khamis, Hung Q. Ngo, Dan Olteanu, Dan Suciu

Abstract: We propose an algorithm for answering conjunctive queries with negation, where the negated relations have bounded degree. Its data complexity matches that of the best known algorithms for the positive subquery of the input query and is expressed in terms of the fractional hypertree width and the submodular width. The query complexity depends on the structure of the negated subquery; in general it is exponential in the number of join variables occurring in negated relations yet it becomes polynomial for several classes of queries. This algorithm relies on several contributions. We show how to rewrite queries with negation on bounded-degree relations into equivalent conjunctive queries with not-all-equal (NAE) predicates, which are a multi-dimensional analog of disequality (not-equal). We then generalize the known color-coding technique to conjunctions of NAE predicates and explain it via a Boolean tensor decomposition of conjunctions of NAE predicates. This decomposition can be achieved via a probabilistic construction that can be derandomized efficiently.

Submitted 27 January, 2019; v1 submitted 20 December, 2017; originally announced December 2017.

arXiv:1709.01600 [pdf, ps, other]

Covers of Query Results

Authors: Ahmet Kara, Dan Olteanu

Abstract: We introduce succinct lossless representations of query results called covers. They are subsets of the query results that correspond to minimal edge covers in the hypergraphs of these results. We first study covers whose structures are given by fractional hypertree decompositions of join queries. For any decomposition of a query, we give asymptotically tight size bounds for the covers of the query result over that decomposition and show that such covers can be computed in worst-case optimal time up to a logarithmic factor in the database size. For acyclic join queries, we can compute covers compositionally using query plans with a new operator called cover-join. The tuples in the query result can be enumerated from any of its covers with linearithmic pre-computation time and constant delay. We then generalize covers from joins to functional aggregate queries that express a host of computational problems such as aggregate-join queries, in-database optimization, matrix chain multiplication, and inference in probabilistic graphical models.

Submitted 10 January, 2018; v1 submitted 5 September, 2017; originally announced September 2017.

Comments: 33 pages. Notation simplified

MSC Class: 68P15 ACM Class: H.2.1

arXiv:1703.07484 [pdf, other]

Incremental View Maintenance with Triple Lock Factorization Benefits

Authors: Milos Nikolic, Dan Olteanu

Abstract: We introduce F-IVM, a unified incremental view maintenance (IVM) approach for a variety of tasks, including gradient computation for learning linear regression models over joins, matrix chain multiplication, and factorized evaluation of conjunctive queries. F-IVM is a higher-order IVM algorithm that reduces the maintenance of the given task to the maintenance of a hierarchy of increasingly simpler views. The views are functions mapping keys, which are tuples of input data values, to payloads, which are elements from a task-specific ring. Whereas the computation over the keys is the same for all tasks, the computation over the payloads depends on the task. F-IVM achieves efficiency by factorizing the computation of the keys, payloads, and updates. We implemented F-IVM as an extension of DBToaster. We show in a range of scenarios that it can outperform classical first-order IVM, DBToaster's fully recursive higher-order IVM, and plain recomputation by orders of magnitude while using less memory.

Submitted 28 February, 2018; v1 submitted 21 March, 2017; originally announced March 2017.

Comments: 27 pages, 13 figures, a shorter version appeared in SIGMOD 2018

ACM Class: H.2.4

arXiv:1703.04780 [pdf, other]

Learning Models over Relational Data using Sparse Tensors and Functional Dependencies

Authors: Mahmoud Abo Khamis, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, Maximilian Schleich

Abstract: Integrated solutions for analytics over relational databases are of great practical importance as they avoid the costly repeated loop data scientists have to deal with on a daily basis: select features from data residing in relational databases using feature extraction queries involving joins, projections, and aggregations; export the training dataset defined by such queries; convert this dataset into the format of an external learning tool; and train the desired model using this tool. These integrated solutions are also a fertile ground of theoretically fundamental and challenging problems at the intersection of relational and statistical data models. This article introduces a unified framework for training and evaluating a class of statistical learning models over relational databases. This class includes ridge linear regression, polynomial regression, factorization machines, and principal component analysis. We show that, by synergizing key tools from database theory such as schema information, query structure, functional dependencies, recent advances in query evaluation algorithms, and from linear algebra such as tensor and matrix operations, one can formulate relational analytics problems and design efficient (query and data) structure-aware algorithms to solve them. This theoretical development informed the design and implementation of the AC/DC system for structure-aware learning. We benchmark the performance of AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting and advertisement planning applications, AC/DC can learn polynomial regression models and factorization machines with at least the same accuracy as its competitors and up to three orders of magnitude faster than its competitors whenever they do not run out of memory, exceed 24-hour timeout, or encounter internal design limitations.

Submitted 6 February, 2020; v1 submitted 14 March, 2017; originally announced March 2017.

Comments: 61 pages, 9 figures, 2 tables

ACM Class: H.2.4; I.2.6

arXiv:1412.2221 [pdf, ps, other]

Declarative Statistical Modeling with Datalog

Authors: Vince Barany, Balder ten Cate, Benny Kimelfeld, Dan Olteanu, Zografoula Vagena

Abstract: Formalisms for specifying statistical models, such as probabilistic-programming languages, typically consist of two components: a specification of a stochastic process (the prior), and a specification of observations that restrict the probability space to a conditional subspace (the posterior). Use cases of such formalisms include the development of algorithms in machine learning and artificial intelligence. We propose and investigate a declarative framework for specifying statistical models on top of a database, through an appropriate extension of Datalog. By virtue of extending Datalog, our framework offers a natural integration with the database, and has a robust declarative semantics. Our Datalog extension provides convenient mechanisms to include numerical probability functions; in particular, conclusions of rules may contain values drawn from such functions. The semantics of a program is a probability distribution over the possible outcomes of the input database with respect to the program; these outcomes are minimal solutions with respect to a related program with existentially quantified variables in conclusions. Observations are naturally incorporated by means of integrity constraints over the extensional and intensional relations. We focus on programs that use discrete numerical distributions, but even then the space of possible outcomes may be uncountable (as a solution can be infinite). We define a probability measure over possible outcomes by applying the known concept of cylinder sets to a probabilistic chase procedure. We show that the resulting semantics is robust under different chases. We also identify conditions guaranteeing that all possible outcomes are finite (and then the probability space is discrete). We argue that the framework we propose retains the purely declarative nature of Datalog, and allows for natural specifications of statistical models.

Submitted 5 January, 2015; v1 submitted 6 December, 2014; originally announced December 2014.

Comments: 14 pages, 4 figures

ACM Class: F.1.2; G.3; H.2.3; H.2.4; H.2.8; I.2.3

arXiv:1309.0373 [pdf, other]

ENFrame: A Platform for Processing Probabilistic Data

Authors: Sebastiaan J. van Schaik, Dan Olteanu, Robert Fink

Abstract: This paper introduces ENFrame, a unified data processing platform for querying and mining probabilistic data. Using ENFrame, users can write programs in a fragment of Python with constructs such as bounded-range loops, list comprehension, aggregate operations on lists, and calls to external database engines. The program is then interpreted probabilistically by ENFrame. The realisation of ENFrame required novel contributions along several directions. We propose an event language that is expressive enough to succinctly encode arbitrary correlations, trace the computation of user programs, and allow for computation of discrete probability distributions of program variables. We exemplify ENFrame on three clustering algorithms: k-means, k-medoids, and Markov Clustering. We introduce sequential and distributed algorithms for computing the probability of interconnected events exactly or approximately with error guarantees. Experiments with k-medoids clustering of sensor readings from energy networks show orders-of-magnitude improvements of exact clustering using ENFrame over naïve clustering in each possible world, of approximate over exact, and of distributed over sequential algorithms.

Submitted 2 September, 2013; originally announced September 2013.

Comments: 12 pages

ACM Class: H.2.4; H.2.8; H.3.5

arXiv:1307.0441 [pdf, ps, other]

Aggregation and Ordering in Factorised Databases

Authors: Nurzhan Bakibayev, Tomáš Kočiský, Dan Olteanu, Jakub Závodný

Abstract: A common approach to data analysis involves understanding and manipulating succinct representations of data. In earlier work, we put forward a succinct representation system for relational data called factorised databases and reported on the main-memory query engine FDB for select-project-join queries on such databases. In this paper, we extend FDB to support a larger class of practical queries with aggregates and ordering. This requires novel optimisation and evaluation techniques. We show how factorisation coupled with partial aggregation can effectively reduce the number of operations needed for query evaluation. We also show how factorisations of query results can support enumeration of tuples in desired orders as efficiently as listing them from the unfactorised, sorted results. We experimentally observe that FDB can outperform off-the-shelf relational engines by orders of magnitude.

Submitted 1 July, 2013; originally announced July 2013.

Comments: 12 pages, 8 figures

arXiv:1203.2672 [pdf, ps, other]

FDB: A Query Engine for Factorised Relational Databases

Authors: Nurzhan Bakibayev, Dan Olteanu, Jakub Závodný

Abstract: Factorised databases are relational databases that use compact factorised representations at the physical layer to reduce data redundancy and boost query performance. This paper introduces FDB, an in-memory query engine for select-project-join queries on factorised databases. Key components of FDB are novel algorithms for query optimisation and evaluation that exploit the succinctness brought by data factorisation. Experiments show that for data sets with many-to-many relationships FDB can outperform relational engines by orders of magnitude.

Submitted 12 March, 2012; originally announced March 2012.

Comments: 12 pages, 9 figures

arXiv:1201.6569 [pdf, other]

Aggregation in Probabilistic Databases via Knowledge Compilation

Authors: Robert Fink, Larisa Han, Dan Olteanu

Abstract: This paper presents a query evaluation technique for positive relational algebra queries with aggregates on a representation system for probabilistic data based on the algebraic structures of semiring and semimodule. The core of our evaluation technique is a procedure that compiles semimodule and semiring expressions into so-called decomposition trees, for which the computation of the probability distribution can be done in time linear in the product of the sizes of the probability distributions represented by its nodes. We give syntactic characterisations of tractable queries with aggregates by exploiting the connection between query tractability and polynomial-time decomposition trees. A prototype of the technique is incorporated in the probabilistic database engine SPROUT. We report on performance experiments with custom datasets and TPC-H data.

Submitted 31 January, 2012; originally announced January 2012.

Comments: VLDB2012

Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 5, pp. 490-501 (2012)

arXiv:1104.0867 [pdf, ps, other]

Factorised Representations of Query Results

Authors: Dan Olteanu, Jakub Zavodny

Abstract: Query tractability has been traditionally defined as a function of input database and query sizes, or of both input and output sizes, where the query result is represented as a bag of tuples. In this report, we introduce a framework that allows to investigate tractability beyond this setting. The key insight is that, although the cardinality of a query result can be exponential, its structure can be very regular and thus factorisable into a nested representation whose size is only polynomial in the size of both the input database and query. For a given query result, there may be several equivalent representations, and we quantify the regularity of the result by its readability, which is the minimum over all its representations of the maximum number of occurrences of any tuple in that representation. We give a characterisation of select-project-join queries based on the bounds on readability of their results for any input database. We complement it with an algorithm that can find asymptotically optimal upper bounds and corresponding factorised representations.

Submitted 5 April, 2011; originally announced April 2011.

Comments: 44 pages, 13 figures

ACM Class: H.2.3; H.2.4

arXiv:0805.3413 [pdf, ps, other]

doi 10.1111/j.1365-2966.2008.13486.x

A semi-empirical simulation of the extragalactic radio continuum sky for next generation radio telescopes

Authors: R. J. Wilman, L. Miller, M. J. Jarvis, T. Mauch, F. Levrier, F. B. Abdalla, S. Rawlings, H. -R. Kloeckner, D. Obreschkow, D. Olteanu, S. Young

Abstract: We have developed a semi-empirical simulation of the extragalactic radio continuum sky suitable for aiding the design of next generation radio interferometers such as the Square Kilometre Array (SKA). The emphasis is on modelling the large-scale cosmological distribution of radio sources rather than the internal details of individual galaxies. Here we provide a description of the simulation to accompany the online release of a catalogue of 320 million simulated radio sources. The simulation covers 20x20 deg^2 - a plausible upper limit to the instantaneous field of view attainable with future (e.g. SKA) aperture array technologies - out to redshift z=20, and down to flux density limits of 10 nJy at 151, 610 MHz, 1.4, 4.86 and 18 GHz. Five distinct source types are included: radio-quiet AGN, radio-loud AGN of the FRI and FRII structural classes, and star-forming galaxies, the latter split into populations of quiescent and starbursting galaxies. In our semi-empirical approach, the simulated sources are drawn from observed (or extrapolated) luminosity functions and grafted onto an underlying dark matter density field with biases which reflect their measured large-scale clustering. A numerical Press-Schechter-style filtering of the density field is used to identify and populate clusters of galaxies. Radio source structures are built from point source and elliptical sub-components, and for FRI and FRII sources an orientation-based unification and beaming model is used to partition flux between the core and extended lobes and hotspots. The simulation output can be post-processed to achieve more complete agreement with observational data in the years ahead, with the aim of using these 'idealised skies' in telescope simulators to optimise the design of the SKA itself (abridged).

Submitted 22 May, 2008; originally announced May 2008.

Comments: 15 pages; to appear in MNRAS

Journal ref: Mon.Not.Roy.Astron.Soc.388:1335-1348,2008

arXiv:0803.2212 [pdf, ps, other]

Conditioning Probabilistic Databases

Authors: Christoph Koch, Dan Olteanu

Abstract: Past research on probabilistic databases has studied the problem of answering queries on a static database. Application scenarios of probabilistic databases however often involve the conditioning of a database using additional information in the form of new evidence. The conditioning problem is thus to transform a probabilistic database of priors into a posterior probabilistic database which is materialized for subsequent query processing or further refinement. It turns out that the conditioning problem is closely related to the problem of computing exact tuple confidence values. It is known that exact confidence computation is an NP-hard problem. This has led researchers to consider approximation techniques for confidence computation. However, neither conditioning nor exact confidence computation can be solved using such techniques. In this paper we present efficient techniques for both problems. We study several problem decomposition methods and heuristics that are based on the most successful search techniques from constraint satisfaction, such as the Davis-Putnam algorithm. We complement this with a thorough experimental evaluation of the algorithms proposed. Our experiments show that our exact algorithms scale well to realistic database sizes and can in some scenarios compete with the most efficient previous approximation algorithms.

Submitted 16 June, 2008; v1 submitted 14 March, 2008; originally announced March 2008.

Comments: 13 pages, 13 figures

ACM Class: H.2.1; H.2.4

arXiv:0707.1644 [pdf, ps, other]

Fast and Simple Relational Processing of Uncertain Data

Authors: Lyublena Antova, Thomas Jansen, Christoph Koch, Dan Olteanu

Abstract: This paper introduces U-relations, a succinct and purely relational representation system for uncertain databases. U-relations support attribute-level uncertainty using vertical partitioning. If we consider positive relational algebra extended by an operation for computing possible answers, a query on the logical level can be translated into, and evaluated as, a single relational algebra query on the U-relation representation. The translation scheme essentially preserves the size of the query in terms of number of operations and, in particular, number of joins. Standard techniques employed in off-the-shelf relational database management systems are effective for optimizing and processing queries on U-relations. In our experiments we show that query evaluation on U-relations scales to large amounts of data with high degrees of uncertainty.

Submitted 11 July, 2007; originally announced July 2007.

Comments: 12 pages, 14 figures

ACM Class: H.2.1; H.2.4

arXiv:0705.4442 [pdf, ps, other]

World-set Decompositions: Expressiveness and Efficient Algorithms

Authors: Dan Olteanu, Christoph Koch, Lyublena Antova

Abstract: Uncertain information is commonplace in real-world data management scenarios. The ability to represent large sets of possible instances (worlds) while supporting efficient storage and processing is an important challenge in this context. The recent formalism of world-set decompositions (WSDs) provides a space-efficient representation for uncertain data that also supports scalable processing. WSDs are complete for finite world-sets in that they can represent any finite set of possible worlds. For possibly infinite world-sets, we show that a natural generalization of WSDs precisely captures the expressive power of c-tables. We then show that several important decision problems are efficiently solvable on WSDs while they are NP-hard on c-tables. Finally, we give a polynomial-time algorithm for factorizing WSDs, i.e. an efficient algorithm for minimizing such representations.

Submitted 9 January, 2008; v1 submitted 30 May, 2007; originally announced May 2007.

Comments: 34 pages, 13 figures, extended version of ICDT'07 paper

ACM Class: H.2.1; H.2.4

arXiv:cs/0606075 [pdf, ps, other]

10^(10^6) Worlds and Beyond: Efficient Representation and Processing of Incomplete Information

Authors: Lyublena Antova, Christoph Koch, Dan Olteanu

Abstract: Current systems and formalisms for representing incomplete information generally suffer from at least one of two weaknesses. Either they are not strong enough for representing results of simple queries, or the handling and processing of the data, e.g. for query evaluation, is intractable. In this paper, we present a decomposition-based approach to addressing this problem. We introduce world-set decompositions (WSDs), a space-efficient formalism for representing any finite set of possible worlds over relational databases. WSDs are therefore a strong representation system for any relational query language. We study the problem of efficiently evaluating relational algebra queries on sets of worlds represented by WSDs. We also evaluate our technique experimentally in a large census data scenario and show that it is both scalable and efficient.

Submitted 13 February, 2008; v1 submitted 16 June, 2006; originally announced June 2006.

Comments: 17 pages, 24 figures

ACM Class: H.2.1; H.2.4

Showing 1–44 of 44 results for author: Olteanu, D