subscribe to arXiv mailings

Galley: Modern Query Optimization for Sparse Tensor Programs

Authors: Kyle Deeds, Willow Ahrens, Magda Balazinska, Dan Suciu

Abstract: The tensor programming abstraction has become a foundational paradigm for modern computing. This framework allows users to write high performance programs for bulk computation via a high-level imperative interface. Recent work has extended this paradigm to sparse tensors (i.e. tensors where most entries are not explicitly represented) with the use of sparse tensor compilers. These systems excel at… ▽ More The tensor programming abstraction has become a foundational paradigm for modern computing. This framework allows users to write high performance programs for bulk computation via a high-level imperative interface. Recent work has extended this paradigm to sparse tensors (i.e. tensors where most entries are not explicitly represented) with the use of sparse tensor compilers. These systems excel at producing efficient code for computation over sparse tensors, which may be stored in a wide variety of formats. However, they require the user to manually choose the order of operations and the data formats at every step. Unfortunately, these decisions are both highly impactful and complicated, requiring significant effort to manually optimize. In this work, we present Galley, a system for declarative sparse tensor programming. Galley performs cost-based optimization to lower these programs to a logical plan then to a physical plan. It then leverages sparse tensor compilers to execute the physical plan efficiently. We show that Galley achieves high performance on a wide variety of problems including machine learning algorithms, subgraph counting, and iterative graph algorithms. △ Less

Submitted 31 August, 2024; v1 submitted 26 August, 2024; originally announced August 2024.

arXiv:2408.07494 [pdf, other]

QirK: Question Answering via Intermediate Representation on Knowledge Graphs

Authors: Jan Luca Scheerer, Anton Lykov, Moe Kayali, Ilias Fountalis, Dan Olteanu, Nikolaos Vasiloglou, Dan Suciu

Abstract: We demonstrate QirK, a system for answering natural language questions on Knowledge Graphs (KG). QirK can answer structurally complex questions that are still beyond the reach of emerging Large Language Models (LLMs). It does so using a unique combination of database technology, LLMs, and semantic search over vector embeddings. The glue for these components is an intermediate representation (IR).… ▽ More We demonstrate QirK, a system for answering natural language questions on Knowledge Graphs (KG). QirK can answer structurally complex questions that are still beyond the reach of emerging Large Language Models (LLMs). It does so using a unique combination of database technology, LLMs, and semantic search over vector embeddings. The glue for these components is an intermediate representation (IR). The input question is mapped to IR using LLMs, which is then repaired into a valid relational database query with the aid of a semantic search on vector embeddings. This allows a practical synthesis of LLM capabilities and KG reliability. A short video demonstrating QirK is available at https://youtu.be/6c81BLmOZ0U. △ Less

Submitted 14 August, 2024; originally announced August 2024.

arXiv:2407.10367 [pdf, other]

doi 10.5220/0012606000003693

Perceptions of Entrepreneurship Among Graduate Students: Challenges, Opportunities, and Cultural Biases

Authors: Manuela Andreea Petrescu, Dan Mircea Suciu

Abstract: The purpose of the paper is to examine the perceptions of entrepreneurship of graduate students enrolled in a digital-oriented entrepreneurship course, focusing on the challenges and opportunities related to starting a business. In today's digital era, businesses heavily depend on tailored software solutions to facilitate their operational processes, foster expansion, and enhance their competitive… ▽ More The purpose of the paper is to examine the perceptions of entrepreneurship of graduate students enrolled in a digital-oriented entrepreneurship course, focusing on the challenges and opportunities related to starting a business. In today's digital era, businesses heavily depend on tailored software solutions to facilitate their operational processes, foster expansion, and enhance their competitive edge, thus assuming, to a certain degree, the characteristics of software companies. For data gathering, we used online exploratory surveys. The findings indicated that although entrepreneurship was considered an attractive option by students, very few of them declared that they intended to start a business soon. The main issues raised by the students were internal traits and external obstacles, such as lack of resources and support. Gender discrimination and cultural biases persist, limiting opportunities and equality for women. In terms of gender, women face limited representation in leadership roles, are expected to do more unpaid 'family work', are perceived as less capable in ding business, and need to prove their skills. Even if women are less discriminated now, both genders agree that women still face discrimination in business domain. In terms of percentages, women mentioned gender discrimination in higher percentages. Addressing these issues requires awareness, education, and policy changes to ensure fair treatment and opportunities for women. △ Less

Submitted 10 May, 2024; originally announced July 2024.

Comments: In Proceedings of the 16th International Conference on Computer Supported Education - Volume 1, ISBN 978-989-758-697-2, ISSN 2184-5026, pages 347-354

ACM Class: K.3.2; J.4

arXiv:2405.06767 [pdf, other]

Color: A Framework for Applying Graph Coloring to Subgraph Cardinality Estimation

Authors: Kyle Deeds, Diandre Sabale, Moe Kayali, Dan Suciu

Abstract: Graph workloads pose a particularly challenging problem for query optimizers. They typically feature large queries made up of entirely many-to-many joins with complex correlations. This puts significant stress on traditional cardinality estimation methods which generally see catastrophic errors when estimating the size of queries with only a handful of joins. To overcome this, we propose COLOR, a… ▽ More Graph workloads pose a particularly challenging problem for query optimizers. They typically feature large queries made up of entirely many-to-many joins with complex correlations. This puts significant stress on traditional cardinality estimation methods which generally see catastrophic errors when estimating the size of queries with only a handful of joins. To overcome this, we propose COLOR, a framework for subgraph cardinality estimation which applies insights from graph compression theory to produce a compact summary that captures the global topology of the data graph. Further, we identify several key optimizations that enable tractable estimation over this summary even for large query graphs. We then evaluate several designs within this framework and find that they improve accuracy by up to 10$^3$x over all competing methods while maintaining fast inference, a small memory footprint, efficient construction, and graceful degradation under updates. △ Less

Submitted 10 May, 2024; originally announced May 2024.

arXiv:2402.02001 [pdf, ps, other]

PANDA: Query Evaluation in Submodular Width

Authors: Mahmoud Abo Khamis, Hung Q. Ngo, Dan Suciu

Abstract: In recent years, several information-theoretic upper bounds have been introduced on the output size and evaluation cost of database join queries. These bounds vary in their power depending on both the type of statistics on input relations and the query plans that they support. This motivated the search for algorithms that can compute the output of a join query in times that are bounded by the corr… ▽ More In recent years, several information-theoretic upper bounds have been introduced on the output size and evaluation cost of database join queries. These bounds vary in their power depending on both the type of statistics on input relations and the query plans that they support. This motivated the search for algorithms that can compute the output of a join query in times that are bounded by the corresponding information-theoretic bounds. In this paper, we describe PANDA, an algorithm that takes a Shannon-inequality that underlies the bound, and translates each proof step into an algorithmic step corresponding to some database operation. PANDA computes answers to a conjunctive query in time given by the the submodular width plus the output size of the query. The version in this paper represents a significant simplification of the original version [ANS, PODS'17]. △ Less

Submitted 13 September, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

arXiv:2401.16210 [pdf, ps, other]

The Non-Cancelling Intersections Conjecture

Authors: Antoine Amarilli, Mikaël Monet, Dan Suciu

Abstract: In this note, we present a conjecture on intersections of set families, and a rephrasing of the conjecture in terms of principal downsets of Boolean lattices. The conjecture informally states that, whenever we can express the measure of a union of sets in terms of the measure of some of their intersections using the inclusion-exclusion formula, then we can express the union as a set from these sam… ▽ More In this note, we present a conjecture on intersections of set families, and a rephrasing of the conjecture in terms of principal downsets of Boolean lattices. The conjecture informally states that, whenever we can express the measure of a union of sets in terms of the measure of some of their intersections using the inclusion-exclusion formula, then we can express the union as a set from these same intersections via the set operations of disjoint union and subset complement. We also present a partial result towards establishing the conjecture. △ Less

Submitted 29 January, 2024; originally announced January 2024.

Comments: 30 pages

arXiv:2312.09331 [pdf, ps, other]

Insert-Only versus Insert-Delete in Dynamic Query Evaluation

Authors: Mahmoud Abo Khamis, Ahmet Kara, Dan Olteanu, Dan Suciu

Abstract: We study the dynamic query evaluation problem: Given a full conjunctive query Q and a sequence of updates to the input database, we construct a data structure that supports constant-delay enumeration of the tuples in the query output after each update. We show that a sequence of N insert-only updates to an initially empty database can be executed in total time O(N^w(Q)), where w(Q) is the fracti… ▽ More We study the dynamic query evaluation problem: Given a full conjunctive query Q and a sequence of updates to the input database, we construct a data structure that supports constant-delay enumeration of the tuples in the query output after each update. We show that a sequence of N insert-only updates to an initially empty database can be executed in total time O(N^w(Q)), where w(Q) is the fractional hypertree width of Q. This matches the complexity of the static query evaluation problem for Q and a database of size N. One corollary is that the amortized time per single-tuple insert is constant for acyclic full conjunctive queries. In contrast, we show that a sequence of N inserts and deletes can be executed in total time O(N^w(Q')), where Q' is obtained from Q by extending every relational atom with extra variables that represent the "lifespans" of tuples in the database. We show that this reduction is optimal in the sense that the static evaluation runtime of Q' provides a lower bound on the total update time for the output of Q. Our approach achieves amortized optimal update times for the hierarchical and Loomis-Whitney join queries. △ Less

Submitted 13 September, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

arXiv:2309.12347 [pdf, other]

Transitioning a Project-Based Course between Onsite and Online. An Experience Report

Authors: Dan Mircea Suciu, Simona Motogna, Arthur-Jozsef Molnar

Abstract: We present an investigation regarding the challenges faced by student teams across four consecutive iterations of a team-focused, project-based course in software engineering. The studied period includes the switch to fully online activities in the spring of 2020, and covers the return to face-to-face teaching two years later. We cover the feedback provided by over 1,500 students, collected in a f… ▽ More We present an investigation regarding the challenges faced by student teams across four consecutive iterations of a team-focused, project-based course in software engineering. The studied period includes the switch to fully online activities in the spring of 2020, and covers the return to face-to-face teaching two years later. We cover the feedback provided by over 1,500 students, collected in a free-text form on the basis of a survey. A qualitative research method was utilized to discern and examine the challenges and perceived benefits of a course that was conducted entirely online. We show that technical challenges remain a constant in project-based courses, with time management being the most affected by the move to online. Students reported that the effective use of collaborative tools eased team organization and communication while online. We conclude by providing a number of action points regarding the integration of online activities in face-to-face course unfolding related to project management, communication tools, the importance of teamwork, and of active mentor participation. △ Less

Submitted 28 August, 2023; originally announced September 2023.

Comments: 12 pages, 2 figures

arXiv:2306.14211 [pdf, ps, other]

From Shapley Value to Model Counting and Back

Authors: Ahmet Kara, Dan Olteanu, Dan Suciu

Abstract: In this paper we investigate the problem of quantifying the contribution of each variable to the satisfying assignments of a Boolean function based on the Shapley value. Our main result is a polynomial-time equivalence between computing Shapley values and model counting for any class of Boolean functions that are closed under substitutions of variables with disjunctions of fresh variables. This… ▽ More In this paper we investigate the problem of quantifying the contribution of each variable to the satisfying assignments of a Boolean function based on the Shapley value. Our main result is a polynomial-time equivalence between computing Shapley values and model counting for any class of Boolean functions that are closed under substitutions of variables with disjunctions of fresh variables. This result settles an open problem raised in prior work, which sought to connect the Shapley value computation to probabilistic query evaluation. We show two applications of our result. First, the Shapley values can be computed in polynomial time over deterministic and decomposable circuits, since they are closed under OR-substitutions. Second, there is a polynomial-time equivalence between computing the Shapley value for the tuples contributing to the answer of a Boolean conjunctive query and counting the models in the lineage of the query. This equivalence allows us to immediately recover the dichotomy for Shapley value computation in case of self-join-free Boolean conjunctive queries; in particular, the hardness for non-hierarchical queries can now be shown using a simple reduction from the #P-hard problem of model counting for lineage in positive bipartite disjunctive normal form. △ Less

Submitted 25 June, 2023; originally announced June 2023.

Comments: 22 pages

ACM Class: F.4.1; F.2; H.2

arXiv:2306.14075 [pdf, ps, other]

Join Size Bounds using Lp-Norms on Degree Sequences

Authors: Mahmoud Abo Khamis, Vasileios Nakos, Dan Olteanu, Dan Suciu

Abstract: Estimating the output size of a query is a fundamental yet longstanding problem in database query processing. Traditional cardinality estimators used by database systems can routinely underestimate the true output size by orders of magnitude, which leads to significant system performance penalty. Recently, upper bounds have been proposed that are based on information inequalities and incorporate s… ▽ More Estimating the output size of a query is a fundamental yet longstanding problem in database query processing. Traditional cardinality estimators used by database systems can routinely underestimate the true output size by orders of magnitude, which leads to significant system performance penalty. Recently, upper bounds have been proposed that are based on information inequalities and incorporate sizes and max-degrees from input relations, yet they their main benefit is limited to cyclic queries, because they degenerate to rather trivial formulas on acyclic queries. We introduce a significant extension of the upper bounds, by incorporating $\ell_p$-norms of the degree sequences of join attributes. Our bounds are significantly lower than previously known bounds, even when applied to acyclic queries. These bounds are also based on information theory, they come with a matching query evaluation algorithm, are computable in exponential time in the query size, and are provably tight when all degrees are "simple". △ Less

Submitted 5 June, 2024; v1 submitted 24 June, 2023; originally announced June 2023.

arXiv:2306.09610 [pdf, other]

CHORUS: Foundation Models for Unified Data Discovery and Exploration

Authors: Moe Kayali, Anton Lykov, Ilias Fountalis, Nikolaos Vasiloglou, Dan Olteanu, Dan Suciu

Abstract: We apply foundation models to data discovery and exploration tasks. Foundation models include large language models (LLMs) that show promising performance on a range of diverse tasks unrelated to their training. We show that these models are highly applicable to the data discovery and data exploration domain. When carefully used, they have superior capability on three representative tasks: table-c… ▽ More We apply foundation models to data discovery and exploration tasks. Foundation models include large language models (LLMs) that show promising performance on a range of diverse tasks unrelated to their training. We show that these models are highly applicable to the data discovery and data exploration domain. When carefully used, they have superior capability on three representative tasks: table-class detection, column-type annotation and join-column prediction. On all three tasks, we show that a foundation-model-based approach outperforms the task-specific models and so the state of the art. Further, our approach often surpasses human-expert task performance. We investigate the fundamental characteristics of this approach including generalizability to several foundation models and the impact of non-determinism on the outputs. All in all, this suggests a future direction in which disparate data management tasks can be unified under foundation models. △ Less

Submitted 5 April, 2024; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: To appear in VLDB 2024

arXiv:2304.11996 [pdf, other]

Applications of Information Inequalities to Database Theory Problems

Authors: Dan Suciu

Abstract: The paper describes several applications of information inequalities to problems in database theory. The problems discussed include: upper bounds of a query's output, worst-case optimal join algorithms, the query domination problem, and the implication problem for approximate integrity constraints. The paper is self-contained: all required concepts and results from information inequalities are int… ▽ More The paper describes several applications of information inequalities to problems in database theory. The problems discussed include: upper bounds of a query's output, worst-case optimal join algorithms, the query domination problem, and the implication problem for approximate integrity constraints. The paper is self-contained: all required concepts and results from information inequalities are introduced here, gradually, and motivated by database problems. △ Less

Submitted 4 June, 2024; v1 submitted 24 April, 2023; originally announced April 2023.

Comments: This paper was invited for LICS'2023

arXiv:2301.10841 [pdf, other]

Free Join: Unifying Worst-Case Optimal and Traditional Joins

Authors: Yisu Remy Wang, Max Willsey, Dan Suciu

Abstract: Over the last decade, worst-case optimal join (WCOJ) algorithms have emerged as a new paradigm for one of the most fundamental challenges in query processing: computing joins efficiently. Such an algorithm can be asymptotically faster than traditional binary joins, all the while remaining simple to understand and implement. However, they have been found to be less efficient than the old paradigm,… ▽ More Over the last decade, worst-case optimal join (WCOJ) algorithms have emerged as a new paradigm for one of the most fundamental challenges in query processing: computing joins efficiently. Such an algorithm can be asymptotically faster than traditional binary joins, all the while remaining simple to understand and implement. However, they have been found to be less efficient than the old paradigm, traditional binary join plans, on the typical acyclic queries found in practice. Some database systems that support WCOJ use a hypbrid approach: use WCOJ to process the cyclic subparts of the query (if any), and rely on traditional binary joins otherwise. In this paper we propose a new framework, called Free Join, that unifies the two paradigms. We describe a new type of plan, a new data structure (which unifies the hash tables and tries used by the two paradigms), and a suite of optimization techniques. Our system, implemented in Rust, matches or outperforms both traditional binary joins and Generic Join on standard query benchmarks. △ Less

Submitted 27 January, 2023; v1 submitted 25 January, 2023; originally announced January 2023.

arXiv:2211.11912 [pdf, other]

Quasi-stable Coloring for Graph Compression: Approximating Max-Flow, Linear Programs, and Centrality

Authors: Moe Kayali, Dan Suciu

Abstract: We propose quasi-stable coloring, an approximate version of stable coloring. Stable coloring, also called color refinement, is a well-studied technique in graph theory for classifying vertices, which can be used to build compact, lossless representations of graphs. However, its usefulness is limited due to its reliance on strict symmetries. Real data compresses very poorly using color refinement.… ▽ More We propose quasi-stable coloring, an approximate version of stable coloring. Stable coloring, also called color refinement, is a well-studied technique in graph theory for classifying vertices, which can be used to build compact, lossless representations of graphs. However, its usefulness is limited due to its reliance on strict symmetries. Real data compresses very poorly using color refinement. We propose the first, to our knowledge, approximate color refinement scheme, which we call quasi-stable coloring. By using approximation, we alleviate the need for strict symmetry, and allow for a tradeoff between the degree of compression and the accuracy of the representation. We study three applications: Linear Programming, Max-Flow, and Betweenness Centrality, and provide theoretical evidence in each case that a quasi-stable coloring can lead to good approximations on the reduced graph. Next, we consider how to compute a maximal quasi-stable coloring: we prove that, in general, this problem is NP-hard, and propose a simple, yet effective algorithm based on heuristics. Finally, we evaluate experimentally the quasi-stable coloring technique on several real graphs and applications, comparing with prior approximation techniques. A reference implementation and the experiment code are available at https://github.com/mkyl/QuasiStableColors.jl . △ Less

Submitted 28 November, 2022; v1 submitted 21 November, 2022; originally announced November 2022.

Comments: To be presented at VLDB 2023

arXiv:2211.09864 [pdf, other]

SafeBound: A Practical System for Generating Cardinality Bounds

Authors: Kyle Deeds, Dan Suciu, Magda Balazinska

Abstract: Recent work has reemphasized the importance of cardinality estimates for query optimization. While new techniques have continuously improved in accuracy over time, they still generally allow for under-estimates which often lead optimizers to make overly optimistic decisions. This can be very costly for expensive queries. An alternative approach to estimation is cardinality bounding, also called pe… ▽ More Recent work has reemphasized the importance of cardinality estimates for query optimization. While new techniques have continuously improved in accuracy over time, they still generally allow for under-estimates which often lead optimizers to make overly optimistic decisions. This can be very costly for expensive queries. An alternative approach to estimation is cardinality bounding, also called pessimistic cardinality estimation, where the cardinality estimator provides guaranteed upper bounds of the true cardinality. By never underestimating, this approach allows the optimizer to avoid potentially inefficient plans. However, existing pessimistic cardinality estimators are not yet practical: they use very limited statistics on the data, and cannot handle predicates. In this paper, we introduce SafeBound, the first practical system for generating cardinality bounds. SafeBound builds on a recent theoretical work that uses degree sequences on join attributes to compute cardinality bounds, extends this framework with predicates, introduces a practical compression method for the degree sequences, and implements an efficient inference algorithm. Across four workloads, SafeBound achieves up to 80% lower end-to-end runtimes than PostgreSQL, and is on par or better than state of the art ML-based estimators and pessimistic cardinality estimators, by improving the runtime of the expensive queries. It also saves up to 500x in query planning time, and uses up to 6.8x less space compared to state of the art cardinality estimation methods. △ Less

Submitted 17 November, 2022; originally announced November 2022.

arXiv:2210.17071 [pdf, other]

Computing Rule-Based Explanations by Leveraging Counterfactuals

Authors: Zixuan Geng, Maximilian Schleich, Dan Suciu

Abstract: Sophisticated machine models are increasingly used for high-stakes decisions in everyday life. There is an urgent need to develop effective explanation techniques for such automated decisions. Rule-Based Explanations have been proposed for high-stake decisions like loan applications, because they increase the users' trust in the decision. However, rule-based explanations are very inefficient to co… ▽ More Sophisticated machine models are increasingly used for high-stakes decisions in everyday life. There is an urgent need to develop effective explanation techniques for such automated decisions. Rule-Based Explanations have been proposed for high-stake decisions like loan applications, because they increase the users' trust in the decision. However, rule-based explanations are very inefficient to compute, and existing systems sacrifice their quality in order to achieve reasonable performance. We propose a novel approach to compute rule-based explanations, by using a different type of explanation, Counterfactual Explanations, for which several efficient systems have already been developed. We prove a Duality Theorem, showing that rule-based and counterfactual-based explanations are dual to each other, then use this observation to develop an efficient algorithm for computing rule-based explanations, which uses the counterfactual-based explanation as an oracle. We conduct extensive experiments showing that our system computes rule-based explanations of higher quality, and with the same or better performance, than two previous systems, MinSetCover and Anchor. △ Less

Submitted 31 October, 2022; originally announced October 2022.

arXiv:2210.06267 [pdf, other]

Optimizing Tensor Programs on Flexible Storage

Authors: Maximilian Schleich, Amir Shaikhha, Dan Suciu

Abstract: Tensor programs often need to process large tensors (vectors, matrices, or higher order tensors) that require a specialized storage format for their memory layout. Several such layouts have been proposed in the literature, such as the Coordinate Format, the Compressed Sparse Row format, and many others, that were especially designed to optimally store tensors with specific sparsity properties. How… ▽ More Tensor programs often need to process large tensors (vectors, matrices, or higher order tensors) that require a specialized storage format for their memory layout. Several such layouts have been proposed in the literature, such as the Coordinate Format, the Compressed Sparse Row format, and many others, that were especially designed to optimally store tensors with specific sparsity properties. However, existing tensor processing systems require specialized extensions in order to take advantage of every new storage format. In this paper we describe a system that allows users to define flexible storage formats in a declarative tensor query language, similar to the language used by the tensor program. The programmer only needs to write storage mappings, which describe, in a declarative way, how the tensors are laid out in main memory. Then, we describe a cost-based optimizer that optimizes the tensor program for the specific memory layout. We demonstrate empirically significant performance improvements compared to state-of-the-art tensor processing systems. △ Less

Submitted 12 October, 2022; originally announced October 2022.

arXiv:2207.08891 [pdf, other]

Wink: Deniable Secure Messaging

Authors: Anrin Chakraborti, Darius Suciu, Radu Sion

Abstract: End-to-end encrypted (E2EE) messaging is an essential first step in providing message confidentiality. Unfortunately, all security guarantees of end-to-end encryption are lost when keys or plaintext are disclosed, either due to device compromise or (sometimes lawful) coercion by powerful adversaries. This work introduces Wink, the first plausibly-deniable messaging system protecting message confid… ▽ More End-to-end encrypted (E2EE) messaging is an essential first step in providing message confidentiality. Unfortunately, all security guarantees of end-to-end encryption are lost when keys or plaintext are disclosed, either due to device compromise or (sometimes lawful) coercion by powerful adversaries. This work introduces Wink, the first plausibly-deniable messaging system protecting message confidentiality from partial device compromise and compelled key disclosure. Wink can surreptitiously inject hidden messages in standard random coins (e.g., salts, IVs) used by existing E2EE protocols. It does so as part of legitimate secure cryptographic functionality deployed inside the widely-available trusted execution environment (TEE) TrustZone. This results in hidden communication using virtually unchanged existing E2EE messaging apps, as well as strong plausible deniability. Wink has been demonstrated with multiple existing E2EE applications (including Telegram and Signal) with minimal (external) instrumentation, negligible overheads, and crucially, without changing on-wire message formats. △ Less

Submitted 10 June, 2023; v1 submitted 18 July, 2022; originally announced July 2022.

arXiv:2202.10390 [pdf, other]

Optimizing Recursive Queries with Program Synthesis

Authors: Yisu Remy Wang, Mahmoud Abo Khamis, Hung Q. Ngo, Reinhard Pichler, Dan Suciu

Abstract: Most work on query optimization has concentrated on loop-free queries. However, data science and machine learning workloads today typically involve recursive or iterative computation. In this work, we propose a novel framework for optimizing recursive queries using methods from program synthesis. In particular, we introduce a simple yet powerful optimization rule called the "FGH-rule" which aims t… ▽ More Most work on query optimization has concentrated on loop-free queries. However, data science and machine learning workloads today typically involve recursive or iterative computation. In this work, we propose a novel framework for optimizing recursive queries using methods from program synthesis. In particular, we introduce a simple yet powerful optimization rule called the "FGH-rule" which aims to find a faster way to evaluate a recursive program. The solution is found by making use of powerful tools, such as a program synthesizer, an SMT-solver, and an equality saturation system. We demonstrate the strength of the optimization by showing that the FGH-rule can lead to speedups up to 4 orders of magnitude on three, already optimized Datalog systems. △ Less

Submitted 21 February, 2022; originally announced February 2022.

arXiv:2201.04166 [pdf, other]

Degree Sequence Bound For Join Cardinality Estimation

Authors: Kyle Deeds, Dan Suciu, Magda Balazinska, Walter Cai

Abstract: Recent work has demonstrated the catastrophic effects of poor cardinality estimates on query processing time. In particular, underestimating query cardinality can result in overly optimistic query plans which take orders of magnitude longer to complete than one generated with the true cardinality. Cardinality bounding avoids this pitfall by computing a strict upper bound on the query's output size… ▽ More Recent work has demonstrated the catastrophic effects of poor cardinality estimates on query processing time. In particular, underestimating query cardinality can result in overly optimistic query plans which take orders of magnitude longer to complete than one generated with the true cardinality. Cardinality bounding avoids this pitfall by computing a strict upper bound on the query's output size using statistics about the database such as table sizes and degrees, i.e. value frequencies. In this paper, we extend this line of work by proving a novel bound called the Degree Sequence Bound which takes into account the full degree sequences and the max tuple multiplicity. This bound improves upon previous work incorporating degree constraints which focused on the maximum degree rather than the degree sequence. Further, we describe how to practically compute this bound using a learned approximation of the true degree sequences. △ Less

Submitted 30 March, 2022; v1 submitted 11 January, 2022; originally announced January 2022.

arXiv:2105.14435 [pdf, ps, other]

Convergence of Datalog over (Pre-) Semirings

Authors: Mahmoud Abo Khamis, Hung Q. Ngo, Reinhard Pichler, Dan Suciu, Yisu Remy Wang

Abstract: Recursive queries have been traditionally studied in the framework of datalog, a language that restricts recursion to monotone queries over sets, which is guaranteed to converge in polynomial time in the size of the input. But modern big data systems require recursive computations beyond the Boolean space. In this paper we study the convergence of datalog when it is interpreted over an arbitrary s… ▽ More Recursive queries have been traditionally studied in the framework of datalog, a language that restricts recursion to monotone queries over sets, which is guaranteed to converge in polynomial time in the size of the input. But modern big data systems require recursive computations beyond the Boolean space. In this paper we study the convergence of datalog when it is interpreted over an arbitrary semiring. We consider an ordered semiring, define the semantics of a datalog program as a least fixpoint in this semiring, and study the number of steps required to reach that fixpoint, if ever. We identify algebraic properties of the semiring that correspond to certain convergence properties of datalog programs. Finally, we describe a class of ordered semirings on which one can use the semi-naïve evaluation algorithm on any datalog program. △ Less

Submitted 24 January, 2024; v1 submitted 30 May, 2021; originally announced May 2021.

arXiv:2101.01292 [pdf, other]

GeCo: Quality Counterfactual Explanations in Real Time

Authors: Maximilian Schleich, Zixuan Geng, Yihong Zhang, Dan Suciu

Abstract: Machine learning is increasingly applied in high-stakes decision making that directly affect people's lives, and this leads to an increased demand for systems to explain their decisions. Explanations often take the form of counterfactuals, which consists of conveying to the end user what she/he needs to change in order to improve the outcome. Computing counterfactual explanations is challenging, b… ▽ More Machine learning is increasingly applied in high-stakes decision making that directly affect people's lives, and this leads to an increased demand for systems to explain their decisions. Explanations often take the form of counterfactuals, which consists of conveying to the end user what she/he needs to change in order to improve the outcome. Computing counterfactual explanations is challenging, because of the inherent tension between a rich semantics of the domain, and the need for real time response. In this paper we present GeCo, the first system that can compute plausible and feasible counterfactual explanations in real time. At its core, GeCo relies on a genetic algorithm, which is customized to favor searching counterfactual explanations with the smallest number of changes. To achieve real-time performance, we introduce two novel optimizations: $Δ$-representation of candidate counterfactuals, and partial evaluation of the classifier. We compare empirically GeCo against five other systems described in the literature, and show that it is the only system that can achieve both high quality explanations and real time answers. △ Less

Submitted 18 May, 2021; v1 submitted 4 January, 2021; originally announced January 2021.

Comments: 16 pages, 12 figures, 3 tables, 3 algorithms

arXiv:2011.14482 [pdf, other]

doi 10.46298/lmcs-18(2:6)2022

A Near-Optimal Parallel Algorithm for Joining Binary Relations

Authors: Bas Ketsman, Dan Suciu, Yufei Tao

Abstract: We present a constant-round algorithm in the massively parallel computation (MPC) model for evaluating a natural join where every input relation has two attributes. Our algorithm achieves a load of $\tilde{O}(m/p^{1/ρ})$ where $m$ is the total size of the input relations, $p$ is the number of machines, $ρ$ is the join's fractional edge covering number, and $\tilde{O}(.)$ hides a polylogarithmic fa… ▽ More We present a constant-round algorithm in the massively parallel computation (MPC) model for evaluating a natural join where every input relation has two attributes. Our algorithm achieves a load of $\tilde{O}(m/p^{1/ρ})$ where $m$ is the total size of the input relations, $p$ is the number of machines, $ρ$ is the join's fractional edge covering number, and $\tilde{O}(.)$ hides a polylogarithmic factor. The load matches a known lower bound up to a polylogarithmic factor. At the core of the proposed algorithm is a new theorem (which we name the "isolated cartesian product theorem") that provides fresh insight into the problem's mathematical structure. Our result implies that the subgraph enumeration problem, where the goal is to report all the occurrences of a constant-sized subgraph pattern, can be settled optimally (up to a polylogarithmic factor) in the MPC model. △ Less

Submitted 4 May, 2022; v1 submitted 29 November, 2020; originally announced November 2020.

Journal ref: Logical Methods in Computer Science, Volume 18, Issue 2 (May 5, 2022) lmcs:6944

arXiv:2009.08634 [pdf, ps, other]

On the Tractability of SHAP Explanations

Authors: Guy Van den Broeck, Anton Lykov, Maximilian Schleich, Dan Suciu

Abstract: SHAP explanations are a popular feature-attribution mechanism for explainable AI. They use game-theoretic notions to measure the influence of individual features on the prediction of a machine learning model. Despite a lot of recent interest from both academia and industry, it is not known whether SHAP explanations of common machine learning models can be computed efficiently. In this paper, we es… ▽ More SHAP explanations are a popular feature-attribution mechanism for explainable AI. They use game-theoretic notions to measure the influence of individual features on the prediction of a machine learning model. Despite a lot of recent interest from both academia and industry, it is not known whether SHAP explanations of common machine learning models can be computed efficiently. In this paper, we establish the complexity of computing the SHAP explanation in three important settings. First, we consider fully-factorized data distributions, and show that the complexity of computing the SHAP explanation is the same as the complexity of computing the expected value of the model. This fully-factorized setting is often used to simplify the SHAP computation, yet our results show that the computation can be intractable for commonly used models such as logistic regression. Going beyond fully-factorized distributions, we show that computing SHAP explanations is already intractable for a very simple setting: computing SHAP explanations of trivial classifiers over naive Bayes distributions. Finally, we show that even computing SHAP over the empirical distribution is #P-hard. △ Less

Submitted 30 January, 2021; v1 submitted 18 September, 2020; originally announced September 2020.

Comments: Proceedings of the 35th AAAI Conference on Artificial Intelligence

arXiv:2008.00896 [pdf, other]

A Dichotomy for the Generalized Model Counting Problem for Unions of Conjunctive Queries

Authors: Batya Kenig, Dan Suciu

Abstract: We study the $generalized~model~counting~problem$, defined as follows: given a database, and a set of deterministic tuples, count the number of subsets of the database that include all deterministic tuples and satisfy the query. This problem is computationally equivalent to the evaluation of the query over a tuple-independent probabilistic database where all tuples have probabilities in… ▽ More We study the $generalized~model~counting~problem$, defined as follows: given a database, and a set of deterministic tuples, count the number of subsets of the database that include all deterministic tuples and satisfy the query. This problem is computationally equivalent to the evaluation of the query over a tuple-independent probabilistic database where all tuples have probabilities in $\{0,\frac{1}{2},1\}$. Previous work has established a dichotomy for Unions of Conjunctive Queries (UCQ) when the probabilities are arbitrary rational numbers, showing that, for each query, its complexity is either in polynomial time or #P-hard. The query is called $safe$ in the first case, and $unsafe$ in the second case. Here, we strengthen the hardness proof, by proving that an unsafe UCQ query remains #P-hard even if the probabilities are restricted to $\{0,\frac{1}{2},1\}$. This requires a complete redesign of the hardness proof, using new techniques. A related problem is the $model~counting~problem$, which asks for the probability of the query when the input probabilities are restricted to $\{0,\frac{1}{2}\}$. While our result does not extend to model counting for all unsafe UCQs, we prove that model counting is #P-hard for a class of unsafe queries called Type-I forbidden queries. △ Less

Submitted 20 May, 2021; v1 submitted 3 August, 2020; originally announced August 2020.

arXiv:2004.08783 [pdf, other]

Decision Problems in Information Theory

Authors: Mahmoud Abo Khamis, Phokion G. Kolaitis, Hung Q. Ngo, Dan Suciu

Abstract: Constraints on entropies are considered to be the laws of information theory. Even though the pursuit of their discovery has been a central theme of research in information theory, the algorithmic aspects of constraints on entropies remain largely unexplored. Here, we initiate an investigation of decision problems about constraints on entropies by placing several different such problems into level… ▽ More Constraints on entropies are considered to be the laws of information theory. Even though the pursuit of their discovery has been a central theme of research in information theory, the algorithmic aspects of constraints on entropies remain largely unexplored. Here, we initiate an investigation of decision problems about constraints on entropies by placing several different such problems into levels of the arithmetical hierarchy. We establish the following results on checking the validity over all almost-entropic functions: first, validity of a Boolean information constraint arising from a monotone Boolean formula is co-recursively enumerable; second, validity of "tight" conditional information constraints is in $Π^0_3$. Furthermore, under some restrictions, validity of conditional information constraints "with slack" is in $Σ^0_2$, and validity of information inequality constraints involving max is Turing equivalent to validity of information inequality constraints (with no max involved). We also prove that the classical implication problem for conditional independence statements is co-recursively enumerable. △ Less

Submitted 27 April, 2020; v1 submitted 19 April, 2020; originally announced April 2020.

arXiv:2004.03644 [pdf, other]

Causal Relational Learning

Authors: Babak Salimi, Harsh Parikh, Moe Kayali, Sudeepa Roy, Lise Getoor, Dan Suciu

Abstract: Causal inference is at the heart of empirical research in natural and social sciences and is critical for scientific discovery and informed decision making. The gold standard in causal inference is performing randomized controlled trials; unfortunately these are not always feasible due to ethical, legal, or cost constraints. As an alternative, methodologies for causal inference from observational… ▽ More Causal inference is at the heart of empirical research in natural and social sciences and is critical for scientific discovery and informed decision making. The gold standard in causal inference is performing randomized controlled trials; unfortunately these are not always feasible due to ethical, legal, or cost constraints. As an alternative, methodologies for causal inference from observational data have been developed in statistical studies and social sciences. However, existing methods critically rely on restrictive assumptions such as the study population consisting of homogeneous elements that can be represented in a single flat table, where each row is referred to as a unit. In contrast, in many real-world settings, the study domain naturally consists of heterogeneous elements with complex relational structure, where the data is naturally represented in multiple related tables. In this paper, we present a formal framework for causal inference from such relational data. We propose a declarative language called CaRL for capturing causal background knowledge and assumptions and specifying causal queries using simple Datalog-like rules.CaRL provides a foundation for inferring causality and reasoning about the effect of complex interventions in relational domains. We present an extensive experimental evaluation on real relational data to illustrate the applicability of CaRL in social sciences and healthcare. △ Less

Submitted 7 April, 2020; originally announced April 2020.

arXiv:2003.06868 [pdf, other]

Causality-based Explanation of Classification Outcomes

Authors: Leopoldo Bertossi, Jordan Li, Maximilian Schleich, Dan Suciu, Zografoula Vagena

Abstract: We propose a simple definition of an explanation for the outcome of a classifier based on concepts from causality. We compare it with previously proposed notions of explanation, and study their complexity. We conduct an experimental evaluation with two real datasets from the financial domain. We propose a simple definition of an explanation for the outcome of a classifier based on concepts from causality. We compare it with previously proposed notions of explanation, and study their complexity. We conduct an experimental evaluation with two real datasets from the financial domain. △ Less

Submitted 25 May, 2020; v1 submitted 15 March, 2020; originally announced March 2020.

Comments: 16 pages, 6 figures, 1 table

arXiv:2002.09799 [pdf, other]

Sample Debiasing in the Themis Open World Database System (Extended Version)

Authors: Laurel Orr, Magda Balazinska, Dan Suciu

Abstract: Open world database management systems assume tuples not in the database still exist and are becoming an increasingly important area of research. We present Themis, the first open world database that automatically rebalances arbitrarily biased samples to approximately answer queries as if they were issued over the entire population. We leverage apriori population aggregate information to develop a… ▽ More Open world database management systems assume tuples not in the database still exist and are becoming an increasingly important area of research. We present Themis, the first open world database that automatically rebalances arbitrarily biased samples to approximately answer queries as if they were issued over the entire population. We leverage apriori population aggregate information to develop and combine two different approaches for automatic debiasing: sample reweighting and Bayesian network probabilistic modeling. We build a prototype of Themis and demonstrate that Themis achieves higher query accuracy than the default AQP approach, an alternative sample reweighting technique, and a variety of Bayesian network models while maintaining interactive query response times. We also show that \name is robust to differences in the support between the sample and population, a key use case when using social media samples. △ Less

Submitted 29 February, 2020; v1 submitted 22 February, 2020; originally announced February 2020.

Comments: SIGMOD 2020

arXiv:2002.07951 [pdf, other]

SPORES: Sum-Product Optimization via Relational Equality Saturation for Large Scale Linear Algebra

Authors: Yisu Remy Wang, Shana Hutchison, Jonathan Leang, Bill Howe, Dan Suciu

Abstract: Machine learning algorithms are commonly specified in linear algebra (LA). LA expressions can be rewritten into more efficient forms, by taking advantage of input properties such as sparsity, as well as program properties such as common subexpressions and fusible operators. The complex interaction among these properties' impact on the execution cost poses a challenge to optimizing compilers. Exist… ▽ More Machine learning algorithms are commonly specified in linear algebra (LA). LA expressions can be rewritten into more efficient forms, by taking advantage of input properties such as sparsity, as well as program properties such as common subexpressions and fusible operators. The complex interaction among these properties' impact on the execution cost poses a challenge to optimizing compilers. Existing compilers resort to intricate heuristics that complicate the codebase and add maintenance cost but fail to search through the large space of equivalent LA expressions to find the cheapest one. We introduce a general optimization technique for LA expressions, by converting the LA expressions into Relational Algebra (RA) expressions, optimizing the latter, then converting the result back to (optimized) LA expressions. One major advantage of this method is that it is complete, meaning that any equivalent LA expression can be found using the equivalence rules in RA. The challenge is the major size of the search space, and we address this by adopting and extending a technique used in compilers, called equality saturation. We integrate the optimizer into SystemML and validate it empirically across a spectrum of machine learning tasks; we show that we can derive all existing hand-coded optimizations in SystemML, and perform new optimizations that lead to speedups from 1.2X to 5X. △ Less

Submitted 22 December, 2020; v1 submitted 18 February, 2020; originally announced February 2020.

arXiv:1912.07777 [pdf, other]

Mosaic: A Sample-Based Database System for Open World Query Processing

Authors: Laurel Orr, Samuel Ainsworth, Walter Cai, Kevin Jamieson, Magda Balazinska, Dan Suciu

Abstract: Data scientists have relied on samples to analyze populations of interest for decades. Recently, with the increase in the number of public data repositories, sample data has become easier to access. It has not, however, become easier to analyze. This sample data is arbitrarily biased with an unknown sampling probability, meaning data scientists must manually debias the sample with custom technique… ▽ More Data scientists have relied on samples to analyze populations of interest for decades. Recently, with the increase in the number of public data repositories, sample data has become easier to access. It has not, however, become easier to analyze. This sample data is arbitrarily biased with an unknown sampling probability, meaning data scientists must manually debias the sample with custom techniques to avoid inaccurate results. In this vision paper, we propose Mosaic, a database system that treats samples as first-class citizens and allows users to ask questions over populations represented by these samples. Answering queries over biased samples is non-trivial as there is no existing, standard technique to answer population queries when the sampling probability is unknown. In this paper, we show how our envisioned system solves this problem by having a unique sample-based data model with extensions to the SQL language. We propose how to perform population query answering using biased samples and give preliminary results for one of our novel query answering techniques. △ Less

Submitted 10 January, 2020; v1 submitted 16 December, 2019; originally announced December 2019.

Comments: CIDR 2020

arXiv:1911.12933 [pdf, other]

Mining Approximate Acyclic Schemes from Relations

Authors: Batya Kenig, Pranay Mundra, Guna Prasad, Babak Salimi, Dan Suciu

Abstract: Acyclic schemes have numerous applications in databases and in machine learning, such as improved design, more efficient storage, and increased performance for queries and machine learning algorithms. Multivalued dependencies (MVDs) are the building blocks of acyclic schemes. The discovery from data of both MVDs and acyclic schemes is more challenging than other forms of data dependencies, such as… ▽ More Acyclic schemes have numerous applications in databases and in machine learning, such as improved design, more efficient storage, and increased performance for queries and machine learning algorithms. Multivalued dependencies (MVDs) are the building blocks of acyclic schemes. The discovery from data of both MVDs and acyclic schemes is more challenging than other forms of data dependencies, such as Functional Dependencies, because these dependencies do not hold on subsets of data, and because they are very sensitive to noise in the data; for example a single wrong or missing tuple may invalidate the schema. In this paper we present Maimon, a system for discovering approximate acyclic schemes and MVDs from data. We give a principled definition of approximation, by using notions from information theory, then describe the two components of Maimon: mining for approximate MVDs, then reconstructing acyclic schemes from approximate MVDs. We conduct an experimental evaluation of Maimon on 20 real-world datasets, and show that it can scale up to 1M rows, and up to 30 columns. △ Less

Submitted 28 November, 2019; originally announced November 2019.

arXiv:1911.04948 [pdf, other]

EntropyDB: A Probabilistic Approach to Approximate Query Processing

Authors: Laurel Orr, Magdalena Balazinska, Dan Suciu

Abstract: We present EntropyDB, an interactive data exploration system that uses a probabilistic approach to generate a small, query-able summary of a dataset. Departing from traditional summarization techniques, we use the Principle of Maximum Entropy to generate a probabilistic representation of the data that can be used to give approximate query answers. We develop the theoretical framework and formulati… ▽ More We present EntropyDB, an interactive data exploration system that uses a probabilistic approach to generate a small, query-able summary of a dataset. Departing from traditional summarization techniques, we use the Principle of Maximum Entropy to generate a probabilistic representation of the data that can be used to give approximate query answers. We develop the theoretical framework and formulation of our probabilistic representation and show how to use it to answer queries. We then present solving techniques, give two critical optimizations to improve preprocessing time and query execution time, and explore methods to reduce query error. Lastly, we experimentally evaluate our work using a 5 GB dataset of flights within the United States and a 210 GB dataset from an astronomy particle simulation. While our current work only supports linear queries, we show that our technique can successfully answer queries faster than sampling while introducing, on average, no more error than sampling and can better distinguish between rare and nonexistent values. We also discuss extensions that can allow for data updates and linear queries over joins. △ Less

Submitted 9 November, 2019; originally announced November 2019.

Comments: arXiv admin note: text overlap with arXiv:1703.03856

Journal ref: VLDB Journal 2019

arXiv:1908.07924 [pdf, other]

Data Management for Causal Algorithmic Fairness

Authors: Babak Salimi, Bill Howe, Dan Suciu

Abstract: Fairness is increasingly recognized as a critical component of machine learning systems. However, it is the underlying data on which these systems are trained that often reflects discrimination, suggesting a data management problem. In this paper, we first make a distinction between associational and causal definitions of fairness in the literature and argue that the concept of fairness requires c… ▽ More Fairness is increasingly recognized as a critical component of machine learning systems. However, it is the underlying data on which these systems are trained that often reflects discrimination, suggesting a data management problem. In this paper, we first make a distinction between associational and causal definitions of fairness in the literature and argue that the concept of fairness requires causal reasoning. We then review existing works and identify future opportunities for applying data management techniques to causal algorithmic fairness. △ Less

Submitted 30 September, 2019; v1 submitted 20 August, 2019; originally announced August 2019.

Comments: arXiv admin note: text overlap with arXiv:1902.08283

arXiv:1906.09727 [pdf, ps, other]

Bag Query Containment and Information Theory

Authors: Mahmoud Abo Khamis, Phokion G. Kolaitis, Hung Q. Ngo, Dan Suciu

Abstract: The query containment problem is a fundamental algorithmic problem in data management. While this problem is well understood under set semantics, it is by far less understood under bag semantics. In particular, it is a long-standing open question whether or not the conjunctive query containment problem under bag semantics is decidable. We unveil tight connections between information theory and the… ▽ More The query containment problem is a fundamental algorithmic problem in data management. While this problem is well understood under set semantics, it is by far less understood under bag semantics. In particular, it is a long-standing open question whether or not the conjunctive query containment problem under bag semantics is decidable. We unveil tight connections between information theory and the conjunctive query containment under bag semantics. These connections are established using information inequalities, which are considered to be the laws of information theory. Our first main result asserts that deciding the validity of maxima of information inequalities is many-one equivalent to the restricted case of conjunctive query containment in which the containing query is acyclic; thus, either both these problems are decidable or both are undecidable. Our second main result identifies a new decidable case of the conjunctive query containment problem under bag semantics. Specifically, we give an exponential time algorithm for conjunctive query containment under bag semantics, provided the containing query is chordal and admits a simple junction tree. △ Less

Submitted 5 July, 2021; v1 submitted 24 June, 2019; originally announced June 2019.

arXiv:1902.08283 [pdf, other]

Capuchin: Causal Database Repair for Algorithmic Fairness

Authors: Babak Salimi, Luke Rodriguez, Bill Howe, Dan Suciu

Abstract: Fairness is increasingly recognized as a critical component of machine learning systems. However, it is the underlying data on which these systems are trained that often reflect discrimination, suggesting a database repair problem. Existing treatments of fairness rely on statistical correlations that can be fooled by statistical anomalies, such as Simpson's paradox. Proposals for causality-based d… ▽ More Fairness is increasingly recognized as a critical component of machine learning systems. However, it is the underlying data on which these systems are trained that often reflect discrimination, suggesting a database repair problem. Existing treatments of fairness rely on statistical correlations that can be fooled by statistical anomalies, such as Simpson's paradox. Proposals for causality-based definitions of fairness can correctly model some of these situations, but they require specification of the underlying causal models. In this paper, we formalize the situation as a database repair problem, proving sufficient conditions for fair classifiers in terms of admissible variables as opposed to a complete causal model. We show that these conditions correctly capture subtle fairness violations. We then use these conditions as the basis for database repair algorithms that provide provable fairness guarantees about classifiers trained on their training labels. We evaluate our algorithms on real data, demonstrating improvement over the state of the art on multiple fairness metrics proposed in the literature while retaining high utility. △ Less

Submitted 1 October, 2019; v1 submitted 21 February, 2019; originally announced February 2019.

Journal ref: Proceedings of the 2019 International Conference on Management of Data. ACM, 2019

arXiv:1812.09987 [pdf, other]

doi 10.46298/lmcs-18(1:5)2022

Integrity Constraints Revisited: From Exact to Approximate Implication

Authors: Batya Kenig, Dan Suciu

Abstract: Integrity constraints such as functional dependencies (FD) and multi-valued dependencies (MVD) are fundamental in database schema design. Likewise, probabilistic conditional independences (CI) are crucial for reasoning about multivariate probability distributions. The implication problem studies whether a set of constraints (antecedents) implies another constraint (consequent), and has been invest… ▽ More Integrity constraints such as functional dependencies (FD) and multi-valued dependencies (MVD) are fundamental in database schema design. Likewise, probabilistic conditional independences (CI) are crucial for reasoning about multivariate probability distributions. The implication problem studies whether a set of constraints (antecedents) implies another constraint (consequent), and has been investigated in both the database and the AI literature, under the assumption that all constraints hold exactly. However, many applications today consider constraints that hold only approximately. In this paper we define an approximate implication as a linear inequality between the degree of satisfaction of the antecedents and consequent, and we study the relaxation problem: when does an exact implication relax to an approximate implication? We use information theory to define the degree of satisfaction, and prove several results. First, we show that any implication from a set of data dependencies (MVDs+FDs) can be relaxed to a simple linear inequality with a factor at most quadratic in the number of variables; when the consequent is an FD, the factor can be reduced to 1. Second, we prove that there exists an implication between CIs that does not admit any relaxation; however, we prove that every implication between CIs relaxes "in the limit". Then, we show that the implication problem for differential constraints in market basket analysis also admits a relaxation with a factor equal to 1. Finally, we show how some of the results in the paper can be derived using the I-measure theory, which relates between information theoretic measures and set theory. Our results recover, and sometimes extend, previously known results about the implication problem: the implication of MVDs and FDs can be checked by considering only 2-tuple relations. △ Less

Submitted 10 January, 2022; v1 submitted 24 December, 2018; originally announced December 2018.

Journal ref: Logical Methods in Computer Science, Volume 18, Issue 1 (January 11, 2022) lmcs:6925

arXiv:1810.01997 [pdf, other]

Improving High Contention OLTP Performance via Transaction Scheduling

Authors: Guna Prasaad, Alvin Cheung, Dan Suciu

Abstract: Research in transaction processing has made significant progress in improving the performance of multi-core in-memory transactional systems. However, the focus has mainly been on low-contention workloads. Modern transactional systems perform poorly on workloads with transactions accessing a few highly contended data items. We observe that most transactional workloads, including those with high con… ▽ More Research in transaction processing has made significant progress in improving the performance of multi-core in-memory transactional systems. However, the focus has mainly been on low-contention workloads. Modern transactional systems perform poorly on workloads with transactions accessing a few highly contended data items. We observe that most transactional workloads, including those with high contention, can be divided into clusters of data conflict-free transactions and a small set of residuals. In this paper, we introduce a new concurrency control protocol called Strife that leverages the above observation. Strife executes transactions in batches, where each batch is partitioned into clusters of conflict-free transactions and a small set of residual transactions. The conflict-free clusters are executed in parallel without any concurrency control, followed by executing the residual cluster either serially or with concurrency control. We present a low-overhead algorithm that partitions a batch of transactions into clusters that do not have cross-cluster conflicts and a small residual cluster. We evaluate Strife against the optimistic concurrency control protocol and several variants of two-phase locking, where the latter is known to perform better than other concurrency protocols under high contention, and show that Strife can improve transactional throughput by up to 2x. We also perform an in-depth micro-benchmark analysis to empirically characterize the performance and quality of our clustering algorithm △ Less

Submitted 3 October, 2018; originally announced October 2018.

arXiv:1804.00443 [pdf, ps, other]

A Note on the Hardness of the Critical Tuple Problem

Authors: Egor V. Kostylev, Dan Suciu

Abstract: The notion of critical tuple was introduced by Miklau and Suciu (Gerome Miklau and Dan Suciu. A formal analysis of information disclosure in data exchange. J. Comput. Syst. Sci., 73(3):507-534, 2007), who also claimed that the problem of checking whether a tuple is non-critical is complete for the second level of the polynomial hierarchy. Kostylev identified an error in the 12-page-long hardness p… ▽ More The notion of critical tuple was introduced by Miklau and Suciu (Gerome Miklau and Dan Suciu. A formal analysis of information disclosure in data exchange. J. Comput. Syst. Sci., 73(3):507-534, 2007), who also claimed that the problem of checking whether a tuple is non-critical is complete for the second level of the polynomial hierarchy. Kostylev identified an error in the 12-page-long hardness proof. It turns out that the issue is rather fundamental: the proof can be adapted to show hardness of a relative variant of tuple-non-criticality, but we have neither been able to prove the original claim nor found an algorithm for it of lower complexity. In this note we state formally the relative variant and present an alternative, simplified proof of its hardness; we also give an NP-hardness proof for the original problem, the best lower bound we have been able to show. Hence, the precise complexity of the original critical tuple problem remains open. △ Less

Submitted 2 April, 2018; originally announced April 2018.

arXiv:1803.04562 [pdf, other]

Bias in OLAP Queries: Detection, Explanation, and Removal

Authors: Babak Salimi, Johannes Gehrke, Dan Suciu

Abstract: On line analytical processing (OLAP) is an essential element of decision-support systems. OLAP tools provide insights and understanding needed for improved decision making. However, the answers to OLAP queries can be biased and lead to perplexing and incorrect insights. In this paper, we propose HypDB, a system to detect, explain, and to resolve bias in decision-support queries. We give a simple d… ▽ More On line analytical processing (OLAP) is an essential element of decision-support systems. OLAP tools provide insights and understanding needed for improved decision making. However, the answers to OLAP queries can be biased and lead to perplexing and incorrect insights. In this paper, we propose HypDB, a system to detect, explain, and to resolve bias in decision-support queries. We give a simple definition of a \emph{biased query}, which performs a set of independence tests on the data to detect bias. We propose a novel technique that gives explanations for bias, thus assisting an analyst in understanding what goes on. Additionally, we develop an automated method for rewriting a biased query into an unbiased query, which shows what the analyst intended to examine. In a thorough evaluation on several real datasets we show both the quality and the performance of our techniques, including the completely automatic discovery of the revolutionary insights from a famous 1973 discrimination case. △ Less

Submitted 24 July, 2018; v1 submitted 12 March, 2018; originally announced March 2018.

Comments: This paper is an extended version of a paper presented at SIGMOD 2018

arXiv:1802.02229 [pdf, other]

Axiomatic Foundations and Algorithms for Deciding Semantic Equivalences of SQL Queries

Authors: Shumo Chu, Brendan Murphy, Jared Roesch, Alvin Cheung, Dan Suciu

Abstract: Deciding the equivalence of SQL queries is a fundamental problem in data management. As prior work has mainly focused on studying the theoretical limitations of the problem, very few implementations for checking such equivalences exist. In this paper, we present a new formalism and implementation for reasoning about the equivalences of SQL queries. Our formalism, U-semiring, extends SQL's semiring… ▽ More Deciding the equivalence of SQL queries is a fundamental problem in data management. As prior work has mainly focused on studying the theoretical limitations of the problem, very few implementations for checking such equivalences exist. In this paper, we present a new formalism and implementation for reasoning about the equivalences of SQL queries. Our formalism, U-semiring, extends SQL's semiring semantics with unbounded summation and duplicate elimination. U-semiring is defined using only very few axioms and can thus be easily implemented using proof assistants such as Coq for automated query reasoning. Yet, they are sufficient enough to enable us reason about sophisticated SQL queries that are evaluated over bags and sets, along with various integrity constraints. To evaluate the effectiveness of U-semiring, we have used it to formally verify 39 query rewrite rules from both classical data management research papers and real-world SQL engines, where many of them have never been proven correct before. △ Less

Submitted 23 May, 2018; v1 submitted 6 February, 2018; originally announced February 2018.

arXiv:1712.07445 [pdf, ps, other]

Boolean Tensor Decomposition for Conjunctive Queries with Negation

Authors: Mahmoud Abo Khamis, Hung Q. Ngo, Dan Olteanu, Dan Suciu

Abstract: We propose an algorithm for answering conjunctive queries with negation, where the negated relations have bounded degree. Its data complexity matches that of the best known algorithms for the positive subquery of the input query and is expressed in terms of the fractional hypertree width and the submodular width. The query complexity depends on the structure of the negated subquery; in general it… ▽ More We propose an algorithm for answering conjunctive queries with negation, where the negated relations have bounded degree. Its data complexity matches that of the best known algorithms for the positive subquery of the input query and is expressed in terms of the fractional hypertree width and the submodular width. The query complexity depends on the structure of the negated subquery; in general it is exponential in the number of join variables occurring in negated relations yet it becomes polynomial for several classes of queries. This algorithm relies on several contributions. We show how to rewrite queries with negation on bounded-degree relations into equivalent conjunctive queries with not-all-equal (NAE) predicates, which are a multi-dimensional analog of disequality (not-equal). We then generalize the known color-coding technique to conjunctions of NAE predicates and explain it via a Boolean tensor decomposition of conjunctions of NAE predicates. This decomposition can be achieved via a probabilistic construction that can be derandomized efficiently. △ Less

Submitted 27 January, 2019; v1 submitted 20 December, 2017; originally announced December 2017.

arXiv:1703.07342 [pdf, other]

doi 10.1145/3070607.3070608

LaraDB: A Minimalist Kernel for Linear and Relational Algebra Computation

Authors: Dylan Hutchison, Bill Howe, Dan Suciu

Abstract: Analytics tasks manipulate structured data with variants of relational algebra (RA) and quantitative data with variants of linear algebra (LA). The two computational models have overlapping expressiveness, motivating a common programming model that affords unified reasoning and algorithm design. At the logical level we propose Lara, a lean algebra of three operators, that expresses RA and LA as we… ▽ More Analytics tasks manipulate structured data with variants of relational algebra (RA) and quantitative data with variants of linear algebra (LA). The two computational models have overlapping expressiveness, motivating a common programming model that affords unified reasoning and algorithm design. At the logical level we propose Lara, a lean algebra of three operators, that expresses RA and LA as well as relevant optimization rules. We show a series of proofs that position Lara %formal and informal at just the right level of expressiveness for a middleware algebra: more explicit than MapReduce but more general than RA or LA. At the physical level we find that the Lara operators afford efficient implementations using a single primitive that is available in a variety of backend engines: range scans over partitioned sorted maps. To evaluate these ideas, we implemented the Lara operators as range iterators in Apache Accumulo, a popular implementation of Google's BigTable. First we show how Lara expresses a sensor quality control task, and we measure the performance impact of optimizations Lara admits on this task. Second we show that the LaraDB implementation outperforms Accumulo's native MapReduce integration on a core task involving join and aggregation in the form of matrix multiply, especially at smaller scales that are typically a poor fit for scale-out approaches. We find that LaraDB offers a conceptually lean framework for optimizing mixed-abstraction analytics tasks, without giving up fast record-level updates and scans. △ Less

Submitted 13 April, 2017; v1 submitted 21 March, 2017; originally announced March 2017.

Comments: 10 pages, to appear in the BeyondMR workshop at the 2017 ACM SIGMOD conference

arXiv:1703.03856 [pdf, other]

Probabilistic Database Summarization for Interactive Data Exploration

Authors: Laurel Orr, Magda Balazinska, Dan Suciu

Abstract: We present a probabilistic approach to generate a small, query-able summary of a dataset for interactive data exploration. Departing from traditional summarization techniques, we use the Principle of Maximum Entropy to generate a probabilistic representation of the data that can be used to give approximate query answers. We develop the theoretical framework and formulation of our probabilistic rep… ▽ More We present a probabilistic approach to generate a small, query-able summary of a dataset for interactive data exploration. Departing from traditional summarization techniques, we use the Principle of Maximum Entropy to generate a probabilistic representation of the data that can be used to give approximate query answers. We develop the theoretical framework and formulation of our probabilistic representation and show how to use it to answer queries. We then present solving techniques and give three critical optimizations to improve preprocessing time and query accuracy. Lastly, we experimentally evaluate our work using a 5 GB dataset of flights within the United States and a 210 GB dataset from an astronomy particle simulation. While our current work only supports linear queries, we show that our technique can successfully answer queries faster than sampling while introducing, on average, no more error than sampling and can better distinguish between rare and nonexistent values. △ Less

Submitted 23 May, 2017; v1 submitted 10 March, 2017; originally announced March 2017.

Comments: To appear VLDB 2017

arXiv:1701.09007 [pdf, other]

Research Directions for Principles of Data Management (Dagstuhl Perspectives Workshop 16151)

Authors: Serge Abiteboul, Marcelo Arenas, Pablo Barceló, Meghyn Bienvenu, Diego Calvanese, Claire David, Richard Hull, Eyke Hüllermeier, Benny Kimelfeld, Leonid Libkin, Wim Martens, Tova Milo, Filip Murlak, Frank Neven, Magdalena Ortiz, Thomas Schwentick, Julia Stoyanovich, Jianwen Su, Dan Suciu, Victor Vianu, Ke Yi

Abstract: In April 2016, a community of researchers working in the area of Principles of Data Management (PDM) joined in a workshop at the Dagstuhl Castle in Germany. The workshop was organized jointly by the Executive Committee of the ACM Symposium on Principles of Database Systems (PODS) and the Council of the International Conference on Database Theory (ICDT). The mission of this workshop was to identify… ▽ More In April 2016, a community of researchers working in the area of Principles of Data Management (PDM) joined in a workshop at the Dagstuhl Castle in Germany. The workshop was organized jointly by the Executive Committee of the ACM Symposium on Principles of Database Systems (PODS) and the Council of the International Conference on Database Theory (ICDT). The mission of this workshop was to identify and explore some of the most important research directions that have high relevance to society and to Computer Science today, and where the PDM community has the potential to make significant contributions. This report describes the family of research directions that the workshop focused on from three perspectives: potential practical relevance, results already obtained, and research questions that appear surmountable in the short and medium term. △ Less

Submitted 31 January, 2017; originally announced January 2017.

arXiv:1612.02503 [pdf, ps, other]

What do Shannon-type Inequalities, Submodular Width, and Disjunctive Datalog have to do with one another?

Authors: Mahmoud Abo Khamis, Hung Q. Ngo, Dan Suciu

Abstract: Recent works on bounding the output size of a conjunctive query with functional dependencies and degree constraints have shown a deep connection between fundamental questions in information theory and database theory. We prove analogous output bounds for disjunctive datalog rules, and answer several open questions regarding the tightness and looseness of these bounds along the way. Our bounds are… ▽ More Recent works on bounding the output size of a conjunctive query with functional dependencies and degree constraints have shown a deep connection between fundamental questions in information theory and database theory. We prove analogous output bounds for disjunctive datalog rules, and answer several open questions regarding the tightness and looseness of these bounds along the way. Our bounds are intimately related to Shannon-type information inequalities. We devise the notion of a "proof sequence" of a specific class of Shannon-type information inequalities called "Shannon flow inequalities". We then show how such a proof sequence can be interpreted as symbolic instructions guiding an algorithm called "PANDA", which answers disjunctive datalog rules within the time that the size bound predicted. We show that PANDA can be used as a black-box to devise algorithms matching precisely the fractional hypertree width and the submodular width runtimes for aggregate and conjunctive queries with functional dependencies and degree constraints. Our results improve upon known results in three ways. First, our bounds and algorithms are for the much more general class of disjunctive datalog rules, of which conjunctive queries are a special case. Second, the runtime of PANDA matches precisely the submodular width bound, while the previous algorithm by Marx has a runtime that is polynomial in this bound. Third, our bounds and algorithms work for queries with input cardinality bounds, functional dependencies, and degree constraints. Overall, our results show a deep connection between three seemingly unrelated lines of research; and, our results on proof sequences for Shannon flow inequalities might be of independent interest. △ Less

Submitted 23 December, 2023; v1 submitted 7 December, 2016; originally announced December 2016.

arXiv:1609.03540 [pdf, ps, other]

ZaliQL: A SQL-Based Framework for Drawing Causal Inference from Big Data

Authors: Babak Salimi, Dan Suciu

Abstract: Causal inference from observational data is a subject of active research and development in statistics and computer science. Many toolkits have been developed for this purpose that depends on statistical software. However, these toolkits do not scale to large datasets. In this paper we describe a suite of techniques for expressing causal inference tasks from observational data in SQL. This suite s… ▽ More Causal inference from observational data is a subject of active research and development in statistics and computer science. Many toolkits have been developed for this purpose that depends on statistical software. However, these toolkits do not scale to large datasets. In this paper we describe a suite of techniques for expressing causal inference tasks from observational data in SQL. This suite supports the state-of-the-art methods for causal inference and run at scale within a database engine. In addition, we introduce several optimization techniques that significantly speedup causal inference, both in the online and offline setting. We evaluate the quality and performance of our techniques by experiments of real datasets. △ Less

Submitted 12 September, 2016; v1 submitted 12 September, 2016; originally announced September 2016.

arXiv:1607.04822 [pdf, other]

HoTTSQL: Proving Query Rewrites with Univalent SQL Semantics

Authors: Shumo Chu, Konstantin Weitz, Alvin Cheung, Dan Suciu

Abstract: Every database system contains a query optimizer that performs query rewrites. Unfortunately, developing query optimizers remains a highly challenging task. Part of the challenges comes from the intricacies and rich features of query languages, which makes reasoning about rewrite rules difficult. In this paper, we propose a machine-checkable denotational semantics for SQL, the de facto language fo… ▽ More Every database system contains a query optimizer that performs query rewrites. Unfortunately, developing query optimizers remains a highly challenging task. Part of the challenges comes from the intricacies and rich features of query languages, which makes reasoning about rewrite rules difficult. In this paper, we propose a machine-checkable denotational semantics for SQL, the de facto language for relational database, for rigorously validating rewrite rules. Unlike previously proposed semantics that are either non-mechanized or only cover a small amount of SQL language features, our semantics covers all major features of SQL, including bags, correlated subqueries, aggregation, and indexes. Our mechanized semantics, called HoTTSQL, is based on K-Relations and homotopy type theory, where we denote relations as mathematical functions from tuples to univalent types. We have implemented HoTTSQL in Coq, which takes only fewer than 300 lines of code and have proved a wide range of SQL rewrite rules, including those from database research literature (e.g., magic set rewrites) and real-world query optimizers (e.g., subquery elimination). Several of these rewrite rules have never been previously proven correct. In addition, while query equivalence is generally undecidable, we have implemented an automated decision procedure using HoTTSQL for conjunctive queries: a well-studied decidable fragment of SQL that encompasses many real-world queries. △ Less

Submitted 5 August, 2016; v1 submitted 16 July, 2016; originally announced July 2016.

arXiv:1604.03607 [pdf, other]

Lara: A Key-Value Algebra underlying Arrays and Relations

Authors: Dylan Hutchison, Bill Howe, Dan Suciu

Abstract: Data processing systems roughly group into families such as relational, array, graph, and key-value. Many data processing tasks exceed the capabilities of any one family, require data stored across families, or run faster when partitioned onto multiple families. Discovering ways to execute computation among multiple available systems, let alone discovering an optimal execution plan, is challenging… ▽ More Data processing systems roughly group into families such as relational, array, graph, and key-value. Many data processing tasks exceed the capabilities of any one family, require data stored across families, or run faster when partitioned onto multiple families. Discovering ways to execute computation among multiple available systems, let alone discovering an optimal execution plan, is challenging given semantic differences between disparate families of systems. In this paper we introduce a new algebra, Lara, which underlies and unifies algebras representing the families above in order to facilitate translation between systems. We describe the operations and objects of Lara---union, join, and ext on associative tables---and show her properties and equivalences to other algebras. Multi-system optimization has a bright future, in which we proffer Lara for the role of universal connector. △ Less

Submitted 12 April, 2016; originally announced April 2016.

Comments: Working draft

arXiv:1604.01848 [pdf, other]

Worst-Case Optimal Algorithms for Parallel Query Processing

Authors: Paul Beame, Paraschos Koutris, Dan Suciu

Abstract: In this paper, we study the communication complexity for the problem of computing a conjunctive query on a large database in a parallel setting with $p$ servers. In contrast to previous work, where upper and lower bounds on the communication were specified for particular structures of data (either data without skew, or data with specific types of skew), in this work we focus on worst-case analysis… ▽ More In this paper, we study the communication complexity for the problem of computing a conjunctive query on a large database in a parallel setting with $p$ servers. In contrast to previous work, where upper and lower bounds on the communication were specified for particular structures of data (either data without skew, or data with specific types of skew), in this work we focus on worst-case analysis of the communication cost. The goal is to find worst-case optimal parallel algorithms, similar to the work of [18] for sequential algorithms. We first show that for a single round we can obtain an optimal worst-case algorithm. The optimal load for a conjunctive query $q$ when all relations have size equal to $M$ is $O(M/p^{1/ψ^*})$, where $ψ^*$ is a new query-related quantity called the edge quasi-packing number, which is different from both the edge packing number and edge cover number of the query hypergraph. For multiple rounds, we present algorithms that are optimal for several classes of queries. Finally, we show a surprising connection to the external memory model, which allows us to translate parallel algorithms to external memory algorithms. This technique allows us to recover (within a polylogarithmic factor) several recent results on the I/O complexity for computing join queries, and also obtain optimal algorithms for other classes of queries. △ Less

Submitted 6 April, 2016; originally announced April 2016.

Showing 1–50 of 78 results for author: Suciu, D