Survey of Results on the ModPath and ModCycle Problems

Abstract: This note summarizes the state of what is known about the tractability of the problem ModPath, which asks if an input undirected graph contains a simple st-path whose length satisfies modulo constraints. We also consider the problem ModCycle, which asks for the existence of a simple cycle subject to such constraints. We also discuss the status of these problems on directed graphs, and on restricted classes of graphs. We explain connections to the problem variant asking for a constant vertex-disjoint number of such paths or cycles, and discuss links to other related work.

Submitted 1 September, 2024; originally announced September 2024.

Comments: 11 pages. Unpublished note surveying existing work
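
As an illustration of the ModPath problem statement above, here is a minimal brute-force sketch. It assumes that the modulo constraint has the form "length congruent to r modulo q"; the function name `has_mod_path` and the adjacency-list encoding are hypothetical choices made for this example and do not come from the note. The search takes exponential time in general, so it only serves to make the definition concrete.

```python
def has_mod_path(adj, s, t, q, r):
    """Return True if some simple s-t path has length congruent to r modulo q.

    adj maps each vertex to the list of its neighbours (undirected graph).
    Exhaustive search over simple paths: exponential time in the worst case.
    """
    def dfs(v, visited, length):
        if v == t:
            # A simple s-t path ends as soon as it reaches t.
            return length % q == r
        for w in adj[v]:
            if w not in visited:
                visited.add(w)
                if dfs(w, visited, length + 1):
                    return True
                visited.remove(w)
        return False

    return dfs(s, {s}, 0)


# Hypothetical example graph: the 4-cycle 0-1-2-3-0.
cycle4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(has_mod_path(cycle4, 0, 2, 2, 0))  # both 0-2 paths have length 2
```

In this example graph, the simple paths from 0 to 1 have lengths 1 and 3, so a query with q = 3 succeeds for residues 1 and 0 only.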

arXiv:2407.01127 [pdf, other]

Tractable Circuits in Database Theory

Authors: Antoine Amarilli, Florent Capelli

Abstract: This work reviews how database theory uses tractable circuit classes from knowledge compilation. We present relevant query evaluation tasks, and notions of tractable circuits. We then show how these tractable circuits can be used to address database tasks. We first focus on Boolean provenance and its applications for aggregation tasks, in particular probabilistic query evaluation. We study these for Monadic Second Order (MSO) queries on trees, and for safe Conjunctive Queries (CQs) and Union of Conjunctive Queries (UCQs). We also study circuit representations of query answers, and their applications to enumeration tasks: both in the Boolean setting (for MSO) and the multivalued setting (for CQs and UCQs).

Submitted 25 August, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

Comments: 15 pages including 12 pages of main text

arXiv:2404.09674 [pdf, ps, other]

A Circus of Circuits: Connections Between Decision Diagrams, Circuits, and Automata

Authors: Antoine Amarilli, Marcelo Arenas, YooJung Choi, Mikaël Monet, Guy Van den Broeck, Benjie Wang

Abstract: This document is an introduction to two related formalisms to define Boolean functions: binary decision diagrams, and Boolean circuits. It presents these formalisms and several of their variants studied in the setting of knowledge compilation. Last, it explains how these formalisms can be connected to the notions of automata over words and trees.

Submitted 15 April, 2024; originally announced April 2024.

Comments: 26 pages

arXiv:2401.16210 [pdf, ps, other]

The Non-Cancelling Intersections Conjecture

Authors: Antoine Amarilli, Mikaël Monet, Dan Suciu

Abstract: In this note, we present a conjecture on intersections of set families, and a rephrasing of the conjecture in terms of principal downsets of Boolean lattices. The conjecture informally states that, whenever we can express the measure of a union of sets in terms of the measure of some of their intersections using the inclusion-exclusion formula, then we can express the union as a set from these same intersections via the set operations of disjoint union and subset complement. We also present a partial result towards establishing the conjecture.

Submitted 29 January, 2024; originally announced January 2024.

Comments: 30 pages

arXiv:2310.00731 [pdf, other]

Ranked Enumeration for MSO on Trees via Knowledge Compilation

Authors: Antoine Amarilli, Pierre Bourhis, Florent Capelli, Mikaël Monet

Abstract: We study the problem of enumerating the satisfying assignments for circuit classes from knowledge compilation, where assignments are ranked in a specific order. In particular, we show how this problem can be used to efficiently perform ranked enumeration of the answers to MSO queries over trees, with the order being given by a ranking function satisfying a subset-monotonicity property. Assuming that the number of variables is constant, we show that we can enumerate the satisfying assignments in ranked order for so-called multivalued circuits that are smooth, decomposable, and in negation normal form (smooth multivalued DNNF). There is no preprocessing and the enumeration delay is linear in the size of the circuit times the number of values, plus a logarithmic term in the number of assignments produced so far. If we further assume that the circuit is deterministic (smooth multivalued d-DNNF), we can achieve linear-time preprocessing in the circuit, and the delay only features the logarithmic term.

Submitted 22 January, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

Comments: 26 pages; this is the authors' version of the corresponding ICDT'24 article

arXiv:2309.13287 [pdf, other]

doi 10.4230/LIPIcs.ICDT.2024.15

Conjunctive Queries on Probabilistic Graphs: The Limits of Approximability

Authors: Antoine Amarilli, Timothy van Bremen, Kuldeep S. Meel

Abstract: Query evaluation over probabilistic databases is a notoriously intractable problem -- not only in combined complexity, but for many natural queries in data complexity as well. This motivates the study of probabilistic query evaluation through the lens of approximation algorithms, and particularly of combined FPRASes, whose runtime is polynomial in both the query and instance size. In this paper, we focus on tuple-independent probabilistic databases over binary signatures, which can be equivalently viewed as probabilistic graphs. We study in which cases we can devise combined FPRASes for probabilistic query evaluation in this setting. We settle the complexity of this problem for a variety of query and instance classes, by proving both approximability and (conditional) inapproximability results. This allows us to deduce many corollaries of possible independent interest. For example, we show how the results of Arenas et al. on counting fixed-length strings accepted by an NFA imply the existence of an FPRAS for the two-terminal network reliability problem on directed acyclic graphs: this was an open problem until now. We also show that one cannot extend a recent result of van Bremen and Meel that gives a combined FPRAS for self-join-free conjunctive queries of bounded hypertree width on probabilistic databases: neither the bounded-hypertree-width condition nor the self-join-freeness hypothesis can be relaxed. Finally, we complement all our inapproximability results with unconditional lower bounds, showing that DNNF provenance circuits must have at least moderately exponential size in combined complexity.

Submitted 9 April, 2024; v1 submitted 23 September, 2023; originally announced September 2023.

Comments: 20 pages. This article is identical to the ICDT'24 publication up to minor changes (including the correction of a mistake in the proof of Proposition 4.1)

arXiv:2304.06155 [pdf, other]

Skyline Operators for Document Spanners

Authors: Antoine Amarilli, Benny Kimelfeld, Sébastien Labbé, Stefan Mengel

Abstract: When extracting a relation of spans (intervals) from a text document, a common practice is to filter out tuples of the relation that are deemed dominated by others. The domination rule is defined as a partial order that varies along different systems and tasks. For example, we may state that a tuple is dominated by tuples which extend it by assigning additional attributes, or assigning larger intervals. The result of filtering the relation would then be the skyline according to this partial order. As this filtering may remove most of the extracted tuples, we study whether we can improve the performance of the extraction by compiling the domination rule into the extractor. To this aim, we introduce the skyline operator for declarative information extraction tasks expressed as document spanners. We show that this operator can be expressed via regular operations when the domination partial order can itself be expressed as a regular spanner, which covers several natural domination rules. Yet, we show that the skyline operator incurs a computational cost (under combined complexity). First, there are cases where the operator requires an exponential blowup on the number of states needed to represent the spanner as a sequential variable-set automaton. Second, the evaluation may become computationally hard. Our analysis more precisely identifies classes of domination rules for which the combined complexity is tractable or intractable.

Submitted 4 March, 2024; v1 submitted 12 April, 2023; originally announced April 2023.

Comments: 42 pages. This is the full version of the ICDT'24 publication, which includes all reviewer feedback; the main body is identical to the ICDT'24 article up to minor changes

arXiv:2302.03461 [pdf, other]

Degree-3 Planar Graphs as Topological Minors of Wall Graphs in Polynomial Time

Authors: Antoine Amarilli

Abstract: In this note, we give a proof of the fact that we can efficiently find degree-3 planar graphs as topological minors of sufficiently large wall graphs. The result is needed as an intermediate step to fix a proof in my PhD thesis.

Submitted 22 February, 2023; v1 submitted 7 February, 2023; originally announced February 2023.

Comments: V2: Updated to fix an error in the proof pointed out by Mikaël Monet. V3: Updated to point out alternative and simpler proof route following https://cstheory.stackexchange.com/a/52489

arXiv:2212.11362 [pdf, ps, other]

Tighter bounds for query answering with Guarded TGDs

Authors: Antoine Amarilli, Michael Benedikt

Abstract: We consider the complexity of the open-world query answering problem, where we wish to determine certain answers to conjunctive queries over incomplete datasets specified by an initial set of facts and a set of Guarded TGDs. This problem has been well-studied in the literature and is decidable but with a high complexity, namely, it is 2EXPTIME-complete. Further, the complexity shrinks by one exponential when the arity is fixed. We show in this paper how we can obtain better complexity bounds when considering separately the arity of the guard atom and that of the additional atoms, called the side signature. Our results make use of the technique of linearizing Guarded TGDs, introduced in a paper of Gottlob, Manna, and Pieris. Specifically, we present a variant of the linearization process, making use of a restricted version of the chase that we recently introduced. Our results imply that open-world query answering can be solved in EXPTIME with arbitrary-arity guard relations if we simply bound the arity of the side signature; and that the complexity drops to NP if we fix the side signature and bound the head arity and width of the dependencies.

Submitted 8 January, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

Comments: arXiv admin note: text overlap with arXiv:1706.07936

arXiv:2209.14878 [pdf, other]

Enumerating Regular Languages with Bounded Delay

Authors: Antoine Amarilli, Mikaël Monet

Abstract: We study the task, for a given language $L$, of enumerating the (generally infinite) sequence of its words, without repetitions, while bounding the delay between two consecutive words. To allow for delay bounds that do not depend on the current word length, we assume a model where we produce each word by editing the preceding word with a small edit script, rather than writing out the word from scratch. In particular, this witnesses that the language is orderable, i.e., we can write its words as an infinite sequence such that the Levenshtein edit distance between any two consecutive words is bounded by a value that depends only on the language. For instance, $(a+b)^*$ is orderable (with a variant of the Gray code), but $a^* + b^*$ is not. We characterize which regular languages are enumerable in this sense, and show that this can be decided in PTIME in an input deterministic finite automaton (DFA) for the language. In fact, we show that, given a DFA $A$, we can compute in PTIME automata $A_1, \ldots, A_t$ such that $L(A)$ is partitioned as $L(A_1) \sqcup \ldots \sqcup L(A_t)$ and every $L(A_i)$ is orderable in this sense. Further, we show that the value of $t$ obtained is optimal, i.e., we cannot partition $L(A)$ into less than $t$ orderable languages. In the case where $L(A)$ is orderable (i.e., $t=1$), we show that the ordering can be produced by a bounded-delay algorithm: specifically, the algorithm runs in a suitable pointer machine model, and produces a sequence of bounded-length edit scripts to visit the words of $L(A)$ without repetitions, with bounded delay -- exponential in $|A|$ -- between each script. In fact, we show that we can achieve this while only allowing the edit operations push and pop at the beginning and end of the word, which implies that the word can in fact be maintained in a double-ended queue.

Submitted 7 January, 2023; v1 submitted 29 September, 2022; originally announced September 2022.

Comments: This is the full version with proofs of the STACS'23 article

arXiv:2209.11177 [pdf, other]

Uniform Reliability for Unbounded Homomorphism-Closed Graph Queries

Authors: Antoine Amarilli

Abstract: We study the uniform query reliability problem, which asks, for a fixed Boolean query Q, given an instance I, how many subinstances of I satisfy Q. Equivalently, this is a restricted case of Boolean query evaluation on tuple-independent probabilistic databases where all facts must have probability 1/2. We focus on graph signatures, and on queries closed under homomorphisms. We show that for any such query that is unbounded, i.e., not equivalent to a union of conjunctive queries, the uniform reliability problem is #P-hard. This recaptures the hardness, e.g., of s-t connectedness, which counts how many subgraphs of an input graph have a path between a source and a sink. This new hardness result on uniform reliability strengthens our earlier hardness result on probabilistic query evaluation for unbounded homomorphism-closed queries (ICDT'20). Indeed, our earlier proof crucially used facts with probability 1, so it did not apply to the unweighted case. The new proof presented in this paper avoids this; it uses our recent hardness result on uniform reliability for non-hierarchical conjunctive queries without self-joins (ICDT'21), along with new techniques.

Submitted 17 January, 2023; v1 submitted 22 September, 2022; originally announced September 2022.

Comments: Full version with proofs of the ICDT'23 article
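
To make the uniform reliability problem from the abstract above concrete, here is a small brute-force sketch for s-t connectedness. It assumes that the instance is a list of directed edges; the function name `count_connected_subgraphs` and the example graph are hypothetical. The enumeration visits all 2^m edge subsets, which is consistent with the problem being #P-hard, and is not the technique of the paper.

```python
from itertools import combinations


def count_connected_subgraphs(edges, s, t):
    """Count the subsets of `edges` (directed pairs) in which t is reachable from s."""
    count = 0
    for k in range(len(edges) + 1):
        for subset in combinations(edges, k):
            # Plain graph search restricted to the edges kept in this subset.
            reached, frontier = {s}, [s]
            while frontier:
                v = frontier.pop()
                for (a, b) in subset:
                    if a == v and b not in reached:
                        reached.add(b)
                        frontier.append(b)
            if t in reached:
                count += 1
    return count


# Hypothetical instance: a direct edge s->t plus a two-edge detour through a.
example_edges = [("s", "a"), ("a", "t"), ("s", "t")]
print(count_connected_subgraphs(example_edges, "s", "t"))
```

Dividing the count by 2^m gives the probability of the query when every fact has probability 1/2, which is the equivalent formulation mentioned in the abstract.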

arXiv:2205.04224 [pdf, ps, other]

Worst-case Analysis for Interactive Evaluation of Boolean Provenance

Authors: Antoine Amarilli, Yael Amsterdamer

Abstract: In recent work, we have introduced a framework for fine-grained consent management in databases, which combines Boolean data provenance with the field of interactive Boolean evaluation. In turn, interactive Boolean evaluation aims at unveiling the underlying truth value of a Boolean expression by frugally probing the truth values of individual values. The required number of probes depends on the Boolean provenance structure and on the (a-priori unknown) probe answers. Prior work has analyzed and aimed to optimize the expected number of probes, where expectancy is with respect to a probability distribution over probe answers. This paper gives a novel worst-case analysis for the problem, inspired by the decision tree depth of Boolean functions. Specifically, we introduce a notion of evasive provenance expressions, namely expressions, where one may need to probe all variables in the worst case. We show that read-once expressions are evasive, and identify an additional class of expressions (acyclic monotone 2-DNF) for which evasiveness may be decided in PTIME. As for the more general question of finding the optimal strategy, we show that it is coNP-hard in general. We are still able to identify a sub-class of provenance expressions that is "far from evasive", namely, where an optimal worst-case strategy probes only log(n) out of the n variables in the expression, and show that we can find this optimal strategy in polynomial time.

Submitted 9 May, 2022; originally announced May 2022.

arXiv:2205.00851 [pdf, other]

Weighted Counting of Matchings in Unbounded-Treewidth Graph Families

Authors: Antoine Amarilli, Mikaël Monet

Abstract: We consider a weighted counting problem on matchings, denoted $\textrm{PrMatching}(\mathcal{G})$, on an arbitrary fixed graph family $\mathcal{G}$. The input consists of a graph $G\in \mathcal{G}$ and of rational probabilities of existence on every edge of $G$, assuming independence. The output is the probability of obtaining a matching of $G$ in the resulting distribution, i.e., a set of edges that are pairwise disjoint. It is known that, if $\mathcal{G}$ has bounded treewidth, then $\textrm{PrMatching}(\mathcal{G})$ can be solved in polynomial time. In this paper we show that, under some assumptions, bounded treewidth in fact characterizes the tractable graph families for this problem. More precisely, we show intractability for all graph families $\mathcal{G}$ satisfying the following treewidth-constructibility requirement: given an integer $k$ in unary, we can construct in polynomial time a graph $G \in \mathcal{G}$ with treewidth at least $k$. Our hardness result is then the following: for any treewidth-constructible graph family $\mathcal{G}$, the problem $\textrm{PrMatching}(\mathcal{G})$ is intractable. This generalizes known hardness results for weighted matching counting under some restrictions that do not bound treewidth, e.g., being planar, 3-regular, or bipartite; it also answers a question left open in Amarilli, Bourhis and Senellart (PODS'16). We also obtain a similar lower bound for the weighted counting of edge covers.

Submitted 7 January, 2023; v1 submitted 2 May, 2022; originally announced May 2022.

Comments: This is the full version with proofs of the MFCS'22 article
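
The input and output of PrMatching described in the abstract above can be illustrated with the following brute-force sketch. The function name `pr_matching` and the `(u, v, p)` edge encoding are assumptions made for this example, and the graph is assumed to have no self-loops. The computation sums over all edge subsets, so it takes exponential time and is unrelated to the bounded-treewidth algorithm or to the hardness proof.

```python
from fractions import Fraction
from itertools import product


def pr_matching(edges):
    """Probability that independently kept edges form a matching.

    edges is a list of (u, v, p), where p is the rational probability
    (a Fraction) that the edge {u, v} is kept.
    """
    total = Fraction(0)
    for keep in product([False, True], repeat=len(edges)):
        weight = Fraction(1)
        used = set()          # endpoints already covered by a kept edge
        is_matching = True
        for (u, v, p), kept in zip(edges, keep):
            weight *= p if kept else 1 - p
            if kept:
                if u in used or v in used:
                    is_matching = False
                used.update((u, v))
        if is_matching:
            total += weight
    return total


# Hypothetical instance: the path a - b - c with probability 1/2 on each edge.
half = Fraction(1, 2)
path_edges = [("a", "b", half), ("b", "c", half)]
print(pr_matching(path_edges))
```

Exact rational arithmetic via `Fraction` matches the abstract's assumption of rational edge probabilities and avoids floating-point rounding.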

arXiv:2202.08555 [pdf, ps, other]

Query Answering with Transitive and Linear-Ordered Data

Authors: Antoine Amarilli, Michael Benedikt, Pierre Bourhis, Michael Vanden Boom

Abstract: We consider entailment problems involving powerful constraint languages such as frontier-guarded existential rules in which we impose additional semantic restrictions on a set of distinguished relations. We consider restricting a relation to be transitive, restricting a relation to be the transitive closure of another relation, and restricting a relation to be a linear order. We give some natural variants of guardedness that allow inference to be decidable in each case, and isolate the complexity of the corresponding decision problems. Finally we show that slight changes in these conditions lead to undecidability.

Submitted 17 February, 2022; originally announced February 2022.

Comments: This article was originally published at JAIR in 2018: https://www.jair.org/index.php/jair/article/view/11240 (DOI 10.1613/jair.1.11240). This version of the paper includes one modification from the publisher version: we fix an incorrect proof for one of our undecidability results (Theorem 6.2). arXiv admin note: substantial text overlap with arXiv:1607.00813

arXiv:2201.00549 [pdf, ps, other]

doi 10.1145/3517804.3526232

Efficient Enumeration Algorithms for Annotated Grammars

Authors: Antoine Amarilli, Louis Jachiet, Martín Muñoz, Cristian Riveros

Abstract: We introduce annotated grammars, an extension of context-free grammars which allows annotations on terminals. Our model extends the standard notion of regular spanners, and is more expressive than the extraction grammars recently introduced by Peterfreund. We study the enumeration problem for annotated grammars: fixing a grammar, and given a string as input, enumerate all annotations of the string that form a word derivable from the grammar. Our first result is an algorithm for unambiguous annotated grammars, which preprocesses the input string in cubic time and enumerates all annotations with output-linear delay. This improves over Peterfreund's result, which needs quintic time preprocessing to achieve this delay bound. We then study how we can reduce the preprocessing time while keeping the same delay bound, by making additional assumptions on the grammar. Specifically, we present a class of grammars which only have one derivation shape for all outputs, for which we can enumerate with quadratic time preprocessing. We also give classes that generalize regular spanners for which linear time preprocessing suffices.

Submitted 17 May, 2022; v1 submitted 3 January, 2022; originally announced January 2022.

Comments: 54 pages. Full version with proofs of the article to appear at PODS'22. Except formatting and minor differences, this article contains all the contents of the PODS'22 article, plus the technical appendices

arXiv:2102.07728 [pdf, other]

Dynamic Membership for Regular Languages

Authors: Antoine Amarilli, Louis Jachiet, Charles Paperman

Abstract: We study the dynamic membership problem for regular languages: fix a language L, read a word w, build in time O(|w|) a data structure indicating if w is in L, and maintain this structure efficiently under letter substitutions on w. We consider this problem on the unit cost RAM model with logarithmic word length, where the problem always has a solution in O(log |w| / log log |w|) per operation. We show that the problem is in O(log log |w|) for languages in an algebraically-defined, decidable class QSG, and that it is in O(1) for another such class QLZG. We show that languages not in QSG admit a reduction from the prefix problem for a cyclic group, so that they require Ω(log |w| / log log |w|) operations in the worst case; and that QSG languages not in QLZG admit a reduction from the prefix problem for the multiplicative monoid U_1 = {0, 1}, which we conjecture cannot be maintained in O(1). This yields a conditional trichotomy. We also investigate intermediate cases between O(1) and O(log log |w|). Our results are shown via the dynamic word problem for monoids and semigroups, for which we also give a classification. We thus solve open problems of the paper of Skovbjerg Frandsen, Miltersen, and Skyum [30] on the dynamic word problem, and additionally cover regular languages.

Submitted 4 June, 2021; v1 submitted 15 February, 2021; originally announced February 2021.

Comments: 34 pages. This is the full version with proofs of the ICALP'21 article
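
The dynamic membership setting from the abstract above can be illustrated with a simple baseline sketch. It assumes the language is given as a complete DFA and stores, in a segment tree, the transition function induced by each factor of the word; each substitution then costs O(log |w|) compositions. This is a textbook-style baseline, weaker than the O(log |w| / log log |w|) general bound and the faster bounds stated in the abstract, and the class name `DynamicMembership` and its methods are hypothetical.

```python
class DynamicMembership:
    """Maintain whether a word is accepted by a DFA under letter substitutions.

    Each tree node stores the transition function of a factor of the word,
    as a tuple mapping each state index to the state reached after the factor.
    """

    def __init__(self, delta, initial, accepting, word):
        self.delta = delta            # delta[letter] = tuple of next states
        self.initial = initial
        self.accepting = accepting
        num_states = len(next(iter(delta.values())))
        identity = tuple(range(num_states))
        self.size = 1
        while self.size < max(len(word), 1):
            self.size *= 2
        # Unused leaves hold the identity function, which is neutral.
        self.tree = [identity] * (2 * self.size)
        for i, letter in enumerate(word):
            self.tree[self.size + i] = delta[letter]
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = self._compose(self.tree[2 * node], self.tree[2 * node + 1])

    @staticmethod
    def _compose(f, g):
        # Apply f (left factor) first, then g (right factor).
        return tuple(g[q] for q in f)

    def substitute(self, position, letter):
        """Replace the letter at `position` and update the path to the root."""
        node = self.size + position
        self.tree[node] = self.delta[letter]
        node //= 2
        while node >= 1:
            self.tree[node] = self._compose(self.tree[2 * node], self.tree[2 * node + 1])
            node //= 2

    def accepted(self):
        return self.tree[1][self.initial] in self.accepting


# Hypothetical DFA: words over {a, b} with an even number of a's.
even_a = {"a": (1, 0), "b": (0, 1)}
structure = DynamicMembership(even_a, 0, {0}, "abab")
print(structure.accepted())
```

Building the tree takes a number of compositions linear in |w|, in line with the O(|w|) preprocessing required by the problem statement; each composition costs time proportional to the number of DFA states, which is a constant for a fixed language.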

arXiv:2102.07724 [pdf, other]

doi 10.46298/lmcs-19(4:4)2023

Locality and Centrality: The Variety ZG

Authors: Antoine Amarilli, Charles Paperman

Abstract: We study the variety ZG of monoids where the elements that belong to a group are central, i.e., commute with all other elements. We show that ZG is local, that is, the semidirect product ZG * D of ZG by definite semigroups is equal to LZG, the variety of semigroups where all local monoids are in ZG. Our main result is thus: ZG * D = LZG. We prove this result using Straubing's delay theorem, by considering paths in the category of idempotents. In the process, we obtain the characterization ZG = MNil \vee Com, and also characterize the ZG languages, i.e., the languages whose syntactic monoid is in ZG: they are precisely the languages that are finite unions of disjoint shuffles of singleton languages and regular commutative languages.

Submitted 17 October, 2023; v1 submitted 15 February, 2021; originally announced February 2021.

Journal ref: Logical Methods in Computer Science, Volume 19, Issue 4 (October 18, 2023) lmcs:11555

arXiv:2003.07316 [pdf, other]

Equivalent Rewritings on Path Views with Binding Patterns

Authors: Julien Romero, Nicoleta Preda, Antoine Amarilli, Fabian Suchanek

Abstract: A view with a binding pattern is a parameterized query on a database. Such views are used, e.g., to model Web services. To answer a query on such views, the views have to be orchestrated together in execution plans. We show how queries can be rewritten into equivalent execution plans, which are guaranteed to deliver the same results as the query on all databases. We provide a correct and complete algorithm to find these plans for path views and atomic queries. Finally, we show that our method can be used to answer queries on real-world Web services.

Submitted 19 March, 2020; v1 submitted 16 March, 2020; originally announced March 2020.

Comments: 33 pages including 16 pages of main text. This is the full version of the ESWC'2020 article, which integrates all reviewer feedback, with the same text as the publisher version except minor changes. Several corrections relative to the first version

arXiv:2003.02576 [pdf, ps, other]

doi 10.1145/3436487

Constant-Delay Enumeration for Nondeterministic Document Spanners

Authors: Antoine Amarilli, Pierre Bourhis, Stefan Mengel, Matthias Niewerth

Abstract: We consider the information extraction framework known as document spanners, and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is to have an algorithm which is tractable in combined complexity, i.e., in the sizes of the input document and the VA; while ensuring the best possible data complexity bounds in the input document size, i.e., constant delay in the document size. Several recent works at PODS'18 proposed such algorithms but with linear delay in the document size or with an exponential dependency in size of the (generally nondeterministic) input VA. In particular, Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic sequential VA and an input document, we can enumerate the mappings of the VA on the document with the following bounds: the preprocessing is linear in the document size and polynomial in the size of the VA, and the delay is independent of the document and polynomial in the size of the VA. The resulting algorithm thus achieves tractability in combined complexity and the best possible data complexity bounds. Moreover, it is rather easy to describe, in particular for the restricted case of so-called extended VAs. Finally, we evaluate our algorithm empirically using a prototype implementation.

Submitted 7 December, 2020; v1 submitted 5 March, 2020; originally announced March 2020.

Comments: 29 pages. Extended version of arXiv:1807.09320. Integrates all corrections following reviewer feedback. Outside of some minor formatting differences and tweaks, this paper is the same as the paper to appear in the ACM TODS journal

arXiv:2003.02521 [pdf, ps, other]

Finite Open-World Query Answering with Number Restrictions

Authors: Antoine Amarilli, Michael Benedikt

Abstract: Open-world query answering is the problem of deciding, given a set of facts, conjunction of constraints, and query, whether the facts and constraints imply the query. This amounts to reasoning over all instances that include the facts and satisfy the constraints. We study finite open-world query answering (FQA), which assumes that the underlying world is finite and thus only considers the finite completions of the instance. The major known decidable cases of FQA derive from the following: the guarded fragment of first-order logic, which can express referential constraints (data in one place points to data in another) but cannot express number restrictions such as functional dependencies; and the guarded fragment with number restrictions but on a signature of arity only two. In this paper, we give the first decidability results for FQA that combine both referential constraints and number restrictions for arbitrary signatures: we show that, for unary inclusion dependencies and functional dependencies, the finiteness assumption of FQA can be lifted up to taking the finite implication closure of the dependencies. Our result relies on new techniques to construct finite universal models of such constraints, for any bound on the maximal query size.

Submitted 5 March, 2020; originally announced March 2020.

Comments: 70 pages. Extended journal version of arXiv:1505.04216. This article is the same as what will be published in ToCL, except for publisher-induced changes, minor changes, and reordering of the material (in the ToCL version some detailed proofs are moved from the article body to an appendix)

arXiv:1910.02048 [pdf, other]

doi 10.46298/lmcs-18(1:2)2022

The Dichotomy of Evaluating Homomorphism-Closed Queries on Probabilistic Graphs

Authors: Antoine Amarilli, İsmail İlkan Ceylan

Abstract: We study the problem of query evaluation on probabilistic graphs, namely, tuple-independent probabilistic databases over signatures of arity two. We focus on the class of queries closed under homomorphisms, or, equivalently, the infinite unions of conjunctive queries. Our main result states that the probabilistic query evaluation problem is #P-hard for all unbounded queries from this class. As bounded queries from this class are equivalent to a union of conjunctive queries, they are already classified by the dichotomy of Dalvi and Suciu (2012). Hence, our result and theirs imply a complete data complexity dichotomy, between polynomial time and #P-hardness, on evaluating homomorphism-closed queries over probabilistic graphs. This dichotomy covers in particular all fragments of infinite unions of conjunctive queries over arity-two signatures, such as negation-free (disjunctive) Datalog, regular path queries, and a large class of ontology-mediated queries. The dichotomy also applies to a restricted case of probabilistic query evaluation called generalized model counting, where fact probabilities must be 0, 0.5, or 1. We show the main result by reducing from the problem of counting the valuations of positive partitioned 2-DNF formulae, or from the source-to-target reliability problem in an undirected graph, depending on properties of minimal models for the query.

Submitted 6 January, 2022; v1 submitted 4 October, 2019; originally announced October 2019.

Journal ref: Logical Methods in Computer Science, Volume 18, Issue 1 (January 7, 2022) lmcs:7065

arXiv:1908.07093 [pdf, other]

doi 10.46298/lmcs-18(4:3)2022

Uniform Reliability of Self-Join-Free Conjunctive Queries

Authors: Antoine Amarilli, Benny Kimelfeld

Abstract: The reliability of a Boolean Conjunctive Query (CQ) over a tuple-independent probabilistic database is the probability that the CQ is satisfied when the tuples of the database are sampled one by one, independently, with their associated probability. For queries without self-joins (repeated relation symbols), the data complexity of this problem is fully characterized by a known dichotomy: reliability can be computed in polynomial time for hierarchical queries, and is #P-hard for non-hierarchical queries. Inspired by this dichotomy, we investigate a fundamental counting problem for CQs without self-joins: how many sets of facts from the input database satisfy the query? This is equivalent to the uniform case of the query reliability problem, where the probability of every tuple is required to be 1/2. Of course, for hierarchical queries, uniform reliability is solvable in polynomial time, like the reliability problem. We show that being hierarchical is also necessary for this tractability (under conventional complexity assumptions). In fact, we establish a generalization of the dichotomy that covers every restricted case of reliability in which the probabilities of tuples are determined by their relation.

Submitted 8 November, 2022; v1 submitted 19 August, 2019; originally announced August 2019.

Comments: Extended version of the ICDT'21 paper

Journal ref: Logical Methods in Computer Science, Volume 18, Issue 4 (November 9, 2022) lmcs:10088
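
The counting problem stated in the abstract above (how many sets of facts from the input database satisfy the query?) can be illustrated by brute force on a tiny instance. The sketch below is only an illustration under stated assumptions: the query Q :- R(x), S(x, y), the four-fact database, and all function names are hypothetical and not taken from the paper, and the exponential enumeration is not the polynomial-time method that exists for hierarchical queries.

```python
from itertools import combinations

def satisfies(facts):
    """Check the hypothetical Boolean CQ  Q :- R(x), S(x, y)  on a set of facts.

    Each fact is a pair (relation_name, tuple_of_arguments).
    """
    r_values = {args[0] for (rel, args) in facts if rel == "R"}
    return any(rel == "S" and args[0] in r_values for (rel, args) in facts)

def count_satisfying_subsets(database):
    """Count the subsets of the database that satisfy the query (exponential time)."""
    facts = list(database)
    count = 0
    for k in range(len(facts) + 1):
        for subset in combinations(facts, k):
            if satisfies(subset):
                count += 1
    return count

def uniform_reliability(database):
    """Probability that the query holds when every fact is kept with probability 1/2."""
    return count_satisfying_subsets(database) / 2 ** len(database)

# Hypothetical example database with four facts.
DB = [("R", ("a",)), ("R", ("b",)), ("S", ("a", "c")), ("S", ("b", "c"))]

if __name__ == "__main__":
    print(count_satisfying_subsets(DB), uniform_reliability(DB))
```

Dividing the count by 2^n, where n is the number of facts, gives the uniform reliability, which matches the equivalence mentioned in the abstract.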

arXiv:1906.00311 [pdf, other]

Smoothing Structured Decomposable Circuits

Authors: Andy Shih, Guy Van den Broeck, Paul Beame, Antoine Amarilli

Abstract: We study the task of smoothing a circuit, i.e., ensuring that all children of a plus-gate mention the same variables. Circuits serve as the building blocks of state-of-the-art inference algorithms on discrete probabilistic graphical models and probabilistic programs. They are also important for discrete density estimation algorithms. Many of these tasks require the input circuit to be smooth. However, smoothing has not been studied in its own right yet, and only a trivial quadratic algorithm is known. This paper studies efficient smoothing for structured decomposable circuits. We propose a near-linear time algorithm for this task and explore lower bounds for smoothing decomposable circuits, using existing results on range-sum queries. Further, for the important case of All-Marginals, we show a more efficient linear-time algorithm. We validate experimentally the performance of our methods.

Submitted 28 October, 2019; v1 submitted 1 June, 2019; originally announced June 2019.

Journal ref: 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada
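
To make the notion of smoothness from the abstract above concrete, here is a naive sketch that pads the children of each plus-gate with gates of the form (v + not v) for every missing variable v. The tuple encoding of circuits and every function name are assumptions made for illustration; this is a simple baseline, not the near-linear algorithm proposed in the paper.

```python
def variables(node):
    """Return the set of variables mentioned below a node.

    Hypothetical encoding: ("lit", name, polarity), ("plus", [children]),
    or ("times", [children]).
    """
    if node[0] == "lit":
        return frozenset([node[1]])
    return frozenset().union(*(variables(child) for child in node[1]))

def is_smooth(node):
    """A circuit is smooth if all children of every plus-gate mention the same variables."""
    if node[0] == "lit":
        return True
    if not all(is_smooth(child) for child in node[1]):
        return False
    if node[0] == "plus":
        return len({variables(child) for child in node[1]}) <= 1
    return True

def smooth(node):
    """Naive smoothing: multiply each plus-child by (v + not v) for each missing v."""
    if node[0] == "lit":
        return node
    children = [smooth(child) for child in node[1]]
    if node[0] == "times":
        return ("times", children)
    scope = frozenset().union(*(variables(child) for child in children))
    padded = []
    for child in children:
        missing = sorted(scope - variables(child))
        if missing:
            fillers = [("plus", [("lit", v, True), ("lit", v, False)]) for v in missing]
            child = ("times", [child] + fillers)
        padded.append(child)
    return ("plus", padded)

# Hypothetical example: x + (not x * y) is not smooth, since y is missing on the left.
circuit = ("plus", [("lit", "x", True),
                    ("times", [("lit", "x", False), ("lit", "y", True)])])
smoothed = smooth(circuit)
```

Because each filler gate (v + not v) is always true, multiplying a child by it leaves the function computed by the circuit unchanged while making the variable sets of sibling children agree.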

arXiv:1812.09519 [pdf, ps, other]

doi 10.1145/3294052.3319702

Enumeration on Trees with Tractable Combined Complexity and Efficient Updates

Authors: Antoine Amarilli, Pierre Bourhis, Stefan Mengel, Matthias Niewerth

Abstract: We give an algorithm to enumerate the results on trees of monadic second-order (MSO) queries represented by nondeterministic tree automata. After linear time preprocessing (in the input tree), we can enumerate answers with linear delay (in each answer). We allow updates on the tree to take place at any time, and we can then restart the enumeration after logarithmic time in the tree. Further, all our combined complexities are polynomial in the automaton. Our result follows our previous circuit-based enumeration algorithms based on deterministic tree automata, and is also inspired by our earlier result on words and nondeterministic sequential extended variable-set automata in the context of document spanners. We extend these results and combine them with a recent tree balancing scheme by Niewerth, so that our enumeration structure supports updates to the underlying tree in logarithmic time (with leaf insertions, leaf deletions, and node relabelings). Our result implies that, for MSO queries with free first-order variables, we can enumerate the results with linear preprocessing and constant-delay and update the underlying tree in logarithmic time, which improves on several known results for words and trees. Building on lower bounds from data structure research, we also show unconditionally that up to a doubly logarithmic factor the update time of our algorithm is optimal. Thus, unlike other settings, there can be no algorithm with constant update time.

Submitted 27 August, 2019; v1 submitted 22 December, 2018; originally announced December 2018.

Comments: 16 pages of main material, 37 references, 11 pages of appendix. This is the extended version with proofs of the PODS'19 paper. Except for minor rephrasings and formatting differences, the contents are exactly the same as the version published in the PODS'19 proceedings

arXiv:1811.02944 [pdf, ps, other]

doi 10.1007/s00224-019-09930-2

Connecting Knowledge Compilation Classes and Width Parameters

Authors: Antoine Amarilli, Florent Capelli, Mikaël Monet, Pierre Senellart

Abstract: The field of knowledge compilation establishes the tractability of many tasks by studying how to compile them to Boolean circuit classes obeying some requirements such as structuredness, decomposability, and determinism. However, in other settings such as intensional query evaluation on databases, we obtain Boolean circuits that satisfy some width bounds, e.g., they have bounded treewidth or pathwidth. In this work, we give a systematic picture of many circuit classes considered in knowledge compilation and show how they can be systematically connected to width measures, through upper and lower bounds. Our upper bounds show that bounded-treewidth circuits can be constructively converted to d-SDNNFs, in time linear in the circuit size and singly exponential in the treewidth; and that bounded-pathwidth circuits can similarly be converted to uOBDDs. We show matching lower bounds on the compilation of monotone DNF or CNF formulas to structured targets, assuming a constant bound on the arity (size of clauses) and degree (number of occurrences of each variable): any d-SDNNF (resp., SDNNF) for such a DNF (resp., CNF) must be of exponential size in its treewidth, and the same holds for uOBDDs (resp., n-OBDDs) when considering pathwidth. Unlike most previous work, our bounds apply to any formula of this class, not just a well-chosen family. Hence, we show that pathwidth and treewidth respectively characterize the efficiency of compiling monotone DNFs to uOBDDs and d-SDNNFs with compilation being singly exponential in the corresponding width parameter.
We also show that our lower bounds on CNFs extend to unstructured compilation targets, with an exponential lower bound in the treewidth (resp., pathwidth) when compiling monotone CNFs of constant arity and degree to DNNFs (resp., nFBDDs).

Submitted 20 July, 2019; v1 submitted 7 November, 2018; originally announced November 2018.

Comments: 46 pages. Extended version of arXiv:1709.06188. Up to the stylesheet, page/environment numbering, minor formatting, and publisher-induced changes, this is the exact content of the paper in Theory of Computing Systems <https://link.springer.com/article/10.1007%2Fs00224-019-09930-2>. The difference in the titles (missing "and") is an error introduced by the publisher

ACM Class: H.2

arXiv:1810.07822 [pdf, other]

doi 10.46298/lmcs-18(2:14)2022

When Can We Answer Queries Using Result-Bounded Data Interfaces?

Authors: Antoine Amarilli, Michael Benedikt

Abstract: We consider answering queries on data available through access methods, that provide lookup access to the tuples matching a given binding. Such interfaces are common on the Web; further, they often have bounds on how many results they can return, e.g., because of pagination or rate limits. We thus study result-bounded methods, which may return only a limited number of tuples. We study how to decide if a query is answerable using result-bounded methods, i.e., how to compute a plan that returns all answers to the query using the methods, assuming that the underlying data satisfies some integrity constraints. We first show how to reduce answerability to a query containment problem with constraints. Second, we show "schema simplification" theorems describing when and how result-bounded services can be used. Finally, we use these theorems to give decidability and complexity results about answerability for common constraint classes.

Submitted 1 June, 2022; v1 submitted 17 October, 2018; originally announced October 2018.

Comments: journal version of the PODS'18 paper arXiv:1706.07936

Journal ref: Logical Methods in Computer Science, Volume 18, Issue 2 (June 2, 2022) lmcs:4903

arXiv:1808.04663 [pdf, other]

doi 10.1007/s00224-018-9901-2

Evaluating Datalog via Tree Automata and Cycluits

Authors: Antoine Amarilli, Pierre Bourhis, Mikaël Monet, Pierre Senellart

Abstract: We investigate parameterizations of both database instances and queries that make query evaluation fixed-parameter tractable in combined complexity. We show that clique-frontier-guarded Datalog with stratified negation (CFG-Datalog) enjoys bilinear-time evaluation on structures of bounded treewidth for programs of bounded rule size. Such programs capture in particular conjunctive queries with simplicial decompositions of bounded width, guarded negation fragment queries of bounded CQ-rank, or two-way regular path queries. Our result is shown by translating to alternating two-way automata, whose semantics is defined via cyclic provenance circuits (cycluits) that can be tractably evaluated.

Submitted 29 May, 2019; v1 submitted 14 August, 2018; originally announced August 2018.

Comments: 56 pages, 63 references. Journal version of "Combined Tractability of Query Evaluation via Tree Automata and Cycluits (Extended Version)" at arXiv:1612.04203. Up to the stylesheet, page/environment numbering, and possible minor publisher-induced changes, this is the exact content of the journal paper that will appear in Theory of Computing Systems. Update wrt version 1: latest reviewer feedback

arXiv:1807.09320 [pdf, other]

doi 10.4230/LIPIcs.ICDT.2019.19

Constant-Delay Enumeration for Nondeterministic Document Spanners

Authors: Antoine Amarilli, Pierre Bourhis, Stefan Mengel, Matthias Niewerth

Abstract: We consider the information extraction framework known as document spanners, and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is to have an algorithm which is tractable in combined complexity, i.e., in the sizes of the input document and the VA; while ensuring the best possible data complexity bounds in the input document size, i.e., constant delay in the document size. Several recent works at PODS'18 proposed such algorithms but with linear delay in the document size or with an exponential dependency in size of the (generally nondeterministic) input VA. In particular, Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic sequential VA and an input document, we can enumerate the mappings of the VA on the document with the following bounds: the preprocessing is linear in the document size and polynomial in the size of the VA, and the delay is independent of the document and polynomial in the size of the VA. The resulting algorithm thus achieves tractability in combined complexity and the best possible data complexity bounds.
Moreover, it is rather easy to describe, in particular for the restricted case of so-called extended VAs.

Submitted 7 December, 2020; v1 submitted 24 July, 2018; originally announced July 2018.

Comments: 25 pages including 17 pages of main material. Integrates all reviewer feedback. This paper is exactly the same as the ICDT'19 paper except that it contains 6 pages of technical appendix, and except that we corrected some additional minor mistakes following reviews of the journal version (arXiv:2003.02576). We recommend reading the journal version instead of this paper

arXiv:1801.06396 [pdf, ps, other]

Computing Possible and Certain Answers over Order-Incomplete Data

Authors: Antoine Amarilli, Mouhamadou Lamine Ba, Daniel Deutch, Pierre Senellart

Abstract: This paper studies the complexity of query evaluation for databases whose relations are partially ordered; the problem commonly arises when combining or transforming ordered data from multiple sources. We focus on queries in a useful fragment of SQL, namely positive relational algebra with aggregates, whose bag semantics we extend to the partially ordered setting. Our semantics leads to the study of two main computational problems: the possibility and certainty of query answers. We show that these problems are respectively NP-complete and coNP-complete, but identify tractable cases depending on the query operators or input partial orders. We further introduce a duplicate elimination operator and study its effect on the complexity results.

Submitted 29 May, 2019; v1 submitted 19 January, 2018; originally announced January 2018.

Comments: 55 pages, 56 references. Extended journal version of arXiv:1707.07222. Up to the stylesheet, page/environment numbering, and possible minor publisher-induced changes, this is the exact content of the journal paper that will appear in Theoretical Computer Science

arXiv:1709.06188 [pdf, other]

doi 10.4230/LIPIcs.ICDT.2018.6

Connecting Width and Structure in Knowledge Compilation (Extended Version)

Authors: Antoine Amarilli, Mikaël Monet, Pierre Senellart

Abstract: Several query evaluation tasks can be done via knowledge compilation: the query result is compiled as a lineage circuit from which the answer can be determined. For such tasks, it is important to leverage some width parameters of the circuit, such as bounded treewidth or pathwidth, to convert the circuit to structured classes, e.g., deterministic structured NNFs (d-SDNNFs) or OBDDs. In this work, we show how to connect the width of circuits to the size of their structured representation, through upper and lower bounds. For the upper bound, we show how bounded-treewidth circuits can be converted to a d-SDNNF, in time linear in the circuit size. Our bound, unlike existing results, is constructive and only singly exponential in the treewidth. We show a related lower bound on monotone DNF or CNF formulas, assuming a constant bound on the arity (size of clauses) and degree (number of occurrences of each variable). Specifically, any d-SDNNF (resp., SDNNF) for such a DNF (resp., CNF) must be of exponential size in its treewidth; and the same holds for pathwidth when compiling to OBDDs. Our lower bounds, in contrast with most previous work, apply to any formula of this class, not just a well-chosen family. Hence, for our language of DNF and CNF, pathwidth and treewidth respectively characterize the efficiency of compiling to OBDDs and (d-)SDNNFs, that is, compilation is singly exponential in the width parameter. We conclude by applying our lower bound results to the task of query evaluation.

Submitted 15 December, 2022; v1 submitted 18 September, 2017; originally announced September 2017.

Comments: 33 pages, no figures, 40 references. This is the full version with proofs of the corresponding ICDT'18 publication, and it integrates all reviewer feedback. Except for the additional appendices, and except for formatting differences and inessential changes, the contents are the same as in the conference version. Fixed in version 4 a minor omission in the proof of Theorem 33 and small typos

arXiv:1709.06185 [pdf, other]

doi 10.4230/LIPIcs.ICDT.2018.5

Enumeration on Trees under Relabelings

Authors: Antoine Amarilli, Pierre Bourhis, Stefan Mengel

Abstract: We study how to evaluate MSO queries with free variables on trees, within the framework of enumeration algorithms. Previous work has shown how to enumerate answers with linear-time preprocessing and delay linear in the size of each output, i.e., constant-delay for free first-order variables. We extend this result to support relabelings, a restricted kind of update operations on trees which allows us to change the node labels. Our main result shows that we can enumerate the answers of MSO queries on trees with linear-time preprocessing and delay linear in each answer, while supporting node relabelings in logarithmic time. To prove this, we reuse the circuit-based enumeration structure from our earlier work, and develop techniques to maintain its index under node relabelings. We also show how enumeration under relabelings can be applied to evaluate practical query languages, such as aggregate, group-by, and parameterized queries.

Submitted 31 May, 2018; v1 submitted 18 September, 2017; originally announced September 2017.

Comments: 37 pages including appendix, 31 references. This is the full version with proofs of the corresponding ICDT'18 publication, and it integrates all reviewer feedback. Except for the additional appendices, the contents are exactly the same as in the conference version

arXiv:1707.07222 [pdf, other]

doi 10.4230/LIPIcs.TIME.2017.4

Possible and Certain Answers for Queries over Order-Incomplete Data

Authors: Antoine Amarilli, Mouhamadou Lamine Ba, Daniel Deutch, Pierre Senellart

Abstract: To combine and query ordered data from multiple sources, one needs to handle uncertainty about the possible orderings. Examples of such "order-incomplete" data include integrated event sequences such as log entries, lists of properties (e.g., hotels and restaurants) ranked by an unknown function reflecting relevance or customer ratings, and documents edited concurrently with an uncertain order on edits. This paper introduces a query language for order-incomplete data, based on the positive relational algebra with order-aware accumulation. We use partial orders to represent order-incomplete data, and study possible and certain answers for queries in this context. We show that these problems are respectively NP-complete and coNP-complete, but identify many tractable cases depending on the query operators or input partial orders.

Submitted 26 January, 2018; v1 submitted 22 July, 2017; originally announced July 2017.

Comments: This paper is the full version with appendices of the TIME'17 article. See also the upcoming journal version: arXiv:1801.06396. Important note: This version (version 2) removes some results because we found a bug in their proofs. See Appendix G for detailed explanations. The journal version also omits the affected results (and does not contain Appendix G)

arXiv:1707.04310 [pdf, other]

doi 10.4230/LIPIcs.ICALP.2018.115

Topological Sorting under Regular Constraints

Authors: Antoine Amarilli, Charles Paperman

Abstract: We introduce the constrained topological sorting problem (CTS): given a regular language K and a directed acyclic graph G with labeled vertices, determine if G has a topological sort that forms a word in K. This natural problem applies to several settings, e.g., scheduling with costs or verifying concurrent programs. We consider the problem CTS[K] where the target language K is fixed, and study its complexity depending on K. We show that CTS[K] is tractable when K falls in several language families, e.g., unions of monomials, which can be used for pattern matching. However, we show that CTS[K] is NP-hard for K = (ab)^* and introduce a shuffle reduction technique to show hardness for more languages. We also study the special case of the constrained shuffle problem (CSh), where the input graph is a disjoint union of strings, and show that CSh[K] is additionally tractable when K is a group language or a union of district group monomials. We conjecture that a dichotomy should hold on the complexity of CTS[K] or CSh[K] depending on K, and substantiate this by proving a coarser dichotomy under a different problem phrasing which ensures that tractable languages are closed under common operators.

Submitted 30 April, 2018; v1 submitted 13 July, 2017; originally announced July 2017.

Comments: 45 pages, 31 references in the main text. This is the full version with proofs of the ICALP'18 paper, and is the same as the ICALP proceedings version up to minor publisher-dependent changes. Several important changes with respect to version 1, including fixing some errors. Title changed with respect to version 2
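
The CTS problem defined in the abstract above can be stated operationally with a brute-force check: enumerate the topological sorts of the labeled DAG and test whether any of them spells a word of K. The sketch below assumes K is given as a Python regular expression and uses hypothetical names (`cts`, `is_topological_sort`) and toy inputs; it takes factorial time and only illustrates the problem statement, not any algorithm from the paper.

```python
import re
from itertools import permutations

def is_topological_sort(order, edges):
    """Check that every edge (u, v) of the DAG goes forward in the given order."""
    position = {vertex: index for index, vertex in enumerate(order)}
    return all(position[u] < position[v] for (u, v) in edges)

def cts(labels, edges, pattern):
    """Brute-force CTS: does some topological sort spell a word matching `pattern`?

    labels  -- dict mapping each vertex to its one-letter label
    edges   -- list of directed edges (u, v) of a DAG over those vertices
    pattern -- regular expression standing in for the language K
    """
    regex = re.compile(pattern)
    for order in permutations(labels):
        if is_topological_sort(order, edges):
            word = "".join(labels[vertex] for vertex in order)
            if regex.fullmatch(word):
                return True
    return False

# Hypothetical toy instances for K = (ab)^*.
LABELS = {1: "a", 2: "a", 3: "b", 4: "b"}
TWO_CHAINS = [(1, 3), (2, 4)]          # two strings "ab" and "ab"
ONE_CHAIN = [(1, 2), (2, 3), (3, 4)]   # the single string "aabb"

if __name__ == "__main__":
    print(cts(LABELS, TWO_CHAINS, "(ab)*"), cts(LABELS, ONE_CHAIN, "(ab)*"))
```

The first toy instance is a disjoint union of strings, so it is also an instance of the constrained shuffle problem (CSh) mentioned in the abstract.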

arXiv:1706.07936 [pdf, ps, other]

doi 10.1145/3196959.3196965

When Can We Answer Queries Using Result-Bounded Data Interfaces?

Authors: Antoine Amarilli, Michael Benedikt

Abstract: We consider answering queries where the underlying data is available only over limited interfaces which provide lookup access to the tuples matching a given binding, but possibly restricting the number of output tuples returned. Interfaces imposing such "result bounds" are common in accessing data via the web. Given a query over a set of relations as well as some integrity constraints that relate the queried relations to the data sources, we examine the problem of deciding if the query is answerable over the interfaces; that is, whether there exists a plan that returns all answers to the query, assuming the source data satisfies the integrity constraints. The first component of our analysis of answerability is a reduction to a query containment problem with constraints. The second component is a set of "schema simplification" theorems capturing limitations on how interfaces with result bounds can be useful to obtain complete answers to queries. These results also help to show decidability for the containment problem that captures answerability, for many classes of constraints. The final component in our analysis of answerability is a "linearization" method, showing that query containment with certain guarded dependencies -- including those that emerge from answerability problems -- can be reduced to query containment for a well-behaved class of linear dependencies. Putting these components together, we get a detailed picture of how to check answerability over result-bounded services.

Submitted 31 August, 2018; v1 submitted 24 June, 2017; originally announced June 2017.

Comments: 45 pages, 2 tables, 43 references. Complete version with proofs of the PODS'18 paper. The main text of this paper is almost identical to the PODS'18 except that we have fixed some small mistakes. Relative to the earlier arXiv version, many errors were corrected, and some terminology has changed

arXiv:1703.03201 [pdf, ps, other]

doi 10.1145/3034786.3056121

Conjunctive Queries on Probabilistic Graphs: Combined Complexity

Authors: Antoine Amarilli, Mikaël Monet, Pierre Senellart

Abstract: Query evaluation over probabilistic databases is known to be intractable in many cases, even in data complexity, i.e., when the query is fixed. Although some restrictions of the queries [19] and instances [4] have been proposed to lower the complexity, these known tractable cases usually do not apply to combined complexity, i.e., when the query is not fixed. This leaves open the question of which query and instance languages ensure the tractability of probabilistic query evaluation in combined complexity. This paper proposes the first general study of the combined complexity of conjunctive query evaluation on probabilistic instances over binary signatures, which we can alternatively phrase as a probabilistic version of the graph homomorphism problem, or of a constraint satisfaction problem (CSP) variant. We study the complexity of this problem depending on whether instances and queries can use features such as edge labels, disconnectedness, branching, and edges in both directions. We show that the complexity landscape is surprisingly rich, using a variety of technical tools: automata-based compilation to d-DNNF lineages as in [4], β-acyclic lineages using [10], the X-property for tractable CSP from [24], graded DAGs [27] and various coding techniques for hardness proofs.

Submitted 27 August, 2019; v1 submitted 9 March, 2017; originally announced March 2017.

Comments: 36 pages including 4 appendix sections. This is the PODS'17 article with all proofs and all reviewer feedback. Relative to the previous version and to the PODS version, this version adds details about a subtle point in Appendix D, and fixes some minor formatting issues

arXiv:1702.05589 [pdf, other]

doi 10.4230/LIPIcs.ICALP.2017.111

A Circuit-Based Approach to Efficient Enumeration

Authors: Antoine Amarilli, Pierre Bourhis, Louis Jachiet, Stefan Mengel

Abstract: We study the problem of enumerating the satisfying valuations of a circuit while bounding the delay, i.e., the time needed to compute each successive valuation. We focus on the class of structured d-DNNF circuits originally introduced in knowledge compilation, a sub-area of artificial intelligence. We propose an algorithm for these circuits that enumerates valuations with linear preprocessing and delay linear in the Hamming weight of each valuation. Moreover, valuations of constant Hamming weight can be enumerated with linear preprocessing and constant delay. Our results yield a framework for efficient enumeration that applies to all problems whose solutions can be compiled to structured d-DNNFs. In particular, we use it to recapture classical results in database theory, for factorized database representations and for MSO evaluation. This gives an independent proof of constant-delay enumeration for MSO formulae with first-order free variables on bounded-treewidth structures.

Submitted 5 May, 2017; v1 submitted 18 February, 2017; originally announced February 2017.

Comments: 45 pages, 1 figure, 36 references. Accepted at ICALP'17. This paper is the full version with appendices of the article in the ICALP proceedings. The main text of this full version is the same as the ICALP proceedings version, except some superficial changes (to fit the proceedings version to 12 pages, and to obey LIPIcs-specific formatting requirements)

arXiv:1701.02634 [pdf, other]

doi 10.4230/LIPIcs.ICDT.2017.5

Top-k Querying of Unknown Values under Order Constraints (Extended Version)

Authors: Antoine Amarilli, Yael Amsterdamer, Tova Milo, Pierre Senellart

Abstract: Many practical scenarios make it necessary to evaluate top-k queries over data items with partially unknown values. This paper considers a setting where the values are taken from a numerical domain, and where some partial order constraints are given over known and unknown values: under these constraints, we assume that all possible worlds are equally likely. Our work is the first to propose a principled scheme to derive the value distributions and expected values of unknown items in this setting, with the goal of computing estimated top-k results by interpolating the unknown values from the known ones. We study the complexity of this general task, and show tight complexity bounds, proving that the problem is intractable, but can be tractably approximated. We then consider the case of tree-shaped partial orders, where we show a constructive PTIME solution. We also compare our problem setting to other top-k definitions on uncertain data.

Submitted 10 January, 2017; originally announced January 2017.

Comments: 32 pages, 1 figure, 1 algorithm, 51 references. Extended version of paper at ICDT'17

arXiv:1612.05786 [pdf, other]

doi 10.1145/3018661.3018739

Predicting Completeness in Knowledge Bases

Authors: Luis Galárraga, Simon Razniewski, Antoine Amarilli, Fabian M. Suchanek

Abstract: Knowledge bases such as Wikidata, DBpedia, or YAGO contain millions of entities and facts. In some knowledge bases, the correctness of these facts has been evaluated. However, much less is known about their completeness, i.e., the proportion of real facts that the knowledge bases cover. In this work, we investigate different signals to identify the areas where a knowledge base is complete. We show that we can combine these signals in a rule mining approach, which allows us to predict where facts may be missing. We also show that completeness predictions can help other applications such as fact prediction.

Submitted 17 December, 2016; originally announced December 2016.

Comments: 21 pages, 19 references, 1 figure, 5 tables. Complete version of the article accepted at WSDM'17

arXiv:1612.04203 [pdf, other]

doi 10.4230/LIPIcs.ICDT.2017.6

Combined Tractability of Query Evaluation via Tree Automata and Cycluits (Extended Version)

Authors: Antoine Amarilli, Pierre Bourhis, Mikaël Monet, Pierre Senellart

Abstract: We investigate parameterizations of both database instances and queries that make query evaluation fixed-parameter tractable in combined complexity. We introduce a new Datalog fragment with stratified negation, intensional-clique-guarded Datalog (ICG-Datalog), with linear-time evaluation on structures of bounded treewidth for programs of bounded rule size. Such programs capture in particular conjunctive queries with simplicial decompositions of bounded width, guarded negation fragment queries of bounded CQ-rank, or two-way regular path queries. Our result proceeds via compilation to alternating two-way automata, whose semantics is defined via cyclic provenance circuits (cycluits) that can be tractably evaluated. Last, we prove that probabilistic query evaluation remains intractable in combined complexity under this parameterization.

Submitted 15 January, 2017; v1 submitted 13 December, 2016; originally announced December 2016.

Comments: 69 pages, accepted at ICDT'17. Appendix F contains results from an independent upcoming journal paper by Michael Benedikt, Pierre Bourhis, Georg Gottlob, and Pierre Senellart

arXiv:1607.05538 [pdf, other]

doi 10.1007/978-3-319-45856-4_22

Challenges for Efficient Query Evaluation on Structured Probabilistic Data

Authors: Antoine Amarilli, Silviu Maniu, Mikaël Monet

Abstract: Query answering over probabilistic data is an important task but is generally intractable. However, a new approach for this problem has recently been proposed, based on structural decompositions of input databases, following, e.g., tree decompositions. This paper presents a vision for a database management system for probabilistic data built following this structural approach. We review our existing and ongoing work on this topic and highlight many theoretical and practical challenges that remain to be addressed.

Submitted 19 July, 2016; originally announced July 2016.

Comments: 9 pages, 1 figure, 23 references. Accepted for publication at SUM 2016

arXiv:1607.00813 [pdf, other]

Query Answering with Transitive and Linear-Ordered Data

Authors: Antoine Amarilli, Michael Benedikt, Pierre Bourhis, Michael Vanden Boom

Abstract: We consider entailment problems involving powerful constraint languages such as guarded existential rules, in which additional semantic restrictions are put on a set of distinguished relations. We consider restricting a relation to be transitive, restricting a relation to be the transitive closure of another relation, and restricting a relation to be a linear order. We give some natural generalizations of guardedness that allow inference to be decidable in each case, and isolate the complexity of the corresponding decision problems. Finally we show that slight changes in our conditions lead to undecidability.

Submitted 4 July, 2016; originally announced July 2016.

Comments: 36 pages. To appear in IJCAI 2016. Extended version with proofs

Journal ref: A journal version of this conference article was published in JAIR (Volume 63, 2018): https://www.jair.org/index.php/jair/article/view/11240

arXiv:1604.02761 [pdf, ps, other]

doi 10.1145/2902251.2902301

Tractable Lineages on Treelike Instances: Limits and Extensions

Authors: Antoine Amarilli, Pierre Bourhis, Pierre Senellart

Abstract: Query evaluation on probabilistic databases is generally intractable (#P-hard). Existing dichotomy results have identified which queries are tractable (or safe), and connected them to tractable lineages. In our previous work, using different tools, we showed that query evaluation is linear-time on probabilistic databases for arbitrary monadic second-order queries, if we bound the treewidth of the instance. In this paper, we study limitations and extensions of this result. First, for probabilistic query evaluation, we show that MSO tractability cannot extend beyond bounded treewidth: there are even FO queries that are hard on any efficiently constructible unbounded-treewidth class of graphs. This dichotomy relies on recent polynomial bounds on the extraction of planar graphs as minors, and implies lower bounds in non-probabilistic settings, for query evaluation and match counting in subinstance-closed families. Second, we show how to explain our tractability result in terms of lineage: the lineage of MSO queries on bounded-treewidth instances can be represented as bounded-treewidth circuits, polynomial-size OBDDs, and linear-size d-DNNFs. By contrast, we can strengthen the previous dichotomy to lineages, and show that there are even UCQs with disequalities that have superpolynomial OBDDs on all unbounded-treewidth graph classes; we give a characterization of such queries. Last, we show how bounded-treewidth tractability explains the tractability of the inversion-free safe queries: we can rewrite their input instances to have bounded-treewidth.

Submitted 12 April, 2023; v1 submitted 10 April, 2016; originally announced April 2016.

Comments: 36 pages, 2 tables. Version with proofs of the PODS'16 article. Some omitted proofs are available in the thesis of the first author. Includes a corrected proof of Theorem 5.5

arXiv:1511.08723 [pdf, ps, other]

doi 10.1007/978-3-662-47666-6_5

Provenance Circuits for Trees and Treelike Instances (Extended Version)

Authors: Antoine Amarilli, Pierre Bourhis, Pierre Senellart

Abstract: Query evaluation in monadic second-order logic (MSO) is tractable on trees and treelike instances, even though it is hard for arbitrary instances. This tractability result has been extended to several tasks related to query evaluation, such as counting query results [3] or performing query evaluation on probabilistic trees [10]. These are two examples of the more general problem of computing augmented query output, that is referred to as provenance. This article presents a provenance framework for trees and treelike instances, by describing a linear-time construction of a circuit provenance representation for MSO queries. We show how this provenance can be connected to the usual definitions of semiring provenance on relational instances [20], even though we compute it in an unusual way, using tree automata; we do so via intrinsic definitions of provenance for general semirings, independent of the operational details of query evaluation. We show applications of this provenance to capture existing counting and probabilistic results on trees and treelike instances, and give novel consequences for probability evaluation.

Submitted 27 November, 2015; originally announced November 2015.

Comments: 48 pages. Presented at ICALP'15

arXiv:1507.04955 [pdf, ps, other]

doi 10.1145/2744680.2744690

Structurally Tractable Uncertain Data

Authors: Antoine Amarilli

Abstract: Many data management applications must deal with data which is uncertain, incomplete, or noisy. However, on existing uncertain data representations, we cannot tractably perform the important query evaluation tasks of determining query possibility, certainty, or probability: these problems are hard on arbitrary uncertain input instances. We thus ask whether we could restrict the structure of uncertain data so as to guarantee the tractability of exact query evaluation. We present our tractability results for tree and tree-like uncertain data, and a vision for probabilistic rule reasoning. We also study uncertainty about order, proposing a suitable representation, and study uncertain data conditioned by additional observations.

Submitted 17 July, 2015; originally announced July 2015.

Comments: 11 pages, 1 figure, 1 table. To appear in SIGMOD/PODS PhD Symposium 2015

ACM Class: H.2.1

arXiv:1505.04216 [pdf, ps, other]

doi 10.1109/LICS.2015.37

Finite Open-World Query Answering with Number Restrictions (Extended Version)

Authors: Antoine Amarilli, Michael Benedikt

Abstract: Open-world query answering is the problem of deciding, given a set of facts, conjunction of constraints, and query, whether the facts and constraints imply the query. This amounts to reasoning over all instances that include the facts and satisfy the constraints. We study finite open-world query answering (FQA), which assumes that the underlying world is finite and thus only considers the finite completions of the instance. The major known decidable cases of FQA derive from the following: the guarded fragment of first-order logic, which can express referential constraints (data in one place points to data in another) but cannot express number restrictions such as functional dependencies; and the guarded fragment with number restrictions but on a signature of arity only two. In this paper, we give the first decidability results for FQA that combine both referential constraints and number restrictions for arbitrary signatures: we show that, for unary inclusion dependencies and functional dependencies, the finiteness assumption of FQA can be lifted up to taking the finite implication closure of the dependencies. Our result relies on new techniques to construct finite universal models of such constraints, for any bound on the maximal query size.

Submitted 15 May, 2015; originally announced May 2015.

Comments: 59 pages. To appear in LICS 2015. Extended version including proofs

arXiv:1505.00841 [pdf, other]

doi 10.1145/2767109.2767116

Harvesting Entities from the Web Using Unique Identifiers -- IBEX

Authors: Aliaksandr Talaika, Joanna Biega, Antoine Amarilli, Fabian M. Suchanek

Abstract: In this paper we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale. Starting with a simple extraction of identifiers and names from Web pages, we show how we can use the properties of unique identifiers to filter out noise and clean up the extraction result on the entire corpus. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73--96% and a very high coverage compared to existing knowledge bases. We use this database to compute novel statistics on the presence of products, people, and other entities on the Web.

Submitted 4 May, 2015; originally announced May 2015.

Comments: 30 pages, 5 figures, 9 tables. Complete technical report for A. Talaika, J. A. Biega, A. Amarilli, and F. M. Suchanek. IBEX: Harvesting Entities from the Web Using Unique Identifiers. WebDB workshop, 2015

arXiv:1505.00326 [pdf, ps, other]

Combining Existential Rules and Description Logics (Extended Version)

Authors: Antoine Amarilli, Michael Benedikt

Abstract: Query answering under existential rules -- implications with existential quantifiers in the head -- is known to be decidable when imposing restrictions on the rule bodies such as frontier-guardedness [BLM10, BLMS11]. Query answering is also decidable for description logics [Baa03], which further allow disjunction and functionality constraints (assert that certain relations are functions), however, they are focused on ER-type schemas, where relations have arity two. This work investigates how to get the best of both worlds: having decidable existential rules on arbitrary arity relations, while allowing rich description logics, including functionality constraints, on arity-two relations. We first show negative results on combining such decidable languages. Second, we introduce an expressive set of existential rules (frontier-one rules with a certain restriction) which can be combined with powerful constraints on arity-two relations (e.g. GC², ALCQIb) while retaining decidable query answering. Further, we provide conditions to add functionality constraints on the higher-arity relations.

Submitted 2 May, 2015; originally announced May 2015.

Comments: 32 pages. To appear in IJCAI 2015. Extended version including proofs

Journal ref: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI), 2015, pages 2691-2697

arXiv:1404.3131 [pdf, ps, other]

doi 10.3166/isi.20.5.53-75

The Possibility Problem for Probabilistic XML (Extended Version)

Authors: Antoine Amarilli

Abstract: We consider the possibility problem of determining if a document is a possible world of a probabilistic document, in the setting of probabilistic XML. This basic question is a special case of query answering or tree automata evaluation, but it has specific practical uses, such as checking whether a user-provided probabilistic document outcome is possible or sufficiently likely. In this paper, we study the complexity of the possibility problem for probabilistic XML models of varying expressiveness. We show that the decision problem is often tractable in the absence of long-distance dependencies, but that its computation variant is intractable on unordered documents. We also introduce an explicit matches variant to generalize practical situations where node labels are unambiguous; this ensures tractability of the possibility problem, even under long-distance dependencies, provided event conjunctions are disallowed. Our results entirely classify the tractability boundary over all considered problem variants.

Submitted 22 July, 2014; v1 submitted 11 April, 2014; originally announced April 2014.

Comments: 20 pages, 1 table, 2 figures. This is the complete version (including proofs) of work initially submitted as an extended abstract (without proofs) at the AMW 2014 workshop and subsequently submitted (with proofs) at the BDA 2014 conference (no formal proceedings). This version integrates the feedback from both rounds of reviews

ACM Class: H.2.3; E.1

arXiv:1403.0783 [pdf, ps, other]

doi 10.1007/978-3-662-43984-5_27

Uncertainty in Crowd Data Sourcing under Structural Constraints

Authors: Antoine Amarilli, Yael Amsterdamer, Tova Milo

Abstract: Applications extracting data from crowdsourcing platforms must deal with the uncertainty of crowd answers in two different ways: first, by deriving estimates of the correct value from the answers; second, by choosing crowd questions whose answers are expected to minimize this uncertainty relative to the overall data collection goal. Such problems are already challenging when we assume that questions are unrelated and answers are independent, but they are even more complicated when we assume that the unknown values follow hard structural constraints (such as monotonicity). In this vision paper, we examine how to formally address this issue with an approach inspired by [Amsterdamer et al., 2013]. We describe a generalized setting where we model constraints as linear inequalities, and use them to guide the choice of crowd questions and the processing of answers. We present the main challenges arising in this setting, and propose directions to solve them.

Submitted 4 March, 2014; originally announced March 2014.

Comments: 8 pages, vision paper. To appear at UnCrowd 2014

ACM Class: H.2.8

arXiv:1312.3248 [pdf, other]

doi 10.5441/002/icdt.2014.06

On the Complexity of Mining Itemsets from the Crowd Using Taxonomies

Authors: Antoine Amarilli, Yael Amsterdamer, Tova Milo

Abstract: We study the problem of frequent itemset mining in domains where data is not recorded in a conventional database but only exists in human knowledge. We provide examples of such scenarios, and present a crowdsourcing model for them. The model uses the crowd as an oracle to find out whether an itemset is frequent or not, and relies on a known taxonomy of the item domain to guide the search for frequent itemsets. In the spirit of data mining with oracles, we analyze the complexity of this problem in terms of (i) crowd complexity, that measures the number of crowd questions required to identify the frequent itemsets; and (ii) computational complexity, that measures the computational effort required to choose the questions. We provide lower and upper complexity bounds in terms of the size and structure of the input taxonomy, as well as the size of a concise description of the output itemsets. We also provide constructive algorithms that achieve the upper bounds, and consider more efficient variants for practical situations.

Submitted 16 December, 2013; v1 submitted 11 December, 2013; originally announced December 2013.

Comments: 18 pages, 2 figures. To be published at ICDT'14. Added missing acknowledgement

ACM Class: H.2.8

Showing 1–50 of 51 results for author: Amarilli, A