subscribe to arXiv mailings

Lightweight Correlation-Aware Table Compression

Authors: Mihail Stoian, Alexander van Renen, Jan Kobiolka, Ping-Lin Kuo, Josif Grabocka, Andreas Kipf

Abstract: The growing adoption of data lakes for managing relational data necessitates efficient, open storage formats that provide high scan performance and competitive compression ratios. While existing formats achieve fast scans through lightweight encoding techniques, they have reached a plateau in terms of minimizing storage footprint. Recently, correlation-aware compression schemes have been shown to… ▽ More The growing adoption of data lakes for managing relational data necessitates efficient, open storage formats that provide high scan performance and competitive compression ratios. While existing formats achieve fast scans through lightweight encoding techniques, they have reached a plateau in terms of minimizing storage footprint. Recently, correlation-aware compression schemes have been shown to reduce file sizes further. Yet, current approaches either incur significant scan overheads or require manual specification of correlations, limiting their practicability. We present $\texttt{Virtual}$, a framework that integrates seamlessly with existing open formats to automatically leverage data correlations, achieving substantial compression gains while having minimal scan performance overhead. Experiments on data-gov datasets show that $\texttt{Virtual}$ reduces file sizes by up to 40% compared to Apache Parquet. △ Less

Submitted 21 October, 2024; v1 submitted 17 October, 2024; originally announced October 2024.

Comments: Third Table Representation Learning Workshop (TRL @ NeurIPS 2024)

arXiv:2409.08013 [pdf, other]

DPconv: Super-Polynomially Faster Join Ordering

Authors: Mihail Stoian, Andreas Kipf

Abstract: We revisit the join ordering problem in query optimization. The standard exact algorithm, DPccp, has a worst-case running time of $O(3^n)$. This is prohibitively expensive for large queries, which are not that uncommon anymore. We develop a new algorithmic framework based on subset convolution. DPconv achieves a super-polynomial speedup over DPccp, breaking the $O(3^n)$ time-barrier for the first… ▽ More We revisit the join ordering problem in query optimization. The standard exact algorithm, DPccp, has a worst-case running time of $O(3^n)$. This is prohibitively expensive for large queries, which are not that uncommon anymore. We develop a new algorithmic framework based on subset convolution. DPconv achieves a super-polynomial speedup over DPccp, breaking the $O(3^n)$ time-barrier for the first time. We show that the instantiation of our framework for the $C_\max$ cost function is up to 30x faster than DPccp for large clique queries. △ Less

Submitted 12 September, 2024; originally announced September 2024.

Comments: To appear at SIGMOD 2025

arXiv:2403.17229 [pdf, other]

Corra: Correlation-Aware Column Compression

Authors: Hanwen Liu, Mihail Stoian, Alexander van Renen, Andreas Kipf

Abstract: Column encoding schemes have witnessed a spark of interest with the rise of open storage formats (like Parquet) in data lakes in modern cloud deployments. This is not surprising -- as data volume increases, it becomes more and more important to reduce storage cost on block storage (such as S3) as well as reduce memory pressure in multi-tenant in-memory buffers of cloud databases. However, single-c… ▽ More Column encoding schemes have witnessed a spark of interest with the rise of open storage formats (like Parquet) in data lakes in modern cloud deployments. This is not surprising -- as data volume increases, it becomes more and more important to reduce storage cost on block storage (such as S3) as well as reduce memory pressure in multi-tenant in-memory buffers of cloud databases. However, single-column encoding schemes have reached a plateau in terms of the compression size they can achieve. We argue that this is due to the neglect of cross-column correlations. For instance, consider the column pair ($\texttt{city}$, $\texttt{zip_code}$). Typically, cities have only a few dozen unique zip codes. If this information is properly exploited, it can significantly reduce the space consumption of the latter column. In this work, we depart from the established path of compressing data using only single-column encoding schemes and introduce several what we call $\textit{horizontal}$, correlation-aware encoding schemes. We demonstrate their advantages over single-column encoding schemes on the well-known TPC-H's $\texttt{lineitem}$, LDBC's $\texttt{message}$, DMV, and Taxi datasets. Our correlation-aware encoding schemes save up to 58.3% of the compressed size over single-column schemes for $\texttt{lineitem}$'s $\texttt{receiptdate}$, 53.7% for DMV's $\texttt{zip_code}$, and 85.16% for Taxi's $\texttt{total_amount}$. △ Less

Submitted 17 June, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

Comments: Submitted to CloudDB'24

arXiv:2309.06354 [pdf, other]

Enhancing In-Memory Spatial Indexing with Learned Search

Authors: Varun Pandey, Alexander van Renen, Eleni Tzirita Zacharatou, Andreas Kipf, Ibrahim Sabek, Jialin Ding, Volker Markl, Alfons Kemper

Abstract: Spatial data is ubiquitous. Massive amounts of data are generated every day from a plethora of sources such as billions of GPS-enabled devices (e.g., cell phones, cars, and sensors), consumer-based applications (e.g., Uber and Strava), and social media platforms (e.g., location-tagged posts on Facebook, Twitter, and Instagram). This exponential growth in spatial data has led the research community… ▽ More Spatial data is ubiquitous. Massive amounts of data are generated every day from a plethora of sources such as billions of GPS-enabled devices (e.g., cell phones, cars, and sensors), consumer-based applications (e.g., Uber and Strava), and social media platforms (e.g., location-tagged posts on Facebook, Twitter, and Instagram). This exponential growth in spatial data has led the research community to build systems and applications for efficient spatial data processing. In this study, we apply a recently developed machine-learned search technique for single-dimensional sorted data to spatial indexing. Specifically, we partition spatial data using six traditional spatial partitioning techniques and employ machine-learned search within each partition to support point, range, distance, and spatial join queries. Adhering to the latest research trends, we tune the partitioning techniques to be instance-optimized. By tuning each partitioning technique for optimal performance, we demonstrate that: (i) grid-based index structures outperform tree-based index structures (from 1.23$\times$ to 2.47$\times$), (ii) learning-enhanced variants of commonly used spatial index structures outperform their original counterparts (from 1.44$\times$ to 53.34$\times$ faster), (iii) machine-learned search within a partition is faster than binary search by 11.79% - 39.51% when filtering on one dimension, (iv) the benefit of machine-learned search diminishes in the presence of other compute-intensive operations (e.g. scan costs in higher selectivity queries, Haversine distance computation, and point-in-polygon tests), and (v) index lookup is the bottleneck for tree-based structures, which could potentially be reduced by linearizing the indexed partitions. △ Less

Submitted 12 September, 2023; originally announced September 2023.

Comments: arXiv admin note: text overlap with arXiv:2008.10349

arXiv:2205.05769 [pdf, other]

LSI: A Learned Secondary Index Structure

Authors: Andreas Kipf, Dominik Horn, Pascal Pfeil, Ryan Marcus, Tim Kraska

Abstract: Learned index structures have been shown to achieve favorable lookup performance and space consumption compared to their traditional counterparts such as B-trees. However, most learned index studies have focused on the primary indexing setting, where the base data is sorted. In this work, we investigate whether learned indexes sustain their advantage in the secondary indexing setting. We introduce… ▽ More Learned index structures have been shown to achieve favorable lookup performance and space consumption compared to their traditional counterparts such as B-trees. However, most learned index studies have focused on the primary indexing setting, where the base data is sorted. In this work, we investigate whether learned indexes sustain their advantage in the secondary indexing setting. We introduce Learned Secondary Index (LSI), a first attempt to use learned indexes for indexing unsorted data. LSI works by building a learned index over a permutation vector, which allows binary search to performed on the unsorted base data using random access. We additionally augment LSI with a fingerprint vector to accelerate equality lookups. We show that LSI achieves comparable lookup performance to state-of-the-art secondary indexes while being up to 6x more space efficient. △ Less

Submitted 11 May, 2022; originally announced May 2022.

Comments: Fifth International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM 2022)

arXiv:2111.14905 [pdf, other]

Bounding the Last Mile: Efficient Learned String Indexing

Authors: Benjamin Spector, Andreas Kipf, Kapil Vaidya, Chi Wang, Umar Farooq Minhas, Tim Kraska

Abstract: We introduce the RadixStringSpline (RSS) learned index structure for efficiently indexing strings. RSS is a tree of radix splines each indexing a fixed number of bytes. RSS approaches or exceeds the performance of traditional string indexes while using 7-70$\times$ less memory. RSS achieves this by using the minimal string prefix to sufficiently distinguish the data unlike most learned approaches… ▽ More We introduce the RadixStringSpline (RSS) learned index structure for efficiently indexing strings. RSS is a tree of radix splines each indexing a fixed number of bytes. RSS approaches or exceeds the performance of traditional string indexes while using 7-70$\times$ less memory. RSS achieves this by using the minimal string prefix to sufficiently distinguish the data unlike most learned approaches which index the entire string. Additionally, the bounded-error nature of RSS accelerates the last mile search and also enables a memory-efficient hash-table lookup accelerator. We benchmark RSS on several real-world string datasets against ART and HOT. Our experiments suggest this line of research may be promising for future memory-intensive database applications. △ Less

Submitted 29 November, 2021; originally announced November 2021.

Comments: 3rd International Workshop on Applied AI for Database Systems and Applications (AIDB'21), August 20, 2021, Copenhagen, Denmark

arXiv:2108.05117 [pdf, other]

Towards Practical Learned Indexing

Authors: Mihail Stoian, Andreas Kipf, Ryan Marcus, Tim Kraska

Abstract: Latest research proposes to replace existing index structures with learned models. However, current learned indexes tend to have many hyperparameters, often do not provide any error guarantees, and are expensive to build. We introduce Practical Learned Index (PLEX). PLEX only has a single hyperparameter $ε$ (maximum prediction error) and offers a better trade-off between build and lookup time than… ▽ More Latest research proposes to replace existing index structures with learned models. However, current learned indexes tend to have many hyperparameters, often do not provide any error guarantees, and are expensive to build. We introduce Practical Learned Index (PLEX). PLEX only has a single hyperparameter $ε$ (maximum prediction error) and offers a better trade-off between build and lookup time than state-of-the-art approaches. Similar to RadixSpline, PLEX consists of a spline and a (multi-level) radix layer. It first builds a spline satisfying the given $ε$ and then performs an ad-hoc analysis of the distribution of spline points to quickly tune the radix layer. △ Less

Submitted 6 November, 2021; v1 submitted 11 August, 2021; originally announced August 2021.

Comments: 3rd International Workshop on Applied AI for Database Systems and Applications (AIDB'21), August 20, 2021, Copenhagen, Denmark

arXiv:2107.01464 [pdf, other]

When Are Learned Models Better Than Hash Functions?

Authors: Ibrahim Sabek, Kapil Vaidya, Dominik Horn, Andreas Kipf, Tim Kraska

Abstract: In this work, we aim to study when learned models are better hash functions, particular for hash-maps. We use lightweight piece-wise linear models to replace the hash functions as they have small inference times and are sufficiently general to capture complex distributions. We analyze the learned models in terms of: the model inference time and the number of collisions. Surprisingly, we found that… ▽ More In this work, we aim to study when learned models are better hash functions, particular for hash-maps. We use lightweight piece-wise linear models to replace the hash functions as they have small inference times and are sufficiently general to capture complex distributions. We analyze the learned models in terms of: the model inference time and the number of collisions. Surprisingly, we found that learned models are not much slower to compute than hash functions if optimized correctly. However, it turns out that learned models can only reduce the number of collisions (i.e., the number of times different keys have the same hash value) if the model is able to over-fit to the data; otherwise, it can not be better than an ordinary hash function. Hence, how much better a learned model is in avoiding collisions highly depends on the data and the ability of the model to over-fit. To evaluate the effectiveness of learned models, we used them as hash functions in the bucket chaining and Cuckoo hash tables. For bucket chaining hash table, we found that learned models can achieve 30% smaller sizes and 10% lower probe latency. For Cuckoo hash tables, in some datasets, learned models can increase the ratio of keys stored in their primary locations by around 10%. In summary, we found that learned models can indeed outperform hash functions but only for certain data distributions and with a limited margin. △ Less

Submitted 3 July, 2021; originally announced July 2021.

arXiv:2105.08830 [pdf, other]

doi 10.1145/3464509.3464885

LEA: A Learned Encoding Advisor for Column Stores

Authors: Lujing Cen, Andreas Kipf, Ryan Marcus, Tim Kraska

Abstract: Data warehouses organize data in a columnar format to enable faster scans and better compression. Modern systems offer a variety of column encodings that can reduce storage footprint and improve query performance. Selecting a good encoding scheme for a particular column is an optimization problem that depends on the data, the query workload, and the underlying hardware. We introduce Learned Encodi… ▽ More Data warehouses organize data in a columnar format to enable faster scans and better compression. Modern systems offer a variety of column encodings that can reduce storage footprint and improve query performance. Selecting a good encoding scheme for a particular column is an optimization problem that depends on the data, the query workload, and the underlying hardware. We introduce Learned Encoding Advisor (LEA), a learned approach to column encoding selection. LEA is trained on synthetic datasets with various distributions on the target system. Once trained, LEA uses sample data and statistics (such as cardinality) from the user's database to predict the optimal column encodings. LEA can optimize for encoded size, query performance, or a combination of the two. Compared to the heuristic-based encoding advisor of a commercial column store on TPC-H, LEA achieves 19% lower query latency while using 26% less space. △ Less

Submitted 18 May, 2021; originally announced May 2021.

arXiv:2101.04964 [pdf, other]

Flow-Loss: Learning Cardinality Estimates That Matter

Authors: Parimarjan Negi, Ryan Marcus, Andreas Kipf, Hongzi Mao, Nesime Tatbul, Tim Kraska, Mohammad Alizadeh

Abstract: Previous approaches to learned cardinality estimation have focused on improving average estimation error, but not all estimates matter equally. Since learned models inevitably make mistakes, the goal should be to improve the estimates that make the biggest difference to an optimizer. We introduce a new loss function, Flow-Loss, that explicitly optimizes for better query plans by approximating the… ▽ More Previous approaches to learned cardinality estimation have focused on improving average estimation error, but not all estimates matter equally. Since learned models inevitably make mistakes, the goal should be to improve the estimates that make the biggest difference to an optimizer. We introduce a new loss function, Flow-Loss, that explicitly optimizes for better query plans by approximating the optimizer's cost model and dynamic programming search algorithm with analytical functions. At the heart of Flow-Loss is a reduction of query optimization to a flow routing problem on a certain plan graph in which paths correspond to different query plans. To evaluate our approach, we introduce the Cardinality Estimation Benchmark, which contains the ground truth cardinalities for sub-plans of over 16K queries from 21 templates with up to 15 joins. We show that across different architectures and databases, a model trained with Flow-Loss improves the cost of plans (using the PostgreSQL cost model) and query runtimes despite having worse estimation accuracy than a model trained with Q-Error. When the test set queries closely match the training queries, both models improve performance significantly over PostgreSQL and are close to the optimal performance (using true cardinalities). However, the Q-Error trained model degrades significantly when evaluated on queries that are slightly different (e.g., similar but not identical query templates), while the Flow-Loss trained model generalizes better to such situations. For example, the Flow-Loss model achieves up to 1.5x better runtimes on unseen templates compared to the Q-Error model, despite leveraging the same model architecture and training data. △ Less

Submitted 13 January, 2021; originally announced January 2021.

arXiv:2010.12548 [pdf, other]

The Case for Distance-Bounded Spatial Approximations

Authors: Eleni Tzirita Zacharatou, Andreas Kipf, Ibrahim Sabek, Varun Pandey, Harish Doraiswamy, Volker Markl

Abstract: Spatial approximations have been traditionally used in spatial databases to accelerate the processing of complex geometric operations. However, approximations are typically only used in a first filtering step to determine a set of candidate spatial objects that may fulfill the query condition. To provide accurate results, the exact geometries of the candidate objects are tested against the query c… ▽ More Spatial approximations have been traditionally used in spatial databases to accelerate the processing of complex geometric operations. However, approximations are typically only used in a first filtering step to determine a set of candidate spatial objects that may fulfill the query condition. To provide accurate results, the exact geometries of the candidate objects are tested against the query condition, which is typically an expensive operation. Nevertheless, many emerging applications (e.g., visualization tools) require interactive responses, while only needing approximate results. Besides, real-world geospatial data is inherently imprecise, which makes exact data processing unnecessary. Given the uncertainty associated with spatial data and the relaxed precision requirements of many applications, this vision paper advocates for approximate spatial data processing techniques that omit exact geometric tests and provide final answers solely on the basis of (fine-grained) approximations. Thanks to recent hardware advances, this vision can be realized today. Furthermore, our approximate techniques employ a distance-based error bound, i.e., a bound on the maximum spatial distance between false (or missing) and exact results which is crucial for meaningful analyses. This bound allows to control the precision of the approximation and trade accuracy for performance. △ Less

Submitted 21 January, 2021; v1 submitted 23 October, 2020; originally announced October 2020.

Comments: 11th Annual Conference on Innovative Data Systems Research (CIDR'21)

arXiv:2008.10349 [pdf, other]

The Case for Learned Spatial Indexes

Authors: Varun Pandey, Alexander van Renen, Andreas Kipf, Ibrahim Sabek, Jialin Ding, Alfons Kemper

Abstract: Spatial data is ubiquitous. Massive amounts of data are generated every day from billions of GPS-enabled devices such as cell phones, cars, sensors, and various consumer-based applications such as Uber, Tinder, location-tagged posts in Facebook, Twitter, Instagram, etc. This exponential growth in spatial data has led the research community to focus on building systems and applications that can pro… ▽ More Spatial data is ubiquitous. Massive amounts of data are generated every day from billions of GPS-enabled devices such as cell phones, cars, sensors, and various consumer-based applications such as Uber, Tinder, location-tagged posts in Facebook, Twitter, Instagram, etc. This exponential growth in spatial data has led the research community to focus on building systems and applications that can process spatial data efficiently. In the meantime, recent research has introduced learned index structures. In this work, we use techniques proposed from a state-of-the art learned multi-dimensional index structure (namely, Flood) and apply them to five classical multi-dimensional indexes to be able to answer spatial range queries. By tuning each partitioning technique for optimal performance, we show that (i) machine learned search within a partition is faster by 11.79\% to 39.51\% than binary search when using filtering on one dimension, (ii) the bottleneck for tree structures is index lookup, which could potentially be improved by linearizing the indexed partitions (iii) filtering on one dimension and refining using machine learned indexes is 1.23x to 1.83x times faster than closest competitor which filters on two dimensions, and (iv) learned indexes can have a significant impact on the performance of low selectivity queries while being less effective under higher selectivities. △ Less

Submitted 24 August, 2020; originally announced August 2020.

arXiv:2006.12804 [pdf, other]

doi 10.14778/3421424.3421425

Benchmarking Learned Indexes

Authors: Ryan Marcus, Andreas Kipf, Alexander van Renen, Mihail Stoian, Sanchit Misra, Alfons Kemper, Thomas Neumann, Tim Kraska

Abstract: Recent advancements in learned index structures propose replacing existing index structures, like B-Trees, with approximate learned models. In this work, we present a unified benchmark that compares well-tuned implementations of three learned index structures against several state-of-the-art "traditional" baselines. Using four real-world datasets, we demonstrate that learned index structures can i… ▽ More Recent advancements in learned index structures propose replacing existing index structures, like B-Trees, with approximate learned models. In this work, we present a unified benchmark that compares well-tuned implementations of three learned index structures against several state-of-the-art "traditional" baselines. Using four real-world datasets, we demonstrate that learned index structures can indeed outperform non-learned indexes in read-only in-memory workloads over a dense array. We also investigate the impact of caching, pipelining, dataset size, and key size. We study the performance profile of learned index structures, and build an explanation for why learned models achieve such good performance. Finally, we investigate other important properties of learned index structures, such as their performance in multi-threaded systems and their build times. △ Less

Submitted 29 June, 2020; v1 submitted 23 June, 2020; originally announced June 2020.

arXiv:2005.03156 [pdf, other]

doi 10.1109/HPEC43674.2020.9286157

Fast Mapping onto Census Blocks

Authors: Jeremy Kepner, Andreas Kipf, Darren Engwirda, Navin Vembar, Michael Jones, Lauren Milechin, Vijay Gadepally, Chris Hill, Tim Kraska, William Arcand, David Bestor, William Bergeron, Chansup Byun, Matthew Hubbell, Michael Houle, Andrew Kirby, Anna Klein, Julie Mullen, Andrew Prout, Albert Reuther, Antonio Rosa, Sid Samsi, Charles Yee, Peter Michaleas

Abstract: Pandemic measures such as social distancing and contact tracing can be enhanced by rapidly integrating dynamic location data and demographic data. Projecting billions of longitude and latitude locations onto hundreds of thousands of highly irregular demographic census block polygons is computationally challenging in both research and deployment contexts. This paper describes two approaches labeled… ▽ More Pandemic measures such as social distancing and contact tracing can be enhanced by rapidly integrating dynamic location data and demographic data. Projecting billions of longitude and latitude locations onto hundreds of thousands of highly irregular demographic census block polygons is computationally challenging in both research and deployment contexts. This paper describes two approaches labeled "simple" and "fast". The simple approach can be implemented in any scripting language (Matlab/Octave, Python, Julia, R) and is easily integrated and customized to a variety of research goals. This simple approach uses a novel combination of hierarchy, sparse bounding boxes, polygon crossing-number, vectorization, and parallel processing to achieve 100,000,000+ projections per second on 100 servers. The simple approach is compact, does not increase data storage requirements, and is applicable to any country or region. The fast approach exploits the thread, vector, and memory optimizations that are possible using a low-level language (C++) and achieves similar performance on a single server. This paper details these approaches with the goal of enabling the broader community to quickly integrate location and demographic data. △ Less

Submitted 1 August, 2020; v1 submitted 6 May, 2020; originally announced May 2020.

Comments: 8 pages, 7 figures, 55 references; accepted to IEEE HPEC 2020

arXiv:2004.14541 [pdf, other]

RadixSpline: A Single-Pass Learned Index

Authors: Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, Thomas Neumann

Abstract: Recent research has shown that learned models can outperform state-of-the-art index structures in size and lookup performance. While this is a very promising result, existing learned structures are often cumbersome to implement and are slow to build. In fact, most approaches that we are aware of require multiple training passes over the data. We introduce RadixSpline (RS), a learned index that c… ▽ More Recent research has shown that learned models can outperform state-of-the-art index structures in size and lookup performance. While this is a very promising result, existing learned structures are often cumbersome to implement and are slow to build. In fact, most approaches that we are aware of require multiple training passes over the data. We introduce RadixSpline (RS), a learned index that can be built in a single pass over the data and is competitive with state-of-the-art learned index models, like RMI, in size and lookup performance. We evaluate RS using the SOSD benchmark and show that it achieves competitive results on all datasets, despite the fact that it only has two parameters. △ Less

Submitted 22 May, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

Comments: Third International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM 2020)

arXiv:1911.13014 [pdf, other]

SOSD: A Benchmark for Learned Indexes

Authors: Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, Thomas Neumann

Abstract: A groundswell of recent work has focused on improving data management systems with learned components. Specifically, work on learned index structures has proposed replacing traditional index structures, such as B-trees, with learned models. Given the decades of research committed to improving index structures, there is significant skepticism about whether learned indexes actually outperform state-… ▽ More A groundswell of recent work has focused on improving data management systems with learned components. Specifically, work on learned index structures has proposed replacing traditional index structures, such as B-trees, with learned models. Given the decades of research committed to improving index structures, there is significant skepticism about whether learned indexes actually outperform state-of-the-art implementations of traditional structures on real-world data. To answer this question, we propose a new benchmarking framework that comes with a variety of real-world datasets and baseline implementations to compare against. We also show preliminary results for selected index structures, and find that learned models indeed often outperform state-of-the-art implementations, and are therefore a promising direction for future research. △ Less

Submitted 29 November, 2019; originally announced November 2019.

Comments: NeurIPS 2019 Workshop on Machine Learning for Systems

arXiv:1908.07753 [pdf, other]

doi 10.5441/002/edbt.2021.16

GeoBlocks: A Query-Cache Accelerated Data Structure for Spatial Aggregation over Polygons

Authors: Christian Winter, Andreas Kipf, Christoph Anneser, Eleni Tzirita Zacharatou, Thomas Neumann, Alfons Kemper

Abstract: As individual traffic and public transport in cities are changing, city authorities need to analyze urban geospatial data to improve transportation and infrastructure. To that end, they highly rely on spatial aggregation queries that extract summarized information from point data (e.g., Uber rides) contained in a given polygonal region (e.g., a city neighborhood). To support such queries, current… ▽ More As individual traffic and public transport in cities are changing, city authorities need to analyze urban geospatial data to improve transportation and infrastructure. To that end, they highly rely on spatial aggregation queries that extract summarized information from point data (e.g., Uber rides) contained in a given polygonal region (e.g., a city neighborhood). To support such queries, current analysis tools either allow only predefined aggregates on predefined regions and are thus unsuitable for exploratory analyses, or access the raw data to compute aggregate results on-the-fly, which severely limits the interactivity. At the same time, existing pre-aggregation techniques are inadequate since they maintain aggregates over rectangular regions. As a result, when applied over arbitrary polygonal regions, they induce an approximation error that cannot be bounded. In this paper, we introduce GeoBlocks, a novel pre-aggregating data structure that supports spatial aggregation over arbitrary polygons. GeoBlocks closely approximate polygons using a set of fine-grained grid cells and, in contrast to prior work, allow to bound the approximation error by adjusting the cell size. Furthermore, GeoBlocks employ a trie-like cache that caches aggregate results of frequently queried regions, thereby dynamically adapting to the skew inherently present in query workloads and improving performance over time. In summary, GeoBlocks outperform on-the-fly aggregation by up to three orders of magnitude, achieving the sub-second query latencies required for interactive exploratory analytics. △ Less

Submitted 16 March, 2021; v1 submitted 21 August, 2019; originally announced August 2019.

Comments: Accepted at EDBT 2021, please cite the EDBT version

arXiv:1906.06085 [pdf, other]

DeepSPACE: Approximate Geospatial Query Processing with Deep Learning

Authors: Dimitri Vorona, Andreas Kipf, Thomas Neumann, Alfons Kemper

Abstract: The amount of the available geospatial data grows at an ever faster pace. This leads to the constantly increasing demand for processing power and storage in order to provide data analysis in a timely manner. At the same time, a lot of geospatial processing is visual and exploratory in nature, thus having bounded precision requirements. We present DeepSPACE, a deep learning-based approximate geospa… ▽ More The amount of the available geospatial data grows at an ever faster pace. This leads to the constantly increasing demand for processing power and storage in order to provide data analysis in a timely manner. At the same time, a lot of geospatial processing is visual and exploratory in nature, thus having bounded precision requirements. We present DeepSPACE, a deep learning-based approximate geospatial query processing engine which combines modest hardware requirements with the ability to answer flexible aggregation queries while keeping the required state to a few hundred KiBs. △ Less

Submitted 14 June, 2019; originally announced June 2019.

arXiv:1904.08223 [pdf, other]

Estimating Cardinalities with Deep Sketches

Authors: Andreas Kipf, Dimitri Vorona, Jonas Müller, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, Thomas Neumann, Alfons Kemper

Abstract: We introduce Deep Sketches, which are compact models of databases that allow us to estimate the result sizes of SQL queries. Deep Sketches are powered by a new deep learning approach to cardinality estimation that can capture correlations between columns, even across tables. Our demonstration allows users to define such sketches on the TPC-H and IMDb datasets, monitor the training process, and run… ▽ More We introduce Deep Sketches, which are compact models of databases that allow us to estimate the result sizes of SQL queries. Deep Sketches are powered by a new deep learning approach to cardinality estimation that can capture correlations between columns, even across tables. Our demonstration allows users to define such sketches on the TPC-H and IMDb datasets, monitor the training process, and run ad-hoc queries against trained sketches. We also estimate query cardinalities with HyPer and PostgreSQL to visualize the gains over traditional cardinality estimators. △ Less

Submitted 17 April, 2019; originally announced April 2019.

Comments: To appear in SIGMOD'19

arXiv:1809.00677 [pdf, other]

Learned Cardinalities: Estimating Correlated Joins with Deep Learning

Authors: Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, Alfons Kemper

Abstract: We describe a new deep learning approach to cardinality estimation. MSCN is a multi-set convolutional network, tailored to representing relational query plans, that employs set semantics to capture query features and true cardinalities. MSCN builds on sampling-based estimation, addressing its weaknesses when no sampled tuples qualify a predicate, and in capturing join-crossing correlations. Our ev… ▽ More We describe a new deep learning approach to cardinality estimation. MSCN is a multi-set convolutional network, tailored to representing relational query plans, that employs set semantics to capture query features and true cardinalities. MSCN builds on sampling-based estimation, addressing its weaknesses when no sampled tuples qualify a predicate, and in capturing join-crossing correlations. Our evaluation of MSCN using a real-world dataset shows that deep learning significantly enhances the quality of cardinality estimation, which is the core problem in query optimization. △ Less

Submitted 18 December, 2018; v1 submitted 3 September, 2018; originally announced September 2018.

Comments: CIDR 2019. https://github.com/andreaskipf/learnedcardinalities

arXiv:1802.09488 [pdf, other]

Adaptive Geospatial Joins for Modern Hardware

Authors: Andreas Kipf, Harald Lang, Varun Pandey, Raul Alexandru Persa, Peter Boncz, Thomas Neumann, Alfons Kemper

Abstract: Geospatial joins are a core building block of connected mobility applications. An especially challenging problem are joins between streaming points and static polygons. Since points are not known beforehand, they cannot be indexed. Nevertheless, points need to be mapped to polygons with low latencies to enable real-time feedback. We present an adaptive geospatial join that uses true hit filterin… ▽ More Geospatial joins are a core building block of connected mobility applications. An especially challenging problem are joins between streaming points and static polygons. Since points are not known beforehand, they cannot be indexed. Nevertheless, points need to be mapped to polygons with low latencies to enable real-time feedback. We present an adaptive geospatial join that uses true hit filtering to avoid expensive geometric computations in most cases. Our technique uses a quadtree-based hierarchical grid to approximate polygons and stores these approximations in a specialized radix tree. We emphasize on an approximate version of our algorithm that guarantees a user-defined precision. The exact version of our algorithm can adapt to the expected point distribution by refining the index. We optimized our implementation for modern hardware architectures with wide SIMD vector processing units, including Intel's brand new Knights Landing. Overall, our approach can perform up to two orders of magnitude faster than existing techniques. △ Less

Submitted 26 February, 2018; originally announced February 2018.

Showing 1–21 of 21 results for author: Kipf, A