Skip to main content

Showing 1–21 of 21 results for author: Kipf, A

  1. arXiv:2410.14066  [pdf, other

    cs.DB cs.IR cs.LG

    Lightweight Correlation-Aware Table Compression

    Authors: Mihail Stoian, Alexander van Renen, Jan Kobiolka, Ping-Lin Kuo, Josif Grabocka, Andreas Kipf

    Abstract: The growing adoption of data lakes for managing relational data necessitates efficient, open storage formats that provide high scan performance and competitive compression ratios. While existing formats achieve fast scans through lightweight encoding techniques, they have reached a plateau in terms of minimizing storage footprint. Recently, correlation-aware compression schemes have been shown to… ▽ More

    Submitted 21 October, 2024; v1 submitted 17 October, 2024; originally announced October 2024.

    Comments: Third Table Representation Learning Workshop (TRL @ NeurIPS 2024)

  2. arXiv:2409.08013  [pdf, other

    cs.DB

    DPconv: Super-Polynomially Faster Join Ordering

    Authors: Mihail Stoian, Andreas Kipf

    Abstract: We revisit the join ordering problem in query optimization. The standard exact algorithm, DPccp, has a worst-case running time of $O(3^n)$. This is prohibitively expensive for large queries, which are not that uncommon anymore. We develop a new algorithmic framework based on subset convolution. DPconv achieves a super-polynomial speedup over DPccp, breaking the $O(3^n)$ time-barrier for the first… ▽ More

    Submitted 12 September, 2024; originally announced September 2024.

    Comments: To appear at SIGMOD 2025

  3. arXiv:2403.17229  [pdf, other

    cs.DB

    Corra: Correlation-Aware Column Compression

    Authors: Hanwen Liu, Mihail Stoian, Alexander van Renen, Andreas Kipf

    Abstract: Column encoding schemes have witnessed a spark of interest with the rise of open storage formats (like Parquet) in data lakes in modern cloud deployments. This is not surprising -- as data volume increases, it becomes more and more important to reduce storage cost on block storage (such as S3) as well as reduce memory pressure in multi-tenant in-memory buffers of cloud databases. However, single-c… ▽ More

    Submitted 17 June, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

    Comments: Submitted to CloudDB'24

  4. arXiv:2309.06354  [pdf, other

    cs.DB

    Enhancing In-Memory Spatial Indexing with Learned Search

    Authors: Varun Pandey, Alexander van Renen, Eleni Tzirita Zacharatou, Andreas Kipf, Ibrahim Sabek, Jialin Ding, Volker Markl, Alfons Kemper

    Abstract: Spatial data is ubiquitous. Massive amounts of data are generated every day from a plethora of sources such as billions of GPS-enabled devices (e.g., cell phones, cars, and sensors), consumer-based applications (e.g., Uber and Strava), and social media platforms (e.g., location-tagged posts on Facebook, Twitter, and Instagram). This exponential growth in spatial data has led the research community… ▽ More

    Submitted 12 September, 2023; originally announced September 2023.

    Comments: arXiv admin note: text overlap with arXiv:2008.10349

  5. arXiv:2205.05769  [pdf, other

    cs.DB cs.LG

    LSI: A Learned Secondary Index Structure

    Authors: Andreas Kipf, Dominik Horn, Pascal Pfeil, Ryan Marcus, Tim Kraska

    Abstract: Learned index structures have been shown to achieve favorable lookup performance and space consumption compared to their traditional counterparts such as B-trees. However, most learned index studies have focused on the primary indexing setting, where the base data is sorted. In this work, we investigate whether learned indexes sustain their advantage in the secondary indexing setting. We introduce… ▽ More

    Submitted 11 May, 2022; originally announced May 2022.

    Comments: Fifth International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM 2022)

  6. arXiv:2111.14905  [pdf, other

    cs.DB cs.LG

    Bounding the Last Mile: Efficient Learned String Indexing

    Authors: Benjamin Spector, Andreas Kipf, Kapil Vaidya, Chi Wang, Umar Farooq Minhas, Tim Kraska

    Abstract: We introduce the RadixStringSpline (RSS) learned index structure for efficiently indexing strings. RSS is a tree of radix splines each indexing a fixed number of bytes. RSS approaches or exceeds the performance of traditional string indexes while using 7-70$\times$ less memory. RSS achieves this by using the minimal string prefix to sufficiently distinguish the data unlike most learned approaches… ▽ More

    Submitted 29 November, 2021; originally announced November 2021.

    Comments: 3rd International Workshop on Applied AI for Database Systems and Applications (AIDB'21), August 20, 2021, Copenhagen, Denmark

  7. arXiv:2108.05117  [pdf, other

    cs.DB cs.LG

    Towards Practical Learned Indexing

    Authors: Mihail Stoian, Andreas Kipf, Ryan Marcus, Tim Kraska

    Abstract: Latest research proposes to replace existing index structures with learned models. However, current learned indexes tend to have many hyperparameters, often do not provide any error guarantees, and are expensive to build. We introduce Practical Learned Index (PLEX). PLEX only has a single hyperparameter $ε$ (maximum prediction error) and offers a better trade-off between build and lookup time than… ▽ More

    Submitted 6 November, 2021; v1 submitted 11 August, 2021; originally announced August 2021.

    Comments: 3rd International Workshop on Applied AI for Database Systems and Applications (AIDB'21), August 20, 2021, Copenhagen, Denmark

  8. arXiv:2107.01464  [pdf, other

    cs.DB

    When Are Learned Models Better Than Hash Functions?

    Authors: Ibrahim Sabek, Kapil Vaidya, Dominik Horn, Andreas Kipf, Tim Kraska

    Abstract: In this work, we aim to study when learned models are better hash functions, particular for hash-maps. We use lightweight piece-wise linear models to replace the hash functions as they have small inference times and are sufficiently general to capture complex distributions. We analyze the learned models in terms of: the model inference time and the number of collisions. Surprisingly, we found that… ▽ More

    Submitted 3 July, 2021; originally announced July 2021.

  9. LEA: A Learned Encoding Advisor for Column Stores

    Authors: Lujing Cen, Andreas Kipf, Ryan Marcus, Tim Kraska

    Abstract: Data warehouses organize data in a columnar format to enable faster scans and better compression. Modern systems offer a variety of column encodings that can reduce storage footprint and improve query performance. Selecting a good encoding scheme for a particular column is an optimization problem that depends on the data, the query workload, and the underlying hardware. We introduce Learned Encodi… ▽ More

    Submitted 18 May, 2021; originally announced May 2021.

  10. arXiv:2101.04964  [pdf, other

    cs.DB

    Flow-Loss: Learning Cardinality Estimates That Matter

    Authors: Parimarjan Negi, Ryan Marcus, Andreas Kipf, Hongzi Mao, Nesime Tatbul, Tim Kraska, Mohammad Alizadeh

    Abstract: Previous approaches to learned cardinality estimation have focused on improving average estimation error, but not all estimates matter equally. Since learned models inevitably make mistakes, the goal should be to improve the estimates that make the biggest difference to an optimizer. We introduce a new loss function, Flow-Loss, that explicitly optimizes for better query plans by approximating the… ▽ More

    Submitted 13 January, 2021; originally announced January 2021.

  11. arXiv:2010.12548  [pdf, other

    cs.DB

    The Case for Distance-Bounded Spatial Approximations

    Authors: Eleni Tzirita Zacharatou, Andreas Kipf, Ibrahim Sabek, Varun Pandey, Harish Doraiswamy, Volker Markl

    Abstract: Spatial approximations have been traditionally used in spatial databases to accelerate the processing of complex geometric operations. However, approximations are typically only used in a first filtering step to determine a set of candidate spatial objects that may fulfill the query condition. To provide accurate results, the exact geometries of the candidate objects are tested against the query c… ▽ More

    Submitted 21 January, 2021; v1 submitted 23 October, 2020; originally announced October 2020.

    Comments: 11th Annual Conference on Innovative Data Systems Research (CIDR'21)

  12. arXiv:2008.10349  [pdf, other

    cs.DB cs.LG

    The Case for Learned Spatial Indexes

    Authors: Varun Pandey, Alexander van Renen, Andreas Kipf, Ibrahim Sabek, Jialin Ding, Alfons Kemper

    Abstract: Spatial data is ubiquitous. Massive amounts of data are generated every day from billions of GPS-enabled devices such as cell phones, cars, sensors, and various consumer-based applications such as Uber, Tinder, location-tagged posts in Facebook, Twitter, Instagram, etc. This exponential growth in spatial data has led the research community to focus on building systems and applications that can pro… ▽ More

    Submitted 24 August, 2020; originally announced August 2020.

  13. Benchmarking Learned Indexes

    Authors: Ryan Marcus, Andreas Kipf, Alexander van Renen, Mihail Stoian, Sanchit Misra, Alfons Kemper, Thomas Neumann, Tim Kraska

    Abstract: Recent advancements in learned index structures propose replacing existing index structures, like B-Trees, with approximate learned models. In this work, we present a unified benchmark that compares well-tuned implementations of three learned index structures against several state-of-the-art "traditional" baselines. Using four real-world datasets, we demonstrate that learned index structures can i… ▽ More

    Submitted 29 June, 2020; v1 submitted 23 June, 2020; originally announced June 2020.

  14. Fast Mapping onto Census Blocks

    Authors: Jeremy Kepner, Andreas Kipf, Darren Engwirda, Navin Vembar, Michael Jones, Lauren Milechin, Vijay Gadepally, Chris Hill, Tim Kraska, William Arcand, David Bestor, William Bergeron, Chansup Byun, Matthew Hubbell, Michael Houle, Andrew Kirby, Anna Klein, Julie Mullen, Andrew Prout, Albert Reuther, Antonio Rosa, Sid Samsi, Charles Yee, Peter Michaleas

    Abstract: Pandemic measures such as social distancing and contact tracing can be enhanced by rapidly integrating dynamic location data and demographic data. Projecting billions of longitude and latitude locations onto hundreds of thousands of highly irregular demographic census block polygons is computationally challenging in both research and deployment contexts. This paper describes two approaches labeled… ▽ More

    Submitted 1 August, 2020; v1 submitted 6 May, 2020; originally announced May 2020.

    Comments: 8 pages, 7 figures, 55 references; accepted to IEEE HPEC 2020

  15. arXiv:2004.14541  [pdf, other

    cs.DB cs.LG

    RadixSpline: A Single-Pass Learned Index

    Authors: Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, Thomas Neumann

    Abstract: Recent research has shown that learned models can outperform state-of-the-art index structures in size and lookup performance. While this is a very promising result, existing learned structures are often cumbersome to implement and are slow to build. In fact, most approaches that we are aware of require multiple training passes over the data. We introduce RadixSpline (RS), a learned index that c… ▽ More

    Submitted 22 May, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

    Comments: Third International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM 2020)

  16. arXiv:1911.13014  [pdf, other

    cs.DB cs.DS cs.LG

    SOSD: A Benchmark for Learned Indexes

    Authors: Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, Thomas Neumann

    Abstract: A groundswell of recent work has focused on improving data management systems with learned components. Specifically, work on learned index structures has proposed replacing traditional index structures, such as B-trees, with learned models. Given the decades of research committed to improving index structures, there is significant skepticism about whether learned indexes actually outperform state-… ▽ More

    Submitted 29 November, 2019; originally announced November 2019.

    Comments: NeurIPS 2019 Workshop on Machine Learning for Systems

  17. GeoBlocks: A Query-Cache Accelerated Data Structure for Spatial Aggregation over Polygons

    Authors: Christian Winter, Andreas Kipf, Christoph Anneser, Eleni Tzirita Zacharatou, Thomas Neumann, Alfons Kemper

    Abstract: As individual traffic and public transport in cities are changing, city authorities need to analyze urban geospatial data to improve transportation and infrastructure. To that end, they highly rely on spatial aggregation queries that extract summarized information from point data (e.g., Uber rides) contained in a given polygonal region (e.g., a city neighborhood). To support such queries, current… ▽ More

    Submitted 16 March, 2021; v1 submitted 21 August, 2019; originally announced August 2019.

    Comments: Accepted at EDBT 2021, please cite the EDBT version

  18. arXiv:1906.06085  [pdf, other

    cs.DB

    DeepSPACE: Approximate Geospatial Query Processing with Deep Learning

    Authors: Dimitri Vorona, Andreas Kipf, Thomas Neumann, Alfons Kemper

    Abstract: The amount of the available geospatial data grows at an ever faster pace. This leads to the constantly increasing demand for processing power and storage in order to provide data analysis in a timely manner. At the same time, a lot of geospatial processing is visual and exploratory in nature, thus having bounded precision requirements. We present DeepSPACE, a deep learning-based approximate geospa… ▽ More

    Submitted 14 June, 2019; originally announced June 2019.

  19. arXiv:1904.08223  [pdf, other

    cs.DB

    Estimating Cardinalities with Deep Sketches

    Authors: Andreas Kipf, Dimitri Vorona, Jonas Müller, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, Thomas Neumann, Alfons Kemper

    Abstract: We introduce Deep Sketches, which are compact models of databases that allow us to estimate the result sizes of SQL queries. Deep Sketches are powered by a new deep learning approach to cardinality estimation that can capture correlations between columns, even across tables. Our demonstration allows users to define such sketches on the TPC-H and IMDb datasets, monitor the training process, and run… ▽ More

    Submitted 17 April, 2019; originally announced April 2019.

    Comments: To appear in SIGMOD'19

  20. arXiv:1809.00677  [pdf, other

    cs.DB

    Learned Cardinalities: Estimating Correlated Joins with Deep Learning

    Authors: Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, Alfons Kemper

    Abstract: We describe a new deep learning approach to cardinality estimation. MSCN is a multi-set convolutional network, tailored to representing relational query plans, that employs set semantics to capture query features and true cardinalities. MSCN builds on sampling-based estimation, addressing its weaknesses when no sampled tuples qualify a predicate, and in capturing join-crossing correlations. Our ev… ▽ More

    Submitted 18 December, 2018; v1 submitted 3 September, 2018; originally announced September 2018.

    Comments: CIDR 2019. https://github.com/andreaskipf/learnedcardinalities

  21. arXiv:1802.09488  [pdf, other

    cs.DB

    Adaptive Geospatial Joins for Modern Hardware

    Authors: Andreas Kipf, Harald Lang, Varun Pandey, Raul Alexandru Persa, Peter Boncz, Thomas Neumann, Alfons Kemper

    Abstract: Geospatial joins are a core building block of connected mobility applications. An especially challenging problem are joins between streaming points and static polygons. Since points are not known beforehand, they cannot be indexed. Nevertheless, points need to be mapped to polygons with low latencies to enable real-time feedback. We present an adaptive geospatial join that uses true hit filterin… ▽ More

    Submitted 26 February, 2018; originally announced February 2018.