research-article

Learning to optimize federated queries

Authors:

Richard L. Cole,

Daniel TingAuthors Info & Claims

aiDM '19: Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management

Article No.: 2, Pages 1 - 7

https://doi.org/10.1145/3329859.3329873

Published: 05 July 2019 Publication History

Abstract

Query optimization is challenging for any database system, even with a clear understanding of its inner workings. Consider then, query planning for a federation of third-party data sources where little detail is known. This is exactly the challenge of orchestrating data execution and movement faced by Tableau's cross-database joins feature, where the data of a query originates from two or more data sources. In this paper, we present our work on using machine learning techniques to address one of the most fundamental challenges in federated query optimization: the dynamic designation of a federation engine. Our machine learning model learns the performance and data characteristics of a system by extracting features from query plans. We further extend the ability of our model to manipulate database settings on a per query level. Our experimental results demonstrate that we can achieve a speedup of up to 10.7x compared to an existing federated query optimizer.

References

[1]

2019. Apache AsterixDB. (2019). Retrieved 2019-03-18 from https://asterixdb.apache.org/

[2]

2019. Integrate your data with cross-database joins in Tableau 10. (2019). Retrieved 2019-03-12 from https://www.tableau.com/about/blog/2016/7/integrate-your-data-cross-database-joins-56724

[3]

2019. Join Your Data - Tableau. (2019). Retrieved 2019-03-12 from https://onlinehelp.tableau.com/current/pro/desktop/en-us/joining_tables.htm#about-queries-and-crossdatabase-joins

[4]

2019. PostgreSQL: Documentation: 10: F.34.Âăpostgres_fdw. (2019). Retrieved 2019-03-06 from https://www.postgresql.org/docs/10/postgres-fdw.html

[5]

2019. PostgreSQL: The world's most advanced open source database. (2019). Retrieved 2019-03-06 from https://www.postgresql.org/

[6]

2019. Presto | Distributed SQL Query Engine for Big Data. (2019). Retrieved 2019-03-18 from http://prestodb.github.io/

[7]

2019. TPC-H - Homepage. (2019). Retrieved 2019-03-16 from http://www.tpc.org/tpch/

[8]

Divyakant Agrawal, Sanjay Chawla, Bertty Contreras-Rojas, Ahmed K. Elmagarmid, Yasser Idris, Zoi Kaoudi, Sebastian Kruse, Ji Lucas, Essam Mansour, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Saravanan Thirumuruganathan, and Anis Troudi. 2018. RHEEM: Enabling Cross-Platform Data Processing - May The Big Data Be With You! -. PVLDB 11 (2018), 1414--1427.

Digital Library

[9]

Mert Akdere, Ugur Çetintemel, Matteo Riondato, Eli Upfal, and Stanley B Zdonik. 2012. Learning-based query performance modeling and prediction. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on. IEEE, 390--401.

Digital Library

[10]

Michael Armbrust, Reynold S Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K Bradley, Xiangrui Meng, Tomer Kaftan, Michael J Franklin, Ali Ghodsi, et al. 2015. Spark sql: Relational data processing in spark. In Proceedings of the 2015 ACM SIGMOD international conference on management of data. ACM, 1383--1394.

Digital Library

[11]

Sudarshan Chawathe, Hector Garcia-Molina, Joachim Hammer, Kelly Ireland, Yannis Papakonstantinou, Jeffrey Ullman, and Jennifer Widom. 1994. The TSIMMIS project: Integration of heterogenous information sources. (1994).

[12]

Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magdalena Balazinska, Bill Howe, Jeremy Kepner, Samuel Madden, David Maier, Timothy G. Mattson, and Stanley B. Zdonik. 2015. The BigDAWG Polystore System. SIGMOD Record 44 (2015), 11--16.

Digital Library

[13]

Archana Ganapathi, Harumi Kuno, Umeshwar Dayal, Janet L Wiener, Armando Fox, Michael Jordan, and David Patterson. 2009. Predicting multiple metrics for queries: Better decisions enabled by machine learning. In Data Engineering, 2009. ICDE'09. IEEE 25th International Conference on. IEEE, 592--603.

Digital Library

[14]

Laura Haas, Donald Kossmann, Edward Wimmers, and Jun Yang. 1997. Optimizing queries across diverse data sources. (1997).

Digital Library

[15]

Michael Hausenblas and Jacques Nadeau. 2013. Apache Drill: Interactive Ad-Hoc Analysis at Scale. Big data 1 2 (2013), 100--4.

[16]

Tomer Kaftan, Magdalena Balazinska, Alvin Cheung, and Johannes Gehrke. 2018. Cuttlefish: A lightweight primitive for adaptive query processing. arXiv preprint arXiv:1802.09180 (2018).

[17]

Manos Karpathiotakis, Avrilia Floratou, Fatma Özcan, and Anastasia Ailamaki. 2017. No data left behind: real-time insights from a complex data ecosystem. In Proceedings of the 2017 Symposium on Cloud Computing. ACM, 108--120.

Digital Library

[18]

Alfons Kemper and Thomas Neumann. 2011. HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In 2011 IEEE 27th International Conference on Data Engineering. IEEE, 195--206.

Digital Library

[19]

Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter A. Boncz, and Alfons Kemper. 2019. Learned Cardinalities: Estimating Correlated Joins with Deep Learning. In CIDR. www.cidrdb.org. http://cidrdb.org/cidr2019/papers/p101-kipf-cidr19.pdf

[20]

Tim Kraska, Alex Beutel, Ed H Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data. ACM, 489--504.

Digital Library

[21]

Sanjay Krishnan, Zongheng Yang, Ken Goldberg, Joseph Hellerstein, and Ion Stoica. 2018. Learning to Optimize Join Queries With Deep Reinforcement Learning. (2018). arXiv:cs.DB/1808.03196

[22]

Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How good are query optimizers, really? Proceedings of the VLDB Endowment 9, 3 (2015), 204--215.

Digital Library

[23]

Jiexing Li, Arnd Christian König, Vivek Narasayya, and Surajit Chaudhuri. 2012. Robust estimation of resource consumption for sql queries using statistical techniques. Proceedings of the VLDB Endowment 5, 11 (2012), 1555--1566.

Digital Library

[24]

Tanu Malik, Randal C Burns, and Nitesh V Chawla. 2007. A Black-Box Approach to Query Cardinality Estimation. In CIDR. 56--67.

[25]

Ryan Marcus and Olga Papaemmanouil. 2018. Deep Reinforcement Learning for Join Order Enumeration. In Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM@SIGMOD 2018, Houston, TX, USA, June 10, 2018, Rajesh Bordawekar and Oded Shmueli (Eds.). ACM, 3:1--3:4.

Digital Library

[26]

Mary Tork Roth, Laura M Haas, and Fatma Ozcan. 1999. Cost models do matter: Providing cost information for diverse data sources in a federated system. IBM Thomas J. Watson Research Division.

[27]

Mary Tork Roth and Peter M Schwarz. 1997. Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources. In VLDB, Vol. 97. 25--29.

[28]

Ran Tan, Rada Chirkova, Vijay Gadepally, and Timothy G Mattson. 2017. Enabling query processing across heterogeneous data models: A survey. In 2017 IEEE International Conference on Big Data (Big Data). IEEE, 3211--3220.

[29]

Anthony Tomasic, Louiqa Raschid, and Patrick Valduriez. 1998. Scaling access to heterogeneous data sources with DISCO. IEEE Transactions on knowledge and Data Engineering 10, 5 (1998), 808--823.

Digital Library

[30]

Dana Van Aken, Andrew Pavlo, Geoffrey J Gordon, and Bohan Zhang. 2017. Automatic database management system tuning through large-scale machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data. ACM, 1009--1024.

Digital Library

[31]

Jingjing Wang, Tobin Baker, Magdalena Balazinska, Daniel Halperin, Brandon Haynes, Bill Howe, Dylan Hutchison, Shrainik Jain, Ryan Maas, Parmita Mehta, Dominik Moritz, Brandon Myers, Jennifer Ortiz, Dan Suciu, Andrew Whitaker, and Shengliang Xu. 2017. The Myria Big Data Management and Analytics System and Cloud Services. In CIDR.

Cited By

Gu RZhang YYin LSong LHuang WYuan CWang ZZhu GHuang Y(2023)Coral: federated query join order optimization based on deep reinforcement learningWorld Wide Web10.1007/s11280-023-01156-026:5(3093-3118)Online publication date: 12-Jun-2023
https://doi.org/10.1007/s11280-023-01156-0

Recommendations

Decomposing federated queries in presence of replicated fragments

Federated query engines allow for linked data consumption using SPARQL endpoints. Replicating data fragments from different sources enables data re-organization and provides the basis for more effective and efficient federated query processing. However, ...
PFed: Recommending Plausible Federated SPARQL Queries
Database and Expert Systems Applications
Abstract
Federated SPARQL queries allow to query multiple interlinked datasets hosted by remote SPARQL endpoints. However, finding federated queries over a growing number of datasets is challenging. In this paper, we propose PFed, an approach to recommend ...
Exploiting maximal redundancy to optimize SQL queries

Detecting and dealing with redundancy is an ubiquitous problem in query optimization, which manifests itself in many areas of research such as materialized views, multi-query optimization, and query-containment algorithms. In this paper, we focus on the ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

aiDM '19: Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management

July 2019

53 pages

ISBN:9781450368025

DOI:10.1145/3329859

Conference Chairs:
Rajesh Bordawekar
IBM T. J. Watson Research Center
,
Oded Shmueli
Technion

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 July 2019

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Qualifiers

Research-article

Conference

SIGMOD/PODS '19

Sponsor:

SIGMOD

SIGMOD/PODS '19: International Conference on Management of Data

July 5, 2019

Amsterdam, Netherlands

Acceptance Rates

aiDM '19 Paper Acceptance Rate 8 of 12 submissions, 67%;

Overall Acceptance Rate 19 of 26 submissions, 73%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1
Total Citations
View Citations
282
Total Downloads

Downloads (Last 12 months)34
Downloads (Last 6 weeks)2

Reflects downloads up to 24 Oct 2024

Other Metrics

View Author Metrics

Citations

Cited By

Gu RZhang YYin LSong LHuang WYuan CWang ZZhu GHuang Y(2023)Coral: federated query join order optimization based on deep reinforcement learningWorld Wide Web10.1007/s11280-023-01156-026:5(3093-3118)Online publication date: 12-Jun-2023
https://doi.org/10.1007/s11280-023-01156-0

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents