skip to main content
research-article

Learning to optimize federated queries

Published: 05 July 2019 Publication History

Abstract

Query optimization is challenging for any database system, even with a clear understanding of its inner workings. Consider then, query planning for a federation of third-party data sources where little detail is known. This is exactly the challenge of orchestrating data execution and movement faced by Tableau's cross-database joins feature, where the data of a query originates from two or more data sources. In this paper, we present our work on using machine learning techniques to address one of the most fundamental challenges in federated query optimization: the dynamic designation of a federation engine. Our machine learning model learns the performance and data characteristics of a system by extracting features from query plans. We further extend the ability of our model to manipulate database settings on a per query level. Our experimental results demonstrate that we can achieve a speedup of up to 10.7x compared to an existing federated query optimizer.

References

[1]
2019. Apache AsterixDB. (2019). Retrieved 2019-03-18 from https://asterixdb.apache.org/
[2]
2019. Integrate your data with cross-database joins in Tableau 10. (2019). Retrieved 2019-03-12 from https://www.tableau.com/about/blog/2016/7/integrate-your-data-cross-database-joins-56724
[3]
2019. Join Your Data - Tableau. (2019). Retrieved 2019-03-12 from https://onlinehelp.tableau.com/current/pro/desktop/en-us/joining_tables.htm#about-queries-and-crossdatabase-joins
[4]
2019. PostgreSQL: Documentation: 10: F.34.Âăpostgres_fdw. (2019). Retrieved 2019-03-06 from https://www.postgresql.org/docs/10/postgres-fdw.html
[5]
2019. PostgreSQL: The world's most advanced open source database. (2019). Retrieved 2019-03-06 from https://www.postgresql.org/
[6]
2019. Presto | Distributed SQL Query Engine for Big Data. (2019). Retrieved 2019-03-18 from http://prestodb.github.io/
[7]
2019. TPC-H - Homepage. (2019). Retrieved 2019-03-16 from http://www.tpc.org/tpch/
[8]
Divyakant Agrawal, Sanjay Chawla, Bertty Contreras-Rojas, Ahmed K. Elmagarmid, Yasser Idris, Zoi Kaoudi, Sebastian Kruse, Ji Lucas, Essam Mansour, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Saravanan Thirumuruganathan, and Anis Troudi. 2018. RHEEM: Enabling Cross-Platform Data Processing - May The Big Data Be With You! -. PVLDB 11 (2018), 1414--1427.
[9]
Mert Akdere, Ugur Çetintemel, Matteo Riondato, Eli Upfal, and Stanley B Zdonik. 2012. Learning-based query performance modeling and prediction. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on. IEEE, 390--401.
[10]
Michael Armbrust, Reynold S Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K Bradley, Xiangrui Meng, Tomer Kaftan, Michael J Franklin, Ali Ghodsi, et al. 2015. Spark sql: Relational data processing in spark. In Proceedings of the 2015 ACM SIGMOD international conference on management of data. ACM, 1383--1394.
[11]
Sudarshan Chawathe, Hector Garcia-Molina, Joachim Hammer, Kelly Ireland, Yannis Papakonstantinou, Jeffrey Ullman, and Jennifer Widom. 1994. The TSIMMIS project: Integration of heterogenous information sources. (1994).
[12]
Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magdalena Balazinska, Bill Howe, Jeremy Kepner, Samuel Madden, David Maier, Timothy G. Mattson, and Stanley B. Zdonik. 2015. The BigDAWG Polystore System. SIGMOD Record 44 (2015), 11--16.
[13]
Archana Ganapathi, Harumi Kuno, Umeshwar Dayal, Janet L Wiener, Armando Fox, Michael Jordan, and David Patterson. 2009. Predicting multiple metrics for queries: Better decisions enabled by machine learning. In Data Engineering, 2009. ICDE'09. IEEE 25th International Conference on. IEEE, 592--603.
[14]
Laura Haas, Donald Kossmann, Edward Wimmers, and Jun Yang. 1997. Optimizing queries across diverse data sources. (1997).
[15]
Michael Hausenblas and Jacques Nadeau. 2013. Apache Drill: Interactive Ad-Hoc Analysis at Scale. Big data 1 2 (2013), 100--4.
[16]
Tomer Kaftan, Magdalena Balazinska, Alvin Cheung, and Johannes Gehrke. 2018. Cuttlefish: A lightweight primitive for adaptive query processing. arXiv preprint arXiv:1802.09180 (2018).
[17]
Manos Karpathiotakis, Avrilia Floratou, Fatma Özcan, and Anastasia Ailamaki. 2017. No data left behind: real-time insights from a complex data ecosystem. In Proceedings of the 2017 Symposium on Cloud Computing. ACM, 108--120.
[18]
Alfons Kemper and Thomas Neumann. 2011. HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In 2011 IEEE 27th International Conference on Data Engineering. IEEE, 195--206.
[19]
Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter A. Boncz, and Alfons Kemper. 2019. Learned Cardinalities: Estimating Correlated Joins with Deep Learning. In CIDR. www.cidrdb.org. http://cidrdb.org/cidr2019/papers/p101-kipf-cidr19.pdf
[20]
Tim Kraska, Alex Beutel, Ed H Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data. ACM, 489--504.
[21]
Sanjay Krishnan, Zongheng Yang, Ken Goldberg, Joseph Hellerstein, and Ion Stoica. 2018. Learning to Optimize Join Queries With Deep Reinforcement Learning. (2018). arXiv:cs.DB/1808.03196
[22]
Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How good are query optimizers, really? Proceedings of the VLDB Endowment 9, 3 (2015), 204--215.
[23]
Jiexing Li, Arnd Christian König, Vivek Narasayya, and Surajit Chaudhuri. 2012. Robust estimation of resource consumption for sql queries using statistical techniques. Proceedings of the VLDB Endowment 5, 11 (2012), 1555--1566.
[24]
Tanu Malik, Randal C Burns, and Nitesh V Chawla. 2007. A Black-Box Approach to Query Cardinality Estimation. In CIDR. 56--67.
[25]
Ryan Marcus and Olga Papaemmanouil. 2018. Deep Reinforcement Learning for Join Order Enumeration. In Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM@SIGMOD 2018, Houston, TX, USA, June 10, 2018, Rajesh Bordawekar and Oded Shmueli (Eds.). ACM, 3:1--3:4.
[26]
Mary Tork Roth, Laura M Haas, and Fatma Ozcan. 1999. Cost models do matter: Providing cost information for diverse data sources in a federated system. IBM Thomas J. Watson Research Division.
[27]
Mary Tork Roth and Peter M Schwarz. 1997. Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources. In VLDB, Vol. 97. 25--29.
[28]
Ran Tan, Rada Chirkova, Vijay Gadepally, and Timothy G Mattson. 2017. Enabling query processing across heterogeneous data models: A survey. In 2017 IEEE International Conference on Big Data (Big Data). IEEE, 3211--3220.
[29]
Anthony Tomasic, Louiqa Raschid, and Patrick Valduriez. 1998. Scaling access to heterogeneous data sources with DISCO. IEEE Transactions on knowledge and Data Engineering 10, 5 (1998), 808--823.
[30]
Dana Van Aken, Andrew Pavlo, Geoffrey J Gordon, and Bohan Zhang. 2017. Automatic database management system tuning through large-scale machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data. ACM, 1009--1024.
[31]
Jingjing Wang, Tobin Baker, Magdalena Balazinska, Daniel Halperin, Brandon Haynes, Bill Howe, Dylan Hutchison, Shrainik Jain, Ryan Maas, Parmita Mehta, Dominik Moritz, Brandon Myers, Jennifer Ortiz, Dan Suciu, Andrew Whitaker, and Shengliang Xu. 2017. The Myria Big Data Management and Analytics System and Cloud Services. In CIDR.

Cited By

View all
  • (2023)Coral: federated query join order optimization based on deep reinforcement learningWorld Wide Web10.1007/s11280-023-01156-026:5(3093-3118)Online publication date: 12-Jun-2023

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
aiDM '19: Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management
July 2019
53 pages
ISBN:9781450368025
DOI:10.1145/3329859
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 July 2019

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '19
Sponsor:

Acceptance Rates

aiDM '19 Paper Acceptance Rate 8 of 12 submissions, 67%;
Overall Acceptance Rate 19 of 26 submissions, 73%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)34
  • Downloads (Last 6 weeks)2
Reflects downloads up to 24 Oct 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Coral: federated query join order optimization based on deep reinforcement learningWorld Wide Web10.1007/s11280-023-01156-026:5(3093-3118)Online publication date: 12-Jun-2023

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media