Serving Hybrid-Cloud SQL Interactive Queries at Twitter

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13365)

Included in the following conference series:

European Conference on Software Architecture

Abstract

The demand for data analytics at Twitter has increased consistently over the past years. To meet this demand and provide a highly scalable and available query experience, Twitter relies heavily on a large-scale in-house SQL system. Recently, we evolved the SQL system into a hybrid-cloud SQL federation system, compliant with Twitter’s Partly Cloudy strategy. The hybrid-cloud SQL federation system is capable of processing queries across Twitter’s data centers and the public cloud, interacting with around 10 PB of data per day.

In this paper, we present the design of the hybrid-cloud SQL federation system, which consists of query, cluster, and storage federations. We identify challenges in a modern SQL system and demonstrate how our system addresses them through several important design decisions. We also conduct qualitative examinations and summarize instructive lessons learned from the development and operation of such a SQL system.

Notes

1. At Twitter, SQL system users cannot create or update datasets, except for temporary tables exclusive to their personal accounts. Due to the requirements for data lineage and governance, only data pipeline system accounts have write access to public datasets. Individual SQL users have read access only.
2. From an analysis of a typical Twitter OLAP workload over three months, 19.2% of queries consume more than 1 TB of peak memory.

Acknowledgment

Twitter’s SQL federation system is a complicated project that has evolved for years. We would like to express our gratitude to everyone who has served on Twitter’s Interactive Query team, including former team members Hao Luo, Yaliang Wang, Da Cheng, Fred Dai, and Maosong Fu. We also appreciate Prateek Mukhedkar, Vrushali Channapattan, Daniel Lipkin, Derek Lyon, Srikanth Thiagarajan, Jeremy Zogg, and Sudhir Srinivas for their strategic vision, direction, and support to the team. Finally, we thank Erica Hessel, Alex Angarita Rosales, and the anonymous ECSA reviewers for their informative comments, which considerably improved our paper.

Author information

Authors and Affiliations

Twitter, Inc., San Francisco, USA
Chunxu Tang, Beinan Wang, Huijun Wu, Zhenzhao Wang, Yao Li, Vrushali Channapattan, Zhenxiao Luo, Ruchin Kabra, Mainak Ghosh, Nikhil Kantibhai Navadiya, Prachi Mishra, Prateek Mukhedkar & Anneliese Lu

Corresponding author

Correspondence to Chunxu Tang.

Editor information

Editors and Affiliations

Università di Bergamo, Dalmine, Italy
Patrizia Scandurra
University of Canterbury, Christchurch, New Zealand
Matthias Galster
Politecnico di Milano, Milan, Italy
Raffaela Mirandola
KU Leuven, Kortrijk, Belgium
Danny Weyns

About this paper

Cite this paper

Tang, C. et al. (2022). Serving Hybrid-Cloud SQL Interactive Queries at Twitter. In: Scandurra, P., Galster, M., Mirandola, R., Weyns, D. (eds) Software Architecture. ECSA 2021. Lecture Notes in Computer Science, vol 13365. Springer, Cham. https://doi.org/10.1007/978-3-031-15116-3_1

DOI: https://doi.org/10.1007/978-3-031-15116-3_1
Published: 19 August 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15115-6
Online ISBN: 978-3-031-15116-3
eBook Packages: Computer Science, Computer Science (R0)
