Serving Hybrid-Cloud SQL Interactive Queries at Twitter

  • Conference paper
Software Architecture (ECSA 2021)

Abstract

Demand for data analytics at Twitter has increased steadily in recent years. To meet this demand and provide a highly scalable and available query experience, Twitter relies heavily on a large-scale in-house SQL system. Recently, we evolved the SQL system into a hybrid-cloud SQL federation system, compliant with Twitter’s Partly Cloudy strategy. The hybrid-cloud SQL federation system can process queries across Twitter’s data centers and the public cloud, interacting with around 10 PB of data per day.

This paper presents the design of the hybrid-cloud SQL federation system, which consists of query, cluster, and storage federations. We identify challenges in a modern SQL system and show how our system addresses them through several important design decisions. We also conduct qualitative examinations and summarize lessons learned from developing and operating such a SQL system.

Notes

  1. At Twitter, SQL system users cannot create or update datasets, except for exclusive temporary tables under their personal accounts. Due to the requirements for data lineage and governance, only data pipeline system accounts have write access to public datasets. Individual SQL users have read access only.

  2. In an analysis of a typical Twitter OLAP workload over three months, 19.2% of queries consumed more than 1 TB of peak memory.

Acknowledgment

Twitter’s SQL federation system is a complicated project that has evolved for years. We would like to express our gratitude to everyone who has served on Twitter’s Interactive Query team, including former team members Hao Luo, Yaliang Wang, Da Cheng, Fred Dai, and Maosong Fu. We also appreciate Prateek Mukhedkar, Vrushali Channapattan, Daniel Lipkin, Derek Lyon, Srikanth Thiagarajan, Jeremy Zogg, and Sudhir Srinivas for their strategic vision, direction, and support to the team. Finally, we thank Erica Hessel, Alex Angarita Rosales, and the anonymous ECSA reviewers for their informative comments, which considerably improved our paper.

Author information

Corresponding author

Correspondence to Chunxu Tang.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Tang, C. et al. (2022). Serving Hybrid-Cloud SQL Interactive Queries at Twitter. In: Scandurra, P., Galster, M., Mirandola, R., Weyns, D. (eds) Software Architecture. ECSA 2021. Lecture Notes in Computer Science, vol 13365. Springer, Cham. https://doi.org/10.1007/978-3-031-15116-3_1

  • DOI: https://doi.org/10.1007/978-3-031-15116-3_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-15115-6

  • Online ISBN: 978-3-031-15116-3

  • eBook Packages: Computer Science, Computer Science (R0)
