Fast Concurrent Data Sketches

Published: 16 July 2019

Abstract

Data sketches are succinct approximate summaries of long data streams. They are widely used to process massive amounts of data and to answer statistical queries about it. Existing sketch libraries are very fast, but they do not support building a sketch with multiple threads or querying a sketch while it is being built. We present a generic approach to parallelising data sketches efficiently and allowing them to be queried in real time, while bounding the error that such parallelism introduces. Using relaxed semantics and the notion of strong linearisability, we prove our algorithm's correctness and analyse the error it induces in two specific sketches. Our implementation achieves high scalability while keeping the error small. We have contributed one of our concurrent sketches to the open-source data sketches library.
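To make the idea of trading a bounded error for parallelism concrete, here is a minimal illustrative sketch in Python. It is not the algorithm from the paper, whose details the abstract does not give. It assumes a simple K-minimum-values distinct-count sketch and a thread-local buffering scheme; the names `KMVSketch`, `BufferedConcurrentSketch`, `unit_hash`, and `run_demo` are all hypothetical.

```python
import hashlib
import heapq
import threading


def unit_hash(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1) using SHA-256."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    """Sequential K-minimum-values sketch for distinct counting (illustrative)."""

    def __init__(self, k: int) -> None:
        self.k = k
        self._heap = []        # negated hashes: a max-heap over the k smallest
        self._members = set()  # hashes currently retained, for de-duplication

    def update_hash(self, h: float) -> None:
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the largest retained hash with the new, smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))   # exact while under capacity
        return (self.k - 1) / -self._heap[0]


class BufferedConcurrentSketch:
    """Assumed scheme: each thread buffers updates locally, then merges them."""

    def __init__(self, k: int, buffer_size: int) -> None:
        self._global = KMVSketch(k)
        self._lock = threading.Lock()
        self._local = threading.local()
        self._buffer_size = buffer_size

    def update(self, item: str) -> None:
        buf = getattr(self._local, "buf", None)
        if buf is None:
            buf = self._local.buf = []
        buf.append(unit_hash(item))
        if len(buf) >= self._buffer_size:
            self.flush()

    def flush(self) -> None:
        """Propagate the calling thread's buffered hashes to the shared sketch."""
        buf = getattr(self._local, "buf", None)
        if not buf:
            return
        with self._lock:
            for h in buf:
                self._global.update_hash(h)
        buf.clear()

    def estimate(self) -> float:
        """Query at any time; sees only updates that were already propagated."""
        with self._lock:
            return self._global.estimate()


def run_demo(num_threads: int = 4, per_thread: int = 100) -> float:
    sketch = BufferedConcurrentSketch(k=1024, buffer_size=16)

    def worker(tid: int) -> None:
        for i in range(per_thread):
            sketch.update(f"t{tid}-{i}")
        sketch.flush()  # propagate the remainder before the thread exits

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sketch.estimate()
```

In this sketch a query can miss at most `buffer_size - 1` completed updates per writer thread, so the staleness a query observes is bounded by the number of writers times that amount. This is one simple way a relaxation can be quantified; the paper's own bounds and proofs should be consulted for the actual result.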

Cited By

  • (2020) SwiShmem. Proceedings of the 19th ACM Workshop on Hot Topics in Networks, 160-167. DOI: 10.1145/3422604.3425946. Online publication date: 4-Nov-2020
  • (2020) Fast concurrent data sketches. Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 117-129. DOI: 10.1145/3332466.3374512. Online publication date: 19-Feb-2020

Published In

PODC '19: Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing
July 2019
563 pages
ISBN:9781450362177
DOI:10.1145/3293611
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. analysis of distributed algorithms
  2. concurrency
  3. design
  4. persistence
  5. synchronization

Qualifiers

  • Abstract

Conference

PODC '19: ACM Symposium on Principles of Distributed Computing
July 29 - August 2, 2019
Toronto ON, Canada

Acceptance Rates

PODC '19 Paper Acceptance Rate 48 of 173 submissions, 28%;
Overall Acceptance Rate 740 of 2,477 submissions, 30%
