Abstract
In our data-centric society, online services, decision making, and other aspects are increasingly becoming heavily dependent on trends and patterns extracted from data. A broad class of societal-scale data management problems requires system support for processing unbounded data with low latency and high throughput. Large-scale data stream processing systems perceive data as infinite streams and are designed to satisfy such requirements. They have further evolved substantially both in terms of expressive programming model support and also efficient and durable runtime execution on commodity clusters. Expressive programming models offer convenient ways to declare continuous data properties and applied computations, while hiding details on how these data streams are physically processed and orchestrated in a distributed environment. Execution engines provide a runtime for such models further allowing for scalable yet durable execution of any declared computation. In this chapter we introduce the major design aspects of large scale data stream processing systems, covering programming model abstraction levels and runtime concerns. We then present a detailed case study on stateful stream processing with Apache Flink, an open-source stream processor that is used for a wide variety of processing tasks. Finally, we address the main challenges of disruptive applications that large-scale data streaming enables from a systemic point of view.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
Generally in functional programming, higher-order functions might also produce functions as their outputs, but this does not appear in stream processing.
- 2.
Mind that sum in this example is a pre-defined aggregation function, however, a UDF can also be typically provided to declare an incremental computation.
References
Apache Hadoop project, https://hadoop.apache.org/
Apache Kafka project, http://kafka.apache.org/
Apache Samza project, http://samza.apache.org/
Apache Spark project, http://spark.apache.org/
Apache Storm project, http://storm.apache.org/
D.J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, S. Zdonik, Aurora: a new model and architecture for data stream management, in VLDBJ (2003)
D.J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J.H. Hwang, W. Lindner, A. Maskey, A. Rasin, E. Ryvkina et al., The design of the Borealis stream processing engine, in CIDR (2005)
K.J. Ahn, S. Guha, A. McGregor, Graph sketches: sparsification, spanners, and subgraphs, in Proceedings of the 31st symposium on Principles of Database Systems. ACM (2012), pp. 5–14
T. Akidau, A. Balikov, K. Bekiroglu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, S. Whittle, MillWheel: Fault-tolerant stream processing at internet scale, in VLDB (2013)
T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R.J. Fernández-Moctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt et al, The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing, in VLDB (2015)
A. Alexandrov, R. Bergmann, S. Ewen, J.C. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl et al., The Stratosphere platform for big data analytics. VLDB J. - Int. J. Very Large Data Bases 23(6), 939–964 (2014)
A. Alexandrov, A. Kunft, A. Katsifodimos, F. Schüler, L. Thamsen, O. Kao, T. Herb, V. Markl, Implicit parallelism through deep language embedding, in ACM SIGMOD (2015), pp. 47–61
A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, J. Widom, Stream: The stanford data stream management system, Book chapter (2004)
A. Arasu, M. Cherniack, E. Galvez, D. Maier, A.S. Maskey, E. Ryvkina, M. Stonebraker, R. Tibbetts, Linear road: a stream data management benchmark. in Proceedings of the Thirtieth International Conference on Very Large Data Bases, VLDB Endowment, vol. 30 (2004), pp. 480–491
A. Arasu, S. Babu, J. Widom, The CQL continuous query language: semantic foundations and query execution, in VLDBJ (2006)
M. Balazinska, H. Balakrishnan, S.R. Madden, M. Stonebraker, Fault-tolerance in the Borealis distributed stream processing system. ACM Trans. Database Syst. (TODS) 33(1), 3 (2008)
M. Balazinska, J.H. Hwang, M.A. Shah, Fault-tolerance and high availability in data stream management systems., in Encyclopedia of Database Systems (Springer, 2009), pp. 1109–1115
L. Becchetti, P. Boldi, C. Castillo, A. Gionis, Efficient semi-streaming algorithms for local triangle counting in massive graphs, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2008), pp. 16–24
Benchmarking streaming computation engines at Yahoo! https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
T. Bernhardt, A. Vasseur, Esper: Event Stream Processing and Correlation. ON-Java (O’Reilly, Springfield, 2007)
A. Bifet, R. Gavaldà, Adaptive learning from evolving data streams, in Advances in Intelligent Data Analysis VIII (Springer, Berlin, 2009), pp. 249–260
A. Bifet, G. Holmes, R. Kirkby, B. Pfahringer, Moa: Massive online analysis. J. Mach. Learn. Res. 11, 1601–1604 (2010)
I. Botan, R. Derakhshan, N. Dindar, L. Haas, R.J. Miller, N. Tatbul, Secret: A model for analysis of the execution semantics of stream processing systems, in VLDB (2010)
L. Brenna, A. Demers, J. Gehrke, M. Hong, J. Ossher, B. Panda, M. Riedewald, M. Thatte, W. White, Cayuga: a high-performance event processing engine, in Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data (ACM, 2007), pp. 1100–1102
P. Carbone, K. Vandikas, F. Zaloshnja, Towards highly available complex event processing deployments in the cloud, in Seventh International Conference on Next Generation Mobile Apps, Services and Technologies (NGMAST) (IEEE, 2013), pp. 153–158
P. Carbone, S. Ewen, S. Haridi, A. Katsifodimos, V. Markl, K. Tzoumas, Apache Flink: Stream and batch processing in a single engine. IEEE Data Engineering Bulletin (2015)
P. Carbone, G. Fóra, S. Ewen, S. Haridi, K. Tzoumas, Lightweight asynchronous snapshots for distributed dataflows (2015). arXiv preprint arXiv:1506.08603
P. Carbone, J. Traub, A. Katsifodimos, S. Haridi, V. Markl, Cutty: Aggregate sharing for user-defined windows, in Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (ACM, 2016)
A. Carzaniga, D.S. Rosenblum, A.L. Wolf, Design and evaluation of a wide-area event notification service. ACM Trans. Comput. Syst. (TOCS) 19(3), 332–383 (2001)
R. Castro Fernandez, M. Migliavacca, E. Kalyvianaki, P. Pietzuch, Integrating scale out and fault tolerance in stream processing using operator state management, in Proceedings of the 2013 ACM SIGMOD international conference on Management of data (ACM, 2013), pp. 725–736
U. Cetintemel, J. Du, T. Kraska, S. Madden, D. Maier, J. Meehan, A. Pavlo, M. Stonebraker, E. Sutherland, N. Tatbul et al., S-store: A streaming newSQL system for big velocity applications. Proc. VLDB Endow. 7(13), 1633–1636 (2014)
C. Chambers, A. Raniwala, F. Perry, S. Adams, R.R. Henry, R. Bradshaw, N. Weizenbaum, FlumeJava: easy, efficient data-parallel pipelines, in ACM Sigplan Notices, vol. 45 (ACM, 2010), pp. 363–375
B. Chandramouli, J. Goldstein, M. Barnett, R. DeLine, D. Fisher, J.C. Platt, J.F. Terwilliger, J. Wernsing, Trill: A high-performance incremental query processor for diverse analytics. Proc. VLDB Endow. 8(4), 401–412 (2014)
S. Chandrasekaran, O. Cooper, A. Deshpande, M.J. Franklin, J.M. Hellerstein, W. Hong, S. Krishnamurthy, S.R. Madden, F. Reiss, M.A. Shah, TelegraphCQ: continuous dataflow processing, in Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (ACM, 2003), pp. 668–668
K.M. Chandy, L. Lamport, Distributed snapshots: determining global states of distributed systems. ACM Trans. Comput. Syst. (TOCS) 3(1), 63–75 (1985)
F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows, T. Chandra, A. Fikes, R.E. Gruber, Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst. (TOCS) 26(2), 4 (2008)
J. Chen, D.J. DeWitt, F. Tian, Y. Wang, Niagaracq: A scalable continuous query system for internet databases, in SIGMOD Record (ACM, 2000)
M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Cetintemel, Y. Xing, S.B. Zdonik, Scalable distributed stream processing. CIDR. 3, 257–268 (2003)
T. Condie, N. Conway, P. Alvaro, J.M. Hellerstein, K. Elmeleegy, R. Sears, Mapreduce online. NSDI. 10, 20 (2010)
G. Cugola, A. Margara, Processing flows of information: From data stream to complex event processing. ACM Comput. Surv. (CSUR) 44(3), 15 (2012)
U. Dayal, B. Blaustein, A. Buchmann, U. Chakravarthy, M. Hsu, R. Ledin, D. McCarthy, A. Rosenthal, S. Sarin, M.J. Carey et al., The HiPAC project: Combining active databases and timing constraints. ACM Sigmod Rec. 17(1), 51–70 (1988)
G. De Francisci Morales, A. Bifet, Samoa: Scalable advanced massive online analysis. J. Mach. Learn. Res. 16(1), 149–153 (2015)
J. Dean, S. Ghemawat, Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
N. Dindar, N. Tatbul, R.J. Miller, L.M. Haas, I. Botan, Modeling the execution semantics of stream processing engines with secret. VLDB J. 22(4), 421–446 (2013)
D. Elin, T. Risch, Amos II java interfaces. Uppsala University report (2000)
J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, J. Zhang, On graph problems in a semi-streaming model. Theor. Comput. Sci. 348(2), 207–216 (2005)
R.C. Fernandez, M. Migliavacca, E. Kalyvianaki, P. Pietzuch, Making state explicit for imperative big data processing, in Proceedings of the 2014 USENIX Annual Technical Conference (USENIX ATC 14) (2014), pp. 49–60
S. Gatziu, K.R. Dittrich, Samos: An active object-oriented database system. IEEE Data Eng. Bull. 15(1–4), 23–26 (1992)
B. Gedik, Partitioning functions for stateful data parallelism in stream processing. VLDB J. 23(4), 517–539 (2014)
Google Cloud Dataflow, https://cloud.google.com/dataflow/
W. Han, Y. Miao, K. Li, M. Wu, F. Yang, L. Zhou, V. Prabhakaran, W. Chen, E. Chen, Chronos: a graph engine for temporal graph analysis, in Proceedings of the Ninth European Conference on Computer Systems (ACM, 2014), p. 1
B. He, M. Yang, Z. Guo, R. Chen, B. Su, W. Lin, L. Zhou, Comet: batched stream processing for data intensive distributed computing, in Proceedings of the 1st ACM Symposium on Cloud Computing (ACM, 2010), pp. 63–74
M. Hirzel, H. Andrade, B. Gedik, V. Kumar, G. Losa, M. Nasgaard, R. Soule, K. Wu, SPL stream processing language specification. NewYork: IBMResearchDivisionTJ. WatsonResearchCenter, IBM ResearchReport: RC24897 (W0911–044) (2009)
M. Hirzel, H. Andrade, B. Gedik, G. Jacques-Silva, R. Khandekar, V. Kumar, M. Mendell, H. Nasgaard, S. Schneider, R. Soulé et al., IBM streams processing language: analyzing big data in motion. IBM J. Res. Develop. 57(3/4), 7–1 (2013)
M. Hirzel, R. Soulé, S. Schneider, B. Gedik, R. Grimm, A catalog of stream processing optimizations. ACM Comput. Surv. (CSUR) 46(4), 46 (2014)
Introduction to Kafka Streams, http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple
A. Iyer, L.E. Li, I. Stoica, CellIQ: real-time cellular network analytics at scale, in 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15) (2015), pp. 309–322
R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E.P. Jones, S. Madden, M. Stonebraker, Y. Zhang et al., H-store: a high-performance, distributed main memory transaction processing system. Proc. VLDB Endow. 1(2), 1496–1499 (2008)
K. Karanasos, A. Katsifodimos, I. Manolescu, Delta: Scalable data dissemination under capacity constraints. Proc. VLDB Endow. 7(4), 217–228 (2013)
J. Kreps, N. Narkhede, J. Rao et al, Kafka: A distributed messaging system for log processing. NetDB (2011)
S. Kulkarni, N. Bhagat, M. Fu, V. Kedigehalli, C. Kellogg, S. Mittal, J.M. Patel, K. Ramasamy, S. Taneja, Twitter Heron: Stream processing at scale, in ACM SIGMOD (2015)
A. Kyrola, G. Blelloch, C. Guestrin, Graphchi: Large-scale graph computation on just a pc, in Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12) (2012), pp. 31–46
A. Lakshman, P. Malik, Cassandra: a decentralized structured storage system. ACM SIGOPS Oper. Syst. Rev. 44(2), 35–40 (2010)
J. Li, D. Maier, K. Tufte, V. Papadimos, P.A. Tucker, Semantics and evaluation techniques for window aggregates in data streams, in ACM SIGMOD (2005)
L. Liu, C. Pu, W. Tang, Continual queries for internet scale event-driven information delivery. IEEE Trans. Knowl. Data Eng. 11(4), 610–628 (1999)
Y. Liu, B. Plale et al., Survey of publish subscribe event systems. Computer Science Dept, Indian University 16 (2003)
D. Logothetis, C. Olston, B. Reed, K.C. Webb, K. Yocum, Stateful bulk processing for incremental analytics, in Proceedings of the 1st ACM Symposium on Cloud Computing (ACM, 2010), pp. 51–62
D. Luckham, The power of events, vol. 204 (Addison-Wesley Reading, Boston, 2002)
G. Malewicz, M.H. Austern, A.J. Bik, J.C. Dehnert, I. Horn, N. Leiser, G. Czajkowski, Pregel: a system for large-scale graph processing, in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (ACM, 2010), pp. 135–146
N. Marz, J. Warren, Big Data: Principles and Best Practices of Scalable Realtime Data Systems (Manning Publications Co., Greenwich, 2015)
D. Mishra, SNOOP: an event specification language for active database systems. Ph.D. thesis, University of Florida (1991)
S.S. Muchnick, Advanced Compiler Design Implementation (Morgan Kaufmann, Burlington, 1997)
D.G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, M. Abadi, Naiad: a timely dataflow system, in ACM SOSP (2013)
L. Neumeyer, B. Robbins, A. Nair, A. Kesari, S4: Distributed stream computing platform, in Proceedings of the 2010 IEEE International Conference on Data Mining Workshops (IEEE, 2010), pp. 170–177
K. Patroumpas, T. Sellis, Window specification over data streams, in Current Trends in Database Technology–EDBT 2006 (Springer, Berlin, 2006), pp. 445–464
D. Peleg, A.A. Schäffer, Graph spanners. J. Graph Theory 13(1), 99–116 (1989)
M.A. Shah, J.M. Hellerstein, S. Chandrasekaran, M.J. Franklin, Flux: An adaptive partitioning operator for continuous query systems, in Proceedings of the 19th International Conference on Data Engineering (IEEE, 2003), pp. 25–36
M.A. Shah, J.M. Hellerstein, E. Brewer, Highly available, fault-tolerant, parallel dataflows, in Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (ACM, 2004), pp. 827–838
U. Srivastava, J. Widom, Flexible time management in data stream systems. in Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (ACM, 2004), pp. 263–274
StreamBase I: Streambase: Real-time, low latency data processing with a stream processing engine (2006)
J. Thaler, Semi-streaming algorithms for annotated graph streams (2014). arXiv preprint arXiv:1407.3462
The Apache APEX project, https://www.datatorrent.com/apex/
The Apache Beam System, https://wiki.apache.org/incubator/BeamProposal
The Kappa Architecture by Jay Kreps, http://milinda.pathirage.org/kappa-architecture.com/
The Trident Stream Processing Programming Model, http://storm.apache.org/releases/0.10.0/Trident-tutorial.html
A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J.M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham et al, Storm @ Twitter, in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (ACM, 2014), pp. 147–156
J. Webber, A programmatic introduction to Neo4j, in Proceedings of the 3rd Annual Conference on Systems, Programming, and Applications: Software For Humanity (ACM, 2012), pp. 217–218
R.S. Xin, J.E. Gonzalez, M.J. Franklin, I. Stoica, GraphX: A resilient distributed graph system on Spark, in First International Workshop on Graph Data Management Experiences and Systems (ACM, 2013), p. 2
M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, I. Stoica, Spark: Cluster computing with working sets. HotCloud 10, 10–10 (2010)
M. Zaharia, T. Das, H. Li, S. Shenker, I. Stoica, Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters, in Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Ccomputing (USENIX Association, 2012), pp. 10–10
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Carbone, P. et al. (2017). Large-Scale Data Stream Processing Systems. In: Zomaya, A., Sakr, S. (eds) Handbook of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-49340-4_7
Download citation
DOI: https://doi.org/10.1007/978-3-319-49340-4_7
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49339-8
Online ISBN: 978-3-319-49340-4
eBook Packages: Computer ScienceComputer Science (R0)