research-article

Challenges and Experiences in Building an Efficient Apache Beam Runner for IBM Streams

Authors:

John MacMillan,

Daniel Debrunner,

William Marshall,

Kun-Lung Wu

Proceedings of the VLDB Endowment, Volume 11, Issue 12

Pages 1742 - 1754

https://doi.org/10.14778/3229863.3229864

Published: 01 August 2018

Abstract

This paper describes the challenges and experiences in developing the IBM Streams runner for Apache Beam. Apache Beam is emerging as a common stream programming interface for multiple computing engines. Each participating engine implements a runner that translates Beam applications into engine-specific programs. Hence, applications written with the Beam SDK can be executed on different underlying stream computing engines with negligible migration penalty. IBM Streams is a widely used enterprise streaming platform. It has a rich set of connectors and toolkits for easy integration of streaming applications with other enterprise applications. It also supports a broad range of programming language interfaces, including Java, C++, Python, Stream Processing Language (SPL), and Apache Beam. This paper focuses on our solutions for efficiently supporting the Beam programming abstractions in the IBM Streams runner. Beam organizes data into discrete event-time windows. On the one hand, this design supports out-of-order data arrivals; on the other hand, it forces runners to maintain more state, which leads to higher space and computation overhead. The IBM Streams runner mitigates this problem by efficiently indexing inter-dependent states, garbage-collecting stale keys, and enforcing bundle sizes. We also share performance concerns in Beam that could potentially impact applications. Evaluations show that the IBM Streams runner outperforms the Flink runner and the Spark runner in most scenarios when running the Beam NEXMark benchmarks. The IBM Streams runner is available for download from the IBM Cloud Streaming Analytics service console.
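To make the state trade-off mentioned in the abstract concrete, the following is a minimal sketch in plain Python, not the IBM Streams runner's implementation and not the Beam API. It assumes fixed event-time windows, a per-(key, window) counter, and a watermark that triggers garbage collection of stale entries. The names `WindowedCounter`, `WINDOW_SIZE`, and `ALLOWED_LATENESS` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

WINDOW_SIZE = 10       # seconds per fixed window (illustrative value)
ALLOWED_LATENESS = 5   # seconds to keep a window open for late data (illustrative)


def window_start(event_time):
    """Map an event timestamp to the start of its fixed window."""
    return event_time - (event_time % WINDOW_SIZE)


class WindowedCounter:
    """Counts elements per (key, window). Hypothetical sketch, not runner code."""

    def __init__(self):
        self.state = defaultdict(int)  # (key, window_start) -> count
        self.watermark = 0

    def _expired(self, start):
        # A window is stale once the watermark passes its end plus the lateness.
        return start + WINDOW_SIZE + ALLOWED_LATENESS <= self.watermark

    def process(self, key, event_time):
        """Add one element; return False if its window was already collected."""
        start = window_start(event_time)
        if self._expired(start):
            return False
        self.state[(key, start)] += 1
        return True

    def advance_watermark(self, watermark):
        """Emit and discard every window that can no longer receive data."""
        self.watermark = max(self.watermark, watermark)
        stale = sorted(kw for kw in self.state if self._expired(kw[1]))
        return {kw: self.state.pop(kw) for kw in stale}


counter = WindowedCounter()
# The element with timestamp 3 arrives after the one with timestamp 12.
for key, ts in [("a", 1), ("a", 12), ("a", 3), ("b", 4)]:
    counter.process(key, ts)
fired = counter.advance_watermark(15)
late_accepted = counter.process("a", 2)
```

In this sketch the out-of-order element is still counted in its original window because state is retained until the watermark passes the window end plus the allowed lateness. That retention is the extra state the abstract refers to, and discarding stale entries is what bounds it.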

Published In

Proceedings of the VLDB Endowment Volume 11, Issue 12

August 2018

426 pages

ISSN:2150-8097

Editors:
Sihem Amer-Yahia (University of Grenoble Alpes, CNRS),
Jian Pei (Simon Fraser University)

Publisher

VLDB Endowment
