Challenges and experiences in building an efficient Apache Beam runner for IBM Streams

Published: 01 August 2018

Abstract

This paper describes the challenges and experiences in developing the IBM Streams runner for Apache Beam. Apache Beam is emerging as a common stream programming interface for multiple computing engines. Each participating engine implements a runner that translates Beam applications into engine-specific programs. Hence, applications written with the Beam SDK can be executed on different underlying stream computing engines with negligible migration penalty. IBM Streams is a widely used enterprise streaming platform. It has a rich set of connectors and toolkits for easy integration of streaming applications with other enterprise applications. It also supports a broad range of programming language interfaces, including Java, C++, Python, Stream Processing Language (SPL), and Apache Beam. This paper focuses on our solutions for efficiently supporting the Beam programming abstractions in the IBM Streams runner. Beam organizes data into discrete event-time windows. This design supports out-of-order data arrivals, but it also forces runners to maintain more state, which leads to higher space and computation overhead. The IBM Streams runner mitigates this problem by efficiently indexing inter-dependent states, garbage-collecting stale keys, and enforcing bundle sizes. We also share performance concerns in Beam that could potentially impact applications. Evaluations show that the IBM Streams runner outperforms the Flink runner and the Spark runner in most scenarios when running the Beam NEXMark benchmarks. The IBM Streams runner is available for download from the IBM Cloud Streaming Analytics service console.
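
To make the state overhead mentioned in the abstract concrete, the following is a minimal sketch of per-key state held in fixed event-time windows and released once a watermark passes the end of a window. It is a toy illustration, not the IBM Streams runner's implementation and not the Beam API: the class `WindowedCounter`, its methods, and the `allowed_lateness` parameter are hypothetical names chosen for this example, and times are plain integers.

```python
from collections import defaultdict


class WindowedCounter:
    """Toy per-key counter over fixed event-time windows (hypothetical, not Beam code)."""

    def __init__(self, window_size, allowed_lateness=0):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.watermark = float("-inf")
        # One state cell per (key, window start). Out-of-order arrivals mean
        # several windows per key can be open at once, which is the extra
        # state a runner has to hold.
        self.state = defaultdict(int)

    def window_start(self, event_time):
        # Assign an element to the fixed window that contains its event time.
        return event_time - (event_time % self.window_size)

    def _expired(self, start):
        # A window expires once the watermark reaches its end plus lateness.
        return start + self.window_size + self.allowed_lateness <= self.watermark

    def add(self, key, event_time):
        start = self.window_start(event_time)
        if self._expired(start):
            return False  # too late: the window's state was already released
        self.state[(key, start)] += 1
        return True

    def advance_watermark(self, new_watermark):
        # Emit and garbage-collect every window the watermark has passed.
        self.watermark = max(self.watermark, new_watermark)
        fired = {}
        for key, start in list(self.state):
            if self._expired(start):
                fired[(key, start)] = self.state.pop((key, start))
        return fired


if __name__ == "__main__":
    counter = WindowedCounter(window_size=10)
    for key, t in [("a", 12), ("a", 3), ("b", 5), ("a", 7)]:  # out of order
        counter.add(key, t)
    print(counter.advance_watermark(10))
    print(dict(counter.state))
```

In this sketch, state for a key stays in memory until the watermark releases it, so stale keys accumulate unless they are explicitly collected. That is the general cost the abstract attributes to event-time windowing.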



Published In

Proceedings of the VLDB Endowment  Volume 11, Issue 12
August 2018
426 pages
ISSN:2150-8097

Publisher

VLDB Endowment


Qualifiers

  • Research-article
