IBM Streams Processing Language: Analyzing Big Data in motion

Published: 01 May 2013

Abstract

The IBM Streams Processing Language (SPL) is the programming language for IBM InfoSphere® Streams, a platform for analyzing Big Data in motion. By "Big Data in motion," we mean continuous data streams at high data-transfer rates. InfoSphere Streams processes such data with both high throughput and short response times. To meet these performance demands, it deploys each application on a cluster of commodity servers. SPL abstracts away the complexity of the distributed system, instead exposing a simple graph-of-operators view to the user. SPL has several innovations relative to prior streaming languages. For performance and code reuse, SPL provides a code-generation interface to C++ and Java®. To facilitate writing well-structured and concise applications, SPL provides higher-order composite operators that modularize stream sub-graphs. Finally, to enable static checking while exposing optimization opportunities, SPL provides a strong type system and user-defined operator models. This paper provides a language overview, describes the implementation including optimizations such as fusion, and explains the rationale behind the language design.
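
The graph-of-operators view and the fusion optimization mentioned in the abstract can be illustrated with a small sketch. This is not SPL syntax and not the InfoSphere Streams implementation; it is a Python analogy in which `filter_op`, `map_op`, `fuse`, and the sample `trades` data are all hypothetical names invented for illustration. It assumes that each operator consumes one stream and produces one stream, and that fusing operators means chaining them inside one process so that items are handed over by direct call rather than through a network connection.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator

# Hypothetical operator type: takes a stream of records, yields a stream of records.
Operator = Callable[[Iterable[dict]], Iterator[dict]]


def filter_op(predicate: Callable[[dict], bool]) -> Operator:
    """Build an operator that forwards only records matching the predicate."""
    def run(stream: Iterable[dict]) -> Iterator[dict]:
        for record in stream:
            if predicate(record):
                yield record
    return run


def map_op(fn: Callable[[dict], dict]) -> Operator:
    """Build an operator that transforms each record with fn."""
    def run(stream: Iterable[dict]) -> Iterator[dict]:
        for record in stream:
            yield fn(record)
    return run


def fuse(*ops: Operator) -> Operator:
    """Chain operators into one operator that runs in a single process.

    Each record flows from one operator to the next by a direct call,
    which is the idea behind fusion in this simplified analogy.
    """
    def run(stream: Iterable[dict]) -> Iterator[dict]:
        return reduce(lambda s, op: op(s), ops, iter(stream))
    return run


# Hypothetical input stream; a real stream would be unbounded.
trades = [
    {"sym": "IBM", "price": 190.0},
    {"sym": "XYZ", "price": 5.0},
    {"sym": "IBM", "price": 192.0},
]

pipeline = fuse(
    filter_op(lambda t: t["sym"] == "IBM"),
    map_op(lambda t: {"sym": t["sym"], "cents": int(t["price"] * 100)}),
)

result = list(pipeline(trades))
print(result)
```

The sketch covers only a linear chain of two operators. A real operator graph may fan in and fan out, and, as the abstract notes, the platform rather than the programmer decides how operators are distributed across the cluster.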


Published In

IBM Journal of Research and Development  Volume 57, Issue 3-4
May/July 2013
175 pages
ISSN:0018-8646
Editor: Aya Soffer

Publisher

IBM Corp.

United States

Publication History

Published: 01 May 2013
Accepted: 07 August 2012
Received: 11 July 2012
