
SPL: An Extensible Language for Distributed Stream Processing

Published: 06 March 2017

Abstract

Big data is revolutionizing how all sectors of our economy do business, including telecommunications, transportation, medicine, and finance. Big data comes in two flavors: data at rest and data in motion. Processing data in motion is stream processing. Stream processing for big data analytics often requires scale that can only be delivered by a distributed system, exploiting parallelism on many hosts and many cores. One such distributed stream processing system is IBM Streams. Early customer experience with IBM Streams uncovered that another core requirement is extensibility, since customers want to build high-performance domain-specific operators for use in their streaming applications. Based on these two core requirements of distribution and extensibility, we designed and implemented the Streams Processing Language (SPL). This article describes SPL with an emphasis on the language design, distributed runtime, and extensibility mechanism. SPL is now the gateway for the IBM Streams platform, used by our customers for stream processing in a broad range of application domains.



Published In

ACM Transactions on Programming Languages and Systems  Volume 39, Issue 1
March 2017
156 pages
ISSN:0164-0925
EISSN:1558-4593
DOI:10.1145/3050768
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 March 2017
Accepted: 01 September 2016
Revised: 01 July 2016
Received: 01 July 2014
Published in TOPLAS Volume 39, Issue 1


Author Tag

  1. Stream processing

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2023) Out-of-Order Sliding-Window Aggregation with Efficient Bulk Evictions and Insertions. Proceedings of the VLDB Endowment 16:11 (3227-3239). DOI: 10.14778/3611479.3611521. Online publication date: 1-Jul-2023
  • (2022) Ephemeral data handling in microservices with Tquery. PeerJ Computer Science 8 (e1037). DOI: 10.7717/peerj-cs.1037. Online publication date: 22-Jul-2022
  • (2022) Arbitrarily Parallelizable Code: A Model of Computation Evaluated on a Message-Passing Many-Core System. Computers 11:11 (164). DOI: 10.3390/computers11110164. Online publication date: 18-Nov-2022
  • (2022) Towards a Methodology for Building Dynamic Urgent Applications on Continuum Computing Platforms. 2022 First Combined International Workshop on Interactive Urgent Supercomputing (CIW-IUS) (1-6). DOI: 10.1109/CIW-IUS56691.2022.00009. Online publication date: Nov-2022
  • (2021) Synchronization Schemas. Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (1-18). DOI: 10.1145/3452021.3458317. Online publication date: 20-Jun-2021
  • (2021) StreamB: A Declarative Language for Automatically Processing Data Streams in Abstract Environments for Agent Platforms. Engineering Multi-Agent Systems (114-136). DOI: 10.1007/978-3-030-97457-2_7. Online publication date: 3-May-2021
  • (2020) Joker. Journal of Parallel and Distributed Computing 137:C (205-223). DOI: 10.1016/j.jpdc.2019.10.012. Online publication date: 1-Mar-2020
  • (2019) Dagstuhl Seminar on Big Stream Processing. ACM SIGMOD Record 47:3 (36-39). DOI: 10.1145/3316416.3316426. Online publication date: 27-Feb-2019
  • (2019) Stream Query Optimization. Encyclopedia of Big Data Technologies (1607-1615). DOI: 10.1007/978-3-319-77525-8_261. Online publication date: 20-Feb-2019
  • (2019) Stream Processing Languages and Abstractions. Encyclopedia of Big Data Technologies (1600-1607). DOI: 10.1007/978-3-319-77525-8_260. Online publication date: 20-Feb-2019