DOI: 10.1145/3314221.3314580

Data-trace types for distributed stream processing systems

Published: 08 June 2019

Abstract

Distributed architectures for efficient processing of streaming data are increasingly critical to modern information processing systems. The goal of this paper is to develop type-based programming abstractions that facilitate correct and efficient deployment of a logical specification of the desired computation on such architectures. In the proposed model, each communication link has an associated type specifying tagged data items along with a dependency relation over tags that captures the logical partial ordering constraints over data items. The semantics of a (distributed) stream processing system is then a function from input data traces to output data traces, where a data trace is an equivalence class of sequences of data items induced by the dependency relation. This data-trace transduction model generalizes both acyclic synchronous data-flow and relational query processors, and can specify computations over data streams with a rich variety of partial ordering and synchronization characteristics. We then describe a set of programming templates for data-trace transductions: abstractions corresponding to common stream processing tasks. Our system automatically maps these high-level programs to a given topology on the distributed implementation platform Apache Storm while preserving the semantics. Our experimental evaluation shows that (1) while automatic parallelization deployed by existing systems may not preserve semantics, particularly when the computation is sensitive to the ordering of data items, our programming abstractions allow a natural specification of the query that contains a mix of ordering constraints while guaranteeing correct deployment, and (2) the throughput of the automatically compiled distributed code is comparable to that of hand-crafted distributed implementations.
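As an illustration of the data-trace notion described in the abstract, the following Python sketch checks whether two sequences of tagged items belong to the same equivalence class, that is, whether one can be turned into the other by repeatedly swapping adjacent items whose tags are not dependent. It is a minimal sketch under stated assumptions, not the paper's implementation: the dependency relation is assumed to be symmetric, the tag names `M` and `#` and the helper names `make_dependency` and `equivalent` are hypothetical, and the check uses a standard left-cancellation argument from trace theory.

```python
from typing import Hashable, Iterable, List, Sequence, Set, Tuple

Item = Tuple[str, Hashable]  # (tag, value); this representation is an assumption


def make_dependency(pairs: Iterable[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Return the symmetric closure of the given tag pairs."""
    dep: Set[Tuple[str, str]] = set()
    for a, b in pairs:
        dep.add((a, b))
        dep.add((b, a))
    return dep


def equivalent(u: Sequence[Item], v: Sequence[Item],
               dep: Set[Tuple[str, str]]) -> bool:
    """Decide whether u and v denote the same data trace under `dep`.

    For each item of v in order, locate its first occurrence among the
    remaining items of u. Every item skipped on the way must be
    independent of it (by tag); otherwise the two sequences differ in
    the order of two dependent items.
    """
    if len(u) != len(v):
        return False
    remaining: List[Item] = list(u)
    for item in v:
        for i, candidate in enumerate(remaining):
            if candidate == item:
                del remaining[i]  # cancel the matched occurrence
                break
            if (candidate[0], item[0]) in dep:
                return False  # a dependent item would have to come first
        else:
            return False  # item does not occur in the rest of u
    return True


# Hypothetical type: measurements "M" are mutually unordered, while the
# marker "#" is ordered with respect to everything, including itself.
dep = make_dependency([("M", "#"), ("#", "#")])

u = [("M", 5), ("M", 7), ("#", None), ("M", 9)]
v = [("M", 7), ("M", 5), ("#", None), ("M", 9)]
w = [("M", 5), ("#", None), ("M", 7), ("M", 9)]

print(equivalent(u, v, dep))  # True: only independent items were reordered
print(equivalent(u, w, dep))  # False: ("M", 7) crossed the marker
```

In this reading, the sequences `u` and `v` are two representatives of a single data trace, so a semantics-preserving deployment is free to deliver either order, whereas `w` is a different trace.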

Published In

PLDI 2019: Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2019
1162 pages
ISBN: 9781450367127
DOI: 10.1145/3314221

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. distributed data stream processing
  2. types

Qualifiers

  • Research-article

Conference

PLDI '19

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%
