DOI: 10.1145/3503221.3508413
research-article
Open access

Stream processing with dependency-guided synchronization

Published: 28 March 2022

Abstract

    Real-time data processing applications with low latency requirements have led to the increasing popularity of stream processing systems. While such systems offer convenient APIs that can be used to achieve data parallelism automatically, they offer limited support for computations that require synchronization between parallel nodes. In this paper, we propose dependency-guided synchronization (DGS), an alternative programming model for stateful streaming computations with complex synchronization requirements. In the proposed model, the input is viewed as partially ordered, and the program consists of a set of parallelization constructs which are applied to decompose the partial order and process events independently. Our programming model maps to an execution model called synchronization plans, which supports synchronization between parallel nodes. Our evaluation shows that APIs offered by two widely used systems, Flink and Timely Dataflow, cannot suitably expose parallelism in some representative applications. In contrast, DGS enables implementations with scalable performance, the resulting synchronization plans offer throughput improvements when implemented manually in existing systems, and the programming overhead is small compared to writing sequential code.
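The idea of treating the input as partially ordered can be illustrated with a small sketch. Everything below is a hypothetical toy, not the paper's API or implementation: the `Event` type, the `inc`/`read` tags, and the `depends`, `fork`, and `join` helpers are assumptions chosen for illustration. Increments to a counter commute, so they can be processed on forked copies of the state; a read of the same key depends on them, so the copies are joined before it runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    tag: str          # "inc" or "read" (hypothetical tags)
    key: int
    value: int = 0


def depends(a, b):
    # Assumed dependence relation: two events must keep their relative order
    # only if they share a key and at least one of them is a read.
    return a.key == b.key and "read" in (a.tag, b.tag)


def update(state, ev, out):
    # Sequential semantics for one event: increments add to a per-key
    # counter, reads emit the counter and reset it.
    if ev.tag == "inc":
        state[ev.key] = state.get(ev.key, 0) + ev.value
    else:
        out.append((ev.key, state.get(ev.key, 0)))
        state[ev.key] = 0


def fork(state):
    # The left child keeps the counters, the right child starts empty.
    return dict(state), {}


def join(left, right):
    # Merge two children by summing their counters.
    merged = dict(left)
    for k, v in right.items():
        merged[k] = merged.get(k, 0) + v
    return merged


def run_sequential(events):
    state, out = {}, []
    for ev in events:
        update(state, ev, out)
    return out


def run_forked(events):
    # Collect runs of pairwise independent events, process each run on two
    # forked states, and join before any event that depends on the run.
    state, out, batch = {}, [], []

    def flush():
        nonlocal state
        left, right = fork(state)
        incs = 0
        for ev in batch:
            if ev.tag == "read":
                update(left, ev, out)  # reads need the counters held by left
            else:
                update(left if incs % 2 == 0 else right, ev, out)
                incs += 1
        state = join(left, right)
        batch.clear()

    for ev in events:
        if any(depends(ev, prev) for prev in batch):
            flush()
        batch.append(ev)
    flush()
    return out


events = [Event("inc", 1, 5), Event("inc", 1, 3), Event("read", 1),
          Event("inc", 2, 2), Event("read", 2)]
```

In this toy the two children run one after the other in a single thread; the point is only that the forked run produces the same outputs as `run_sequential(events)` while synchronizing (joining) only where the dependence relation requires it.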



    Published In

    PPoPP '22: Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
    April 2022
    495 pages
    ISBN:9781450392044
    DOI:10.1145/3503221
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data parallelism
    2. distributed stream processing
    3. sharding
    4. synchronization

    Qualifiers

    • Research-article

    Funding Sources

    • NSF

    Conference

    PPoPP '22

    Acceptance Rates

    Overall Acceptance Rate 230 of 1,014 submissions, 23%
