research-article

Exploratory Analysis of Spark Structured Streaming

Authors:

Jason Taaffe

ICPE '18: Companion of the 2018 ACM/SPEC International Conference on Performance Engineering

Pages 141 - 146

https://doi.org/10.1145/3185768.3186360

Published: 02 April 2018

Abstract

In the Big Data era, stream processing has become a common requirement for many data-intensive applications. This has led to many advances in the development and adaptation of large-scale streaming systems. Spark and Flink have become popular choices for many developers because they combine batch and streaming capabilities in a single system. However, the introduction of Spark Structured Streaming in version 2.0 opened up completely new features for SparkSQL, which are otherwise only available in Apache Calcite. This work focuses on the new Spark Structured Streaming and analyses it by diving into its internal functionalities. With the help of a micro-benchmark consisting of streaming queries, we perform initial experiments evaluating the technology. Our results show that Spark Structured Streaming is able to run multiple queries successfully in parallel on data with changing velocity and volume.

Published In

ICPE '18: Companion of the 2018 ACM/SPEC International Conference on Performance Engineering

April 2018

212 pages

ISBN: 9781450356299

DOI: 10.1145/3185768

General Chairs: Katinka Wolter (Free University of Berlin, Germany), Will Knottenbelt (Imperial College London, UK)

Program Chairs: André van Hoorn (University of Stuttgart, Germany), Manoj Nambiar (Tata Consultancy Services, India), Heiko Koziolek (ABB, Germany)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ICPE '18

ICPE '18: ACM/SPEC International Conference on Performance Engineering

April 9 - 13, 2018

Berlin, Germany

Acceptance Rates

Overall Acceptance Rate 252 of 851 submissions, 30%

Cited By

K S, D S, M A, S S, S S (2023). Twitter Based Complaint Management System for Rail Transit. 2023 2nd International Conference on Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA), pp. 1-6. Online publication date: 16-Jun-2023.
https://doi.org/10.1109/ICAECA56562.2023.10199383
Abid A, Jemili F (2020). Intrusion Detection based on Graph oriented Big Data Analytics. Procedia Computer Science 176, pp. 572-581. Online publication date: 2020.
https://doi.org/10.1016/j.procs.2020.08.059
Ivanov T, Pergolesi M (2019). The impact of columnar file formats on SQL‐on‐hadoop engine performance: A study on ORC and Parquet. Concurrency and Computation: Practice and Experience 32:5. Online publication date: 9-Sep-2019.
https://doi.org/10.1002/cpe.5523
Hafsa M, Jemili F (2018). Comparative Study between Big Data Analysis Techniques in Intrusion Detection. Big Data and Cognitive Computing 3:1, p. 1. Online publication date: 20-Dec-2018.
https://doi.org/10.3390/bdcc3010001
Dhaouadi J, Aktas M (2018). On the Data Stream Processing Frameworks: A Case Study. 2018 3rd International Conference on Computer Science and Engineering (UBMK), pp. 104-109. Online publication date: Sep-2018.
https://doi.org/10.1109/UBMK.2018.8566457
