DOI: 10.1145/3185768.3186360

Exploratory Analysis of Spark Structured Streaming

Published: 02 April 2018

Abstract

In the Big Data era, stream processing has become a common requirement for many data-intensive applications. This has led to many advances in the development and adoption of large-scale streaming systems. Spark and Flink have become popular choices for many developers, as they combine batch and streaming capabilities in a single system. However, the introduction of Spark Structured Streaming in version 2.0 opened up completely new features for SparkSQL that are otherwise only available in Apache Calcite. This work focuses on the new Spark Structured Streaming and analyses it by diving into its internal functionalities. With the help of a micro-benchmark consisting of streaming queries, we perform initial experiments evaluating the technology. Our results show that Spark Structured Streaming is able to run multiple queries successfully in parallel on data with changing velocity and volume.

Published In

ICPE '18: Companion of the 2018 ACM/SPEC International Conference on Performance Engineering
April 2018
212 pages
ISBN: 9781450356299
DOI: 10.1145/3185768
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. big data benchmarking
  2. spark
  3. spark structured streaming

Qualifiers

  • Research-article

Conference

ICPE '18

Acceptance Rates

Overall Acceptance Rate 252 of 851 submissions, 30%

Cited By

  • (2023) Twitter Based Complaint Management System for Rail Transit. 2023 2nd International Conference on Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA), pp. 1-6. DOI: 10.1109/ICAECA56562.2023.10199383. Online publication date: 16-Jun-2023
  • (2020) Intrusion Detection based on Graph oriented Big Data Analytics. Procedia Computer Science 176, pp. 572-581. DOI: 10.1016/j.procs.2020.08.059. Online publication date: 2020
  • (2019) The impact of columnar file formats on SQL‐on‐hadoop engine performance: A study on ORC and Parquet. Concurrency and Computation: Practice and Experience 32:5. DOI: 10.1002/cpe.5523. Online publication date: 9-Sep-2019
  • (2018) Comparative Study between Big Data Analysis Techniques in Intrusion Detection. Big Data and Cognitive Computing 3:1 (1). DOI: 10.3390/bdcc3010001. Online publication date: 20-Dec-2018
  • (2018) On the Data Stream Processing Frameworks: A Case Study. 2018 3rd International Conference on Computer Science and Engineering (UBMK), pp. 104-109. DOI: 10.1109/UBMK.2018.8566457. Online publication date: Sep-2018
