DOI: 10.1145/3209950.3209956
research-article

Adding Velocity to BigBench

Published: 15 June 2018

Abstract

BigBench, standardized as TPCx-BB, is a popular application benchmark that targets Big Data storage and processing systems. BigBench V2 addresses some of BigBench's limitations by introducing a new, simplified data model, semi-structured web logs in JSON format, and new queries that mandate late binding. However, it still covers only batch processing workloads and does not address the velocity characteristic of Big Data. This work extends the BigBench V2 benchmark with a data streaming component that simulates typical statistical and predictive analytics queries in a retail business scenario. Our approach preserves the existing BigBench design and introduces a new streaming component that supports two data streaming modes: active and passive. In active mode, data stream generation and processing happen in parallel, whereas in passive mode, the data stream is generated in full before stream processing begins. The stream workload consists of five queries inspired by the existing 30 BigBench queries. To validate the proposed streaming extension, both streaming modes were implemented and tested using Kafka and Spark Streaming. The experimental results demonstrate the feasibility of our benchmark design. Finally, we outline design challenges and future plans for improving the proposed BigBench extension.
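
The difference between the two streaming modes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the paper uses Kafka and Spark Streaming, whereas this sketch substitutes an in-memory queue and a thread so that it runs with the standard library alone. The record schema, the function names (`generate_clicks`, `process`, `run_passive`, `run_active`) and the per-item view count are all hypothetical and stand in for the benchmark's data generator and streaming queries.

```python
import queue
import threading
from collections import Counter

_END = object()  # sentinel marking the end of the stream (assumption of this sketch)


def generate_clicks(n):
    """Yield n synthetic web-log records (hypothetical schema)."""
    items = ["book", "phone", "shoes"]
    for i in range(n):
        yield {"user": i % 4, "item": items[i % len(items)]}


def process(records):
    """Toy streaming query: count views per item."""
    counts = Counter()
    for rec in records:
        counts[rec["item"]] += 1
    return dict(counts)


def run_passive(n):
    """Passive mode: generate the whole stream first, then process it."""
    stream = list(generate_clicks(n))  # generation phase finishes completely
    return process(stream)             # processing starts only afterwards


def run_active(n):
    """Active mode: generation and processing run in parallel."""
    buf = queue.Queue(maxsize=8)  # bounded buffer between producer and consumer

    def producer():
        for rec in generate_clicks(n):
            buf.put(rec)
        buf.put(_END)

    def drain():
        # Hand records to the query as soon as the producer emits them.
        while True:
            rec = buf.get()
            if rec is _END:
                return
            yield rec

    worker = threading.Thread(target=producer)
    worker.start()
    result = process(drain())
    worker.join()
    return result
```

Both functions compute the same query result; they differ only in whether processing waits for generation to finish. In a benchmark setting that distinction determines whether generator throughput is part of what gets measured.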

Cited By

  • (2019) The impact of columnar file formats on SQL‐on‐hadoop engine performance: A study on ORC and Parquet. Concurrency and Computation: Practice and Experience 32:5. DOI: 10.1002/cpe.5523. Online publication date: 9-Sep-2019

Published In

DBTest '18: Proceedings of the Workshop on Testing Database Systems
June 2018
49 pages
ISBN: 9781450358262
DOI: 10.1145/3209950

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Big Data Benchmarking
  2. BigBench
  3. Streaming Benchmark

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SIGMOD/PODS '18

Acceptance Rates

Overall Acceptance Rate 31 of 56 submissions, 55%
