Adding Velocity to BigBench

Authors: Patrick Bedué, Roberto V. Zicari

DBTest '18: Proceedings of the Workshop on Testing Database Systems

Article No.: 6, Pages 1 - 6

https://doi.org/10.1145/3209950.3209956

Published: 15 June 2018

Abstract

BigBench, standardized as TPCx-BB, is a popular application benchmark that targets Big Data storage and processing systems. BigBench V2 addresses some of BigBench's limitations by introducing a new, simplified data model, semi-structured web logs in JSON format, and new queries that mandate late binding. However, it still covers only batch processing workloads and does not address the velocity characteristic of Big Data. This work extends the BigBench V2 benchmark with a data streaming component that simulates typical statistical and predictive analytics queries in a retail business scenario. Our approach preserves the existing BigBench design and introduces a new streaming component that supports two data streaming modes: active and passive. In active mode, data stream generation and processing happen in parallel, whereas in passive mode, the data stream is generated in full before the actual stream processing begins. The stream workload consists of five queries inspired by the existing 30 BigBench queries. To validate the proposed streaming extension, the two streaming modes were implemented and tested using Kafka and Spark Streaming. The experimental results demonstrate the feasibility of our benchmark design. Finally, we outline design challenges and future plans for improving the proposed BigBench extension.
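The difference between the two streaming modes can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, with an in-memory queue standing in for a message broker such as Kafka and a plain loop standing in for a stream processor such as Spark Streaming. The event fields (`event_id`, `item_id`), the click-count query, and all function names are hypothetical and chosen only for illustration.

```python
import json
import queue
import random
import threading
from collections import Counter


def generate_events(n, seed=0):
    """Yield n hypothetical JSON web-log click events (field names are assumptions)."""
    rng = random.Random(seed)
    for i in range(n):
        yield json.dumps({"event_id": i, "item_id": rng.randint(1, 5)})


def count_clicks(events):
    """Toy streaming query: clicks per item, updated one event at a time."""
    counts = Counter()
    for raw in events:
        counts[json.loads(raw)["item_id"]] += 1
    return counts


def run_passive(n, seed=0):
    """Passive mode: materialize the whole stream first, then process it."""
    stream = list(generate_events(n, seed))  # generation finishes here
    return count_clicks(stream)              # processing starts afterwards


def run_active(n, seed=0):
    """Active mode: a producer thread generates while the consumer processes."""
    buffer = queue.Queue(maxsize=16)  # in-memory stand-in for a broker topic
    done = object()                   # sentinel marking the end of the stream

    def produce():
        for raw in generate_events(n, seed):
            buffer.put(raw)           # blocks when the consumer falls behind
        buffer.put(done)

    producer = threading.Thread(target=produce)
    producer.start()
    # iter(callable, sentinel) keeps calling buffer.get() until it returns `done`.
    result = count_clicks(iter(buffer.get, done))
    producer.join()
    return result
```

With the same seed, both modes see the same events and return the same counts; what differs is whether generation overlaps with processing. In this sketch the bounded queue also makes the producer wait for a slow consumer in active mode, an interaction that cannot occur when the stream is generated beforehand.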

Cited By

Ivanov T, Pergolesi M (2019). The impact of columnar file formats on SQL‐on‐hadoop engine performance: A study on ORC and Parquet. Concurrency and Computation: Practice and Experience 32:5. DOI: 10.1002/cpe.5523. Online publication date: 9-Sep-2019.
https://doi.org/10.1002/cpe.5523

Index Terms

Adding Velocity to BigBench
1. Computer systems organization
  1. Architectures
    1. Distributed architectures
2. Information systems
  1. Data management systems

Published In

DBTest '18: Proceedings of the Workshop on Testing Database Systems

June 2018

49 pages

ISBN:9781450358262

DOI:10.1145/3209950

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SIGMOD/PODS '18

Sponsor:

SIGMOD

SIGMOD/PODS '18: International Conference on Management of Data

June 15, 2018

Houston, TX, USA

Acceptance Rates

Overall Acceptance Rate 31 of 56 submissions, 55%

Article Metrics

Total Citations: 1
Total Downloads: 165
Downloads (last 12 months): 12
Downloads (last 6 weeks): 1

Reflects downloads up to 30 Aug 2024
