research-article

BigBench: towards an industry standard benchmark for big data analytics

Authors:

Alain Crolotte,

Hans-Arno Jacobsen

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Pages 1197 - 1208

https://doi.org/10.1145/2463676.2463712

Published: 22 June 2013

Abstract

There is a tremendous interest in big data by academia, industry and a large user base. Several commercial and open source providers unleashed a variety of products to support big data storage and processing. As these products mature, there is a need to evaluate and compare the performance of these systems.

In this paper, we present BigBench, an end-to-end big data benchmark proposal. The underlying business model of BigBench is a product retailer. The proposal covers a data model and synthetic data generator that address the variety, velocity and volume aspects of big data systems containing structured, semi-structured and unstructured data. The structured part of the BigBench data model is adopted from the TPC-DS benchmark, which is enriched with semi-structured and unstructured data components. The semi-structured part captures registered and guest user clicks on the retailer's website. The unstructured data captures product reviews submitted online. The data generator designed for BigBench provides scalable volumes of raw data based on a scale factor. The BigBench workload is designed around a set of queries against the data model. From a business perspective, the queries cover the different categories of big data analytics proposed by McKinsey. From a technical perspective, the queries are designed to span three different dimensions based on data sources, query processing types and analytic techniques.

We illustrate the feasibility of BigBench by implementing it on the Teradata Aster Database. The test includes generating and loading a 200 Gigabyte BigBench data set and testing the workload by executing the BigBench queries (written using Teradata Aster SQL-MR) and reporting their response times.
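
The flow the abstract describes (generate data whose volume depends on a scale factor, then run queries over the structured, semi-structured and unstructured parts and report response times) can be illustrated with a toy sketch. This is not the BigBench generator or its query set: the function names, the row counts in `BASE_ROWS`, the word list and the three queries are all hypothetical and chosen only to keep the example small.

```python
import random
import time

# Hypothetical rows per unit of scale factor; not BigBench's actual ratios.
BASE_ROWS = {"sales": 1000, "clicks": 5000, "reviews": 200}
WORDS = ["great", "poor", "fast", "slow", "cheap", "sturdy"]


def generate(scale_factor, seed=42):
    """Build a toy retailer data set whose volume grows linearly with scale_factor."""
    rng = random.Random(seed)
    n = {name: base * scale_factor for name, base in BASE_ROWS.items()}
    # Structured part: fixed-schema rows (item_id, quantity, price).
    sales = [(rng.randrange(100), rng.randint(1, 5), round(rng.uniform(1, 50), 2))
             for _ in range(n["sales"])]
    # Semi-structured part: click records; guest clicks have no user_id key.
    clicks = []
    for _ in range(n["clicks"]):
        click = {"item_id": rng.randrange(100)}
        if rng.random() < 0.5:
            click["user_id"] = rng.randrange(1000)
        clicks.append(click)
    # Unstructured part: free-text product reviews.
    reviews = [" ".join(rng.choices(WORDS, k=8)) for _ in range(n["reviews"])]
    return {"sales": sales, "clicks": clicks, "reviews": reviews}


def revenue_per_item(data):
    """Aggregate over the structured part."""
    totals = {}
    for item_id, quantity, price in data["sales"]:
        totals[item_id] = totals.get(item_id, 0.0) + quantity * price
    return totals


def guest_click_share(data):
    """Fraction of clicks made by guests (semi-structured part)."""
    guests = sum(1 for click in data["clicks"] if "user_id" not in click)
    return guests / len(data["clicks"])


def positive_review_count(data):
    """Count reviews containing the word 'great' (unstructured part)."""
    return sum(1 for review in data["reviews"] if "great" in review.split())


QUERIES = [revenue_per_item, guest_click_share, positive_review_count]


def run_workload(data):
    """Run each query once and return its response time in seconds."""
    timings = {}
    for query in QUERIES:
        start = time.perf_counter()
        query(data)
        timings[query.__name__] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    for name, seconds in run_workload(generate(scale_factor=2)).items():
        print(f"{name}: {seconds:.6f} s")
```

Fixing the seed makes the generated data reproducible for a given scale factor, so repeated runs time the same input. The absolute timings are meaningless here; the sketch only shows the shape of a scale-factor-driven benchmark run.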

Published In

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

June 2013

1322 pages

ISBN:9781450320375

DOI:10.1145/2463676

General Chairs: Kenneth Ross (Columbia University), Divesh Srivastava (AT&T Research)

Program Chair: Dimitris Papadias (HKUST)

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS'13

Sponsor:

SIGMOD

SIGMOD/PODS'13: International Conference on Management of Data

June 22 - 27, 2013

New York, New York, USA

Acceptance Rates

SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

Total Citations: 303
Total Downloads: 4,008
Downloads (Last 12 months): 217
Downloads (Last 6 weeks): 22

Reflects downloads up to 25 Dec 2024

Citations

Cited By

Liang Y, Wu C, Song T, Wu W, Xia Y, Liu Y, Ou Y, Lu S, Ji L, Mao S, Wang Y, Shou L, Gong M, Duan N (2024). TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs. Intelligent Computing, 3. Online publication date: 16-Feb-2024.
https://doi.org/10.34133/icomputing.0063
Xiao Z, Deng W, Lam M, Eslami M, Kim J, Lee M, Liao Q (2024). Human-Centered Evaluation and Auditing of Language Models. Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, pp. 1-6. Online publication date: 11-May-2024.
https://dl.acm.org/doi/10.1145/3613905.3636302
Abbasian M, Khatibi E, Azimi I, Oniani D, Shakeri Hossein Abad Z, Thieme A, Sriram R, Yang Z, Wang Y, Lin B, Gevaert O, Li L, Jain R, Rahmani A (2024). Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI. npj Digital Medicine, 7:1. Online publication date: 29-Mar-2024.
https://doi.org/10.1038/s41746-024-01074-z
Nazir A, Chakravarthy T, Cecchini D, Khajuria R, Sharma P, Mirik A, Kocaman V, Talby D (2024). LangTest: A comprehensive evaluation library for custom LLM and NLP models. Software Impacts, 19 (100619). Online publication date: Mar-2024.
https://doi.org/10.1016/j.simpa.2024.100619
Yang Y, Wang L, Zhan J (2024). A Linear Combination-Based Method to Construct Proxy Benchmarks for Big Data Workloads. Benchmarking, Measuring, and Optimizing, pp. 120-136. Online publication date: 14-Feb-2024.
https://doi.org/10.1007/978-981-97-0316-6_8
Wei X, Wang J, Sun C, Towey D, Zhang S, Zuo W, Yu Y, Ruan R, Song G (2024). Log-based anomaly detection for distributed systems: State of the art, industry experience, and open issues. Journal of Software: Evolution and Process. Online publication date: 7-Feb-2024.
https://doi.org/10.1002/smr.2650
Brücke C, Härtling P, Palacios R, Patel H, Rabl T (2023). TPCx-AI - An Industry Standard Benchmark for Artificial Intelligence and Machine Learning Systems. Proceedings of the VLDB Endowment, 16:12, pp. 3649-3661. Online publication date: 1-Aug-2023.
https://dl.acm.org/doi/10.14778/3611540.3611554
Zhu J, Lu B, Yu X, Xu J, Wo T (2023). An Approach to Workload Generation for Cloud Benchmarking: a View from Alibaba Trace. 2023 IEEE 15th International Symposium on Autonomous Decentralized System (ISADS), pp. 1-8. Online publication date: 15-Mar-2023.
https://doi.org/10.1109/ISADS56919.2023.10092039
Sawadogo P, Darmont J (2023). DLBench+. Data & Knowledge Engineering, 145:C. Online publication date: 1-May-2023.
https://dl.acm.org/doi/10.1016/j.datak.2023.102154
Poess M (2023). TPC, Where Art Thou? Datenbank-Spektrum, 22:3, pp. 241-248. Online publication date: 3-Feb-2023.
https://doi.org/10.1007/s13222-022-00428-9