DOI: 10.1145/2463676.2463712
research-article

BigBench: towards an industry standard benchmark for big data analytics

Published: 22 June 2013

Abstract

There is a tremendous interest in big data by academia, industry and a large user base. Several commercial and open source providers unleashed a variety of products to support big data storage and processing. As these products mature, there is a need to evaluate and compare the performance of these systems.
In this paper, we present BigBench, an end-to-end big data benchmark proposal. The underlying business model of BigBench is a product retailer. The proposal covers a data model and synthetic data generator that addresses the variety, velocity and volume aspects of big data systems containing structured, semi-structured and unstructured data. The structured part of the BigBench data model is adopted from the TPC-DS benchmark, which is enriched with semi-structured and unstructured data components. The semi-structured part captures registered and guest user clicks on the retailer's website. The unstructured data captures product reviews submitted online. The data generator designed for BigBench provides scalable volumes of raw data based on a scale factor. The BigBench workload is designed around a set of queries against the data model. From a business perspective, the queries cover the different categories of big data analytics proposed by McKinsey. From a technical perspective, the queries are designed to span three different dimensions based on data sources, query processing types and analytic techniques.
We illustrate the feasibility of BigBench by implementing it on the Teradata Aster Database. The test includes generating and loading a 200 Gigabyte BigBench data set and testing the workload by executing the BigBench queries (written using Teradata Aster SQL-MR) and reporting their response times.
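
As a rough illustration of the data model summarized in the abstract, the following Python sketch generates the three kinds of data (structured sales, semi-structured clicks from registered and guest users, unstructured review text) with volumes driven by a scale factor. Everything in it is an assumption made for illustration: the record types, field names, base row counts, the 70% registered-user share, and the linear scaling rule are hypothetical and are not taken from BigBench or TPC-DS.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical base volumes per unit of scale factor (not BigBench's values).
BASE_SALES = 100
BASE_CLICKS = 500
BASE_REVIEWS = 20
WORDS = ["great", "poor", "fast", "slow", "cheap", "sturdy", "product", "delivery"]


@dataclass
class Sale:
    """Structured record, loosely in the spirit of a retail fact table row."""
    sale_id: int
    user_id: int
    item_id: int
    quantity: int


@dataclass
class Click:
    """Semi-structured record: user_id is None for a guest click."""
    item_id: int
    user_id: Optional[int]


@dataclass
class Review:
    """Unstructured record: free text attached to an item."""
    item_id: int
    text: str


def generate(scale_factor: int, seed: int = 0) -> Tuple[List[Sale], List[Click], List[Review]]:
    """Generate synthetic data whose volume grows linearly with scale_factor."""
    if scale_factor < 1:
        raise ValueError("scale_factor must be >= 1")
    rng = random.Random(seed)  # seeded so that runs are repeatable
    n_users = 50 * scale_factor
    n_items = 20 * scale_factor

    sales = [
        Sale(i, rng.randrange(n_users), rng.randrange(n_items), rng.randint(1, 5))
        for i in range(BASE_SALES * scale_factor)
    ]
    clicks = [
        # Assumed split: roughly 70% registered users, the rest guests.
        Click(rng.randrange(n_items), rng.randrange(n_users) if rng.random() < 0.7 else None)
        for _ in range(BASE_CLICKS * scale_factor)
    ]
    reviews = [
        Review(rng.randrange(n_items), " ".join(rng.choice(WORDS) for _ in range(rng.randint(3, 8))))
        for _ in range(BASE_REVIEWS * scale_factor)
    ]
    return sales, clicks, reviews


if __name__ == "__main__":
    sales, clicks, reviews = generate(scale_factor=2)
    print(len(sales), len(clicks), len(reviews))
```

Seeding the generator keeps repeated runs at the same scale factor identical, which is one simple way to make benchmark inputs reproducible; the paper's actual generator may work differently.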

Published In

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
June 2013
1322 pages
ISBN: 9781450320375
DOI: 10.1145/2463676
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. benchmarking
  2. big data
  3. map reduce

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'13

Acceptance Rates

SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%
