DOI: 10.1145/2588555.2595633
research-article

GenBase: a complex analytics genomics benchmark

Published: 18 June 2014

Abstract

This paper introduces a new benchmark designed to test database management system (DBMS) performance on a mix of data management tasks (joins, filters, etc.) and complex analytics (regression, singular value decomposition, etc.). Such mixed workloads are prevalent in a number of application areas, including most science workloads and web analytics. As a specific use case, we have chosen genomics data for our benchmark and have constructed a collection of typical tasks in this domain. In addition to being representative of a mixed data management and analytics workload, this benchmark is also meant to scale to large dataset sizes and to multiple nodes across a cluster. Besides presenting this benchmark, we have run it on a variety of storage systems, including traditional row stores, newer column stores, Hadoop, and an array DBMS. We present performance numbers for all systems on single and multiple nodes, and show that performance differs by orders of magnitude between the various solutions. In addition, we demonstrate that most platforms have scalability issues. We also test offloading the analytics onto a coprocessor. The intent of this benchmark is to focus research interest in this area; to this end, all of our data, data generators, and scripts are available on our web site.
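To make the phrase "mixed workload" concrete, the following is a minimal sketch of a query that combines a data management step (filter and join) with an analytics step (regression). It is not taken from the benchmark: the table layouts, the column names, the toy values, and the helper names `filter_and_join` and `simple_linear_regression` are all assumptions chosen for illustration, and a single-variable least-squares fit stands in for the heavier analytics the abstract mentions.

```python
from statistics import mean

# Hypothetical toy tables; this schema is an assumption, not the benchmark's.
# Expression rows: (gene_id, patient_id, expression_value)
expression = [
    (0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0), (0, 3, 4.0),
    (1, 0, 5.0), (1, 1, 5.5), (1, 2, 6.5), (1, 3, 7.0),
]
# Patient rows: (patient_id, age, drug_response)
patients = [
    (0, 35, 2.1), (1, 52, 3.9), (2, 61, 6.1), (3, 47, 8.0),
]


def filter_and_join(expression, patients, gene_id, min_age):
    """Data management step: keep patients at or above min_age, then
    join them to the expression rows of one gene on patient_id."""
    selected = {pid: response for pid, age, response in patients if age >= min_age}
    return [
        (value, selected[pid])
        for gid, pid, value in expression
        if gid == gene_id and pid in selected
    ]


def simple_linear_regression(pairs):
    """Analytics step: ordinary least squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Mixed workload: the relational step feeds the numerical step.
pairs = filter_and_join(expression, patients, gene_id=0, min_age=40)
slope, intercept = simple_linear_regression(pairs)
print(f"rows after filter+join: {len(pairs)}, slope: {slope:.3f}, intercept: {intercept:.3f}")
```

The point of the sketch is only the shape of the pipeline: the output of the join becomes the input matrix of the numerical routine, so a system has to handle both halves (or move data between two systems) to run the whole query.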



Published In

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
June 2014
1645 pages
ISBN:9781450323765
DOI:10.1145/2588555
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. array dbms
  2. benchmarking
  3. scientific dbms

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'14

Acceptance Rates

SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%
