research-article

Inside "Big Data management": ogres, onions, or parfaits?

Authors: Vinayak Borkar, Michael J. Carey, and Chen Li

EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology

March 2012

Pages 3 - 14

https://doi.org/10.1145/2247596.2247598

Published: 27 March 2012

Abstract

In this paper we review the history of systems for managing "Big Data" as well as today's activities and architectures from the (perhaps biased) perspective of three "database guys" who have been watching this space for a number of years and are currently working together on "Big Data" problems. Our focus is on architectural issues, and particularly on the components and layers that have been developed recently (in open source and elsewhere) and on how they are being used (or abused) to tackle challenges posed by today's notion of "Big Data". Also covered is the approach we are taking in the ASTERIX project at UC Irvine, where we are developing our own set of answers to the questions of the "right" components and the "right" set of layers for taming the "Big Data" beast. We close by sharing our opinions on what some of the important open questions are in this area as well as our thoughts on how the data-intensive computing community might best seek out answers.

Published In

EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology

March 2012

643 pages

ISBN:9781450307901

DOI:10.1145/2247596

Editors:
Elke Rundensteiner, Worcester Polytechnic Institute
Volker Markl, Technische Universität Berlin, Germany
Ioana Manolescu, INRIA, France
Sihem Amer-Yahia, QCRI, Doha, Qatar
Felix Naumann, Hasso Plattner Institute, Potsdam, Germany
Ismail Ari, Ozyegin University, Turkey

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation

Conference

EDBT '12

EDBT '12: 15th International Conference on Extending Database Technology

March 27 - 30, 2012

Berlin, Germany

Acceptance Rates

Overall Acceptance Rate 7 of 10 submissions, 70%
