DOI: 10.1145/2247596.2247598
research-article

Inside "Big Data management": ogres, onions, or parfaits?

Published: 27 March 2012

    Abstract

    In this paper we review the history of systems for managing "Big Data" as well as today's activities and architectures from the (perhaps biased) perspective of three "database guys" who have been watching this space for a number of years and are currently working together on "Big Data" problems. Our focus is on architectural issues, and particularly on the components and layers that have been developed recently (in open source and elsewhere) and on how they are being used (or abused) to tackle challenges posed by today's notion of "Big Data". Also covered is the approach we are taking in the ASTERIX project at UC Irvine, where we are developing our own set of answers to the questions of the "right" components and the "right" set of layers for taming the "Big Data" beast. We close by sharing our opinions on what some of the important open questions are in this area as well as our thoughts on how the data-intensive computing community might best seek out answers.

    Published In

    EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology
    March 2012
    643 pages
    ISBN:9781450307901
    DOI:10.1145/2247596
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article

    Conference

    EDBT '12

    Acceptance Rates

    Overall Acceptance Rate 7 of 10 submissions, 70%

    Article Metrics

    • Downloads (Last 12 months)24
    • Downloads (Last 6 weeks)3

    Cited By

    • (2023) Real-Time Clickstream Data Processing and Visualization Using Apache Tools. 2023 7th International Conference On Computing, Communication, Control And Automation (ICCUBEA), pages 1-5. DOI: 10.1109/ICCUBEA58933.2023.10392270. Online publication date: 18-Aug-2023
    • (2022) An Analysis of Big Data Analytics. Research Anthology on Big Data Analytics, Architectures, and Applications, pages 1126-1148. DOI: 10.4018/978-1-6684-3662-2.ch054. Online publication date: 2022
    • (2022) Big Data Analytics and Big Data Processing for IOT-Based Sensing Devices. Transforming Management with AI, Big-Data, and IoT, pages 17-49. DOI: 10.1007/978-3-030-86749-2_2. Online publication date: 17-Feb-2022
    • (2021) Dynamic Capabilities of Decision-oriented Service Systems. Research Anthology on Decision Support Systems and Decision Management in Healthcare, Business, and Engineering, pages 240-266. DOI: 10.4018/978-1-7998-9023-2.ch011. Online publication date: 2021
    • (2021) Dynamic Capabilities of Decision-oriented Service Systems. Research Anthology on Architectures, Frameworks, and Integration Strategies for Distributed and Cloud Computing, pages 957-984. DOI: 10.4018/978-1-7998-5339-8.ch045. Online publication date: 2021
    • (2021) An Analysis of Big Data Analytics. Smart Agricultural Services Using Deep Learning, Big Data, and IoT, pages 203-230. DOI: 10.4018/978-1-7998-5003-8.ch011. Online publication date: 2021
    • (2020) Big Data in Internet of Things: Architecture and Open Research Challenges. 2020 IEEE 23rd International Multitopic Conference (INMIC), pages 1-6. DOI: 10.1109/INMIC50486.2020.9318203. Online publication date: 5-Nov-2020
    • (2020) Towards Multi-approaches Bioinformatics Pipeline Based on Big Data and Cloud Computing for Next Generation Sequencing Data Analysis. Advanced Intelligent Systems for Sustainable Development (AI2SD’2019), pages 385-394. DOI: 10.1007/978-3-030-36664-3_43. Online publication date: 6-Feb-2020
    • (2019) Big Data Processing and Big Analytics. Emerging Technologies and Applications in Data Processing and Management, pages 285-315. DOI: 10.4018/978-1-5225-8446-9.ch014. Online publication date: 2019
    • (2019) Application of Big Data in Economic Policy. Web Services, pages 2289-2307. DOI: 10.4018/978-1-5225-7501-6.ch118. Online publication date: 2019
