DOI: 10.1145/1807167.1807272
research-article

Integrating Hadoop and parallel DBMS

Published: 06 June 2010

    Abstract

    Teradata's parallel DBMS has been successfully deployed in large data warehouses over the last two decades, supporting large-scale business analysis in various industries over data sets ranging from a few terabytes to multiple petabytes. However, because data volumes have grown explosively in recent years at some customer sites, some data, such as web logs and sensor data, are not managed by the Teradata EDW (Enterprise Data Warehouse), partly because it is very expensive to load such extremely large volumes of data into an RDBMS, especially when the data are not frequently used to support important business decisions. Recently the MapReduce programming paradigm, started by Google and made popular by the open source Hadoop implementation with major support from Yahoo!, has been gaining rapid momentum in both academia and industry as another way of performing large-scale data analysis. By now most data warehouse researchers and practitioners agree that the parallel DBMS and MapReduce paradigms each have advantages and disadvantages for various business applications, and thus both paradigms are going to coexist for a long time [16]. In fact, a large number of Teradata customers, especially those in the e-business and telecom industries, have seen increasing needs to perform BI over both data stored in Hadoop and data in Teradata EDW. One thing Hadoop and Teradata EDW have in common is that data in both systems are partitioned across multiple nodes for parallel computing, which creates integration optimization opportunities not possible for DBMSs running on a single node. In this paper we describe our three efforts towards tight and efficient integration of Hadoop and Teradata EDW.

    References

    [1] Teradata Developer Exchange. http://developer.teradata.com/extensibility/articles/hadoop-dfs-to-teradata.
    [2] Teradata Online Documentation. http://www.info.teradata.com/.
    [3] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow., 2(1):922--933, 2009.
    [4] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: easy and efficient parallel processing of massive data sets. Proc. VLDB Endow., 1(2):1265--1276, 2008.
    [5] L. Chu, H. Tang, and T. Yang. Optimizing data aggregation for cluster-based internet services. In Proc. of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2003.
    [6] Cloudera. http://www.cloudera.com/.
    [7] DBInputFormat. http://www.cloudera.com/blog/2009/03/database-access-with-hadoop/.
    [8] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI '04, pages 137--150.
    [9] A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of MapReduce: The Pig experience. PVLDB, 2(2):1414--1425, 2009.
    [10] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP '03. Google, October 2003.
    [11] Hadoop. http://hadoop.apache.org/core/.
    [12] J. N. Hoover. Start-ups bring Google's parallel processing to data warehousing. 2008.
    [13] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007. Microsoft Research, Silicon Valley.
    [14] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In SIGMOD Conference, pages 1099--1110, 2008.
    [15] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD '09: Proceedings of the 35th SIGMOD international conference on Management of data, pages 165--178, New York, NY, USA, 2009. ACM.
    [16] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and parallel DBMSs: friends or foes? Commun. ACM, 53(1):64--71, 2010.
    [17] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive - a warehousing solution over a map-reduce framework. PVLDB, 2(2):1626--1629, 2009.
    [18] VerticaInputFormat. http://www.vertica.com/mapreduce.
    [19] H.-C. Yang, A. Dasdan, R.-L. Hsiao, and S. D. Parker. Map-reduce-merge: simplified relational data processing on large clusters. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pages 1029--1040, New York, NY, USA, 2007. ACM.

    Published In

    SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
    June 2010
    1286 pages
    ISBN:9781450300322
    DOI:10.1145/1807167
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data load
    2. hadoop
    3. mapreduce
    4. parallel computing
    5. parallel dbms
    6. shared nothing

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '10: International Conference on Management of Data
    June 6 - 10, 2010
    Indianapolis, Indiana, USA

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Article Metrics

    • Downloads (Last 12 months): 9
    • Downloads (Last 6 weeks): 1
    Reflects downloads up to 26 Jul 2024

    Cited By

    • (2021) A dataspace-based framework for OLAP analyses in a high-variety multistore. The VLDB Journal: The International Journal on Very Large Data Bases, 30:6 (1017-1040). DOI: 10.1007/s00778-021-00682-5. Online publication date: 31-Jul-2021
    • (2020) IoT-Based Data Storage for Cloud Computing Applications. Advances in Artificial Intelligence and Data Engineering (1455-1464). DOI: 10.1007/978-981-15-3514-7_109. Online publication date: 14-Aug-2020
    • (2019) Big Data and Cloud Computing: An Emerging Perspective and Future Trends. 2019 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT) (1-4). DOI: 10.1109/ICICT46931.2019.8977674. Online publication date: Oct-2019
    • (2019) Performance Data Analysis for Parallel Processing Using Bigdata Distribution. Information Technology and Systems (602-611). DOI: 10.1007/978-3-030-11890-7_58. Online publication date: 29-Jan-2019
    • (2018) Classifying Big Data Analytic Approaches: A Generic Architecture. Software Technologies (268-295). DOI: 10.1007/978-3-319-93641-3_13. Online publication date: 8-Jun-2018
    • (2016) Take me to SSD. Proceedings of the 31st Annual ACM Symposium on Applied Computing (965-971). DOI: 10.1145/2851613.2851658. Online publication date: 4-Apr-2016
    • (2016) Parallel implementation of FCM-based volume segmentation of 3D images. 2016 IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA) (1-6). DOI: 10.1109/AICCSA.2016.7945811. Online publication date: Dec-2016
    • (2015) External Data Access And Indexing In AsterixDB. Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (3-12). DOI: 10.1145/2806416.2806428. Online publication date: 17-Oct-2015
    • (2015) Cracking Down MapReduce Failure Amplification through Analytics Logging and Migration. Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium (261-270). DOI: 10.1109/IPDPS.2015.111. Online publication date: 25-May-2015
    • (2015) Efficient query processing framework for big data warehouse. Frontiers of Computer Science: Selected Publications from Chinese Universities, 9:2 (224-236). DOI: 10.1007/s11704-014-4025-6. Online publication date: 1-Apr-2015
