
Integrating Hadoop and parallel DBMS

Authors:

Pekka Kostamaa,

Like Gao

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Pages 969 - 974

https://doi.org/10.1145/1807167.1807272

Published: 06 June 2010

Abstract

Teradata's parallel DBMS has been successfully deployed in large data warehouses over the last two decades for large-scale business analysis in various industries, over data sets ranging from a few terabytes to multiple petabytes. However, because of the explosive increase in data volume in recent years, some data at some customer sites, such as web logs and sensor data, are not managed by the Teradata EDW (Enterprise Data Warehouse), partly because it is very expensive to load such extremely large volumes of data into an RDBMS, especially when the data are not frequently used to support important business decisions. Recently the MapReduce programming paradigm, started by Google and made popular by the open-source Hadoop implementation with major support from Yahoo!, has been gaining rapid momentum in both academia and industry as another way of performing large-scale data analysis. By now most data warehouse researchers and practitioners agree that the parallel DBMS and MapReduce paradigms each have advantages and disadvantages for various business applications, and thus that both paradigms are going to coexist for a long time [16]. In fact, a large number of Teradata customers, especially those in the e-business and telecom industries, have seen an increasing need to perform BI over both data stored in Hadoop and data in the Teradata EDW. One thing Hadoop and the Teradata EDW have in common is that data in both systems are partitioned across multiple nodes for parallel computing, which creates integration optimization opportunities not possible for DBMSs running on a single node. In this paper we describe our three efforts towards tight and efficient integration of Hadoop and the Teradata EDW.

References

[1]

Teradata Developer Exchange. http://developer.teradata.com/extensibility/articles/hadoop-dfs-to-teradata.

[2]

Teradata Online Documentation. http://www.info.teradata.com/.

[3]

A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow., 2(1):922--933, 2009.


[4]

R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: easy and efficient parallel processing of massive data sets. Proc. VLDB Endow., 1(2):1265--1276, 2008.


[5]

L. Chu, H. Tang, and T. Yang. Optimizing data aggregation for cluster-based internet services. In Proc. of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2003.


[6]

Cloudera. http://www.cloudera.com/.

[7]

DBInputFormat. http://www.cloudera.com/blog/2009/03/database-access-with-hadoop/.

[8]

J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI '04, pages 137--150.

[9]

A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of MapReduce: the Pig experience. PVLDB, 2(2):1414--1425, 2009.


[10]

S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP '03. Google, October 2003.


[11]

Hadoop. http://hadoop.apache.org/core/.

[12]

J. N. Hoover. Start-ups bring Google's parallel processing to data warehousing. 2008.

[13]

M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007. Microsoft Research, Silicon Valley.


[14]

C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In SIGMOD Conference, pages 1099--1110, 2008.


[15]

A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD '09: Proceedings of the 35th SIGMOD international conference on Management of data, pages 165--178, New York, NY, USA, 2009. ACM.


[16]

M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and parallel DBMSs: friends or foes? Commun. ACM, 53(1):64--71, 2010.


[17]

A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive - a warehousing solution over a map-reduce framework. PVLDB, 2(2):1626--1629, 2009.


[18]

VerticaInputFormat. http://www.vertica.com/mapreduce.

[19]

H.-C. Yang, A. Dasdan, R.-L. Hsiao, and S. D. Parker. Map-reduce-merge: simplified relational data processing on large clusters. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pages 1029--1040, New York, NY, USA, 2007. ACM.



Index Terms

Integrating Hadoop and parallel DBMS
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs


Published In


SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

June 2010

1286 pages

ISBN:9781450300322

DOI:10.1145/1807167

General Chair: Ahmed Elmagarmid (Purdue University, USA)
Program Chair: Divyakant Agrawal (University of California at Santa Barbara, USA)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

SIGMOD/PODS '10

Sponsor:

SIGMOD

SIGMOD/PODS '10: International Conference on Management of Data

June 6 - 10, 2010

Indianapolis, Indiana, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%


Cited By

Forresi C, Gallinucci E, Golfarelli M, Hamadou H (2021). A dataspace-based framework for OLAP analyses in a high-variety multistore. The VLDB Journal — The International Journal on Very Large Data Bases, 30:6, pages 1017-1040. doi:10.1007/s00778-021-00682-5. Online publication date: 31-Jul-2021.
https://dl.acm.org/doi/10.1007/s00778-021-00682-5

Shukla A, Somagattu P, Singh V, Kalra M (2020). IoT-Based Data Storage for Cloud Computing Applications. Advances in Artificial Intelligence and Data Engineering, pages 1455-1464. doi:10.1007/978-981-15-3514-7_109. Online publication date: 14-Aug-2020.
https://doi.org/10.1007/978-981-15-3514-7_109

Yadav P, Sharma S, Singh A (2019). Big Data and Cloud Computing: An Emerging Perspective and Future Trends. 2019 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT), pages 1-4. doi:10.1109/ICICT46931.2019.8977674. Online publication date: Oct-2019.
https://doi.org/10.1109/ICICT46931.2019.8977674

Ortiz-Garcés I, Yánez N, Villegas-Ch W (2019). Performance Data Analysis for Parallel Processing Using Bigdata Distribution. Information Technology and Systems, pages 602-611. doi:10.1007/978-3-030-11890-7_58. Online publication date: 29-Jan-2019.
https://doi.org/10.1007/978-3-030-11890-7_58

Cardinale Y, Guehis S, Rukoz M (2018). Classifying Big Data Analytic Approaches: A Generic Architecture. Software Technologies, pages 268-295. doi:10.1007/978-3-319-93641-3_13. Online publication date: 8-Jun-2018.
https://doi.org/10.1007/978-3-319-93641-3_13

Kim M, Shin M, Park S, Ossowski S (2016). Take me to SSD. Proceedings of the 31st Annual ACM Symposium on Applied Computing, pages 965-971. doi:10.1145/2851613.2851658. Online publication date: 4-Apr-2016.
https://dl.acm.org/doi/10.1145/2851613.2851658

AlZu'bi S, Shehab M, Al-Ayyoub M, Benkhelifa E, Jararweh Y (2016). Parallel implementation of FCM-based volume segmentation of 3D images. 2016 IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA), pages 1-6. doi:10.1109/AICCSA.2016.7945811. Online publication date: Dec-2016.
https://doi.org/10.1109/AICCSA.2016.7945811

Alamoudi A, Grover R, Carey M, Borkar V, Bailey J, Moffat A, Aggarwal C, de Rijke M, Kumar R, Murdock V, Sellis T, Yu J (2015). External Data Access And Indexing In AsterixDB. Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 3-12. doi:10.1145/2806416.2806428. Online publication date: 17-Oct-2015.
https://dl.acm.org/doi/10.1145/2806416.2806428

Wang Y, Fu H, Yu W (2015). Cracking Down MapReduce Failure Amplification through Analytics Logging and Migration. Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, pages 261-270. doi:10.1109/IPDPS.2015.111. Online publication date: 25-May-2015.
https://dl.acm.org/doi/10.1109/IPDPS.2015.111

Wang H, Qin X, Zhou X, Li F, Qin Z, Zhu Q, Wang S (2015). Efficient query processing framework for big data warehouse. Frontiers of Computer Science: Selected Publications from Chinese Universities, 9:2, pages 224-236. doi:10.1007/s11704-014-4025-6. Online publication date: 1-Apr-2015.
https://dl.acm.org/doi/10.1007/s11704-014-4025-6
