Clydesdale: structured data processing on MapReduce

Authors: Eugene J. Shekita, Sandeep Tata

EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology

Pages 15 - 25

https://doi.org/10.1145/2247596.2247600

Published: 27 March 2012

Abstract

MapReduce has emerged as a promising architecture for large-scale data analytics on commodity clusters. The rapid adoption of Hive, a SQL-like data processing language on Hadoop (an open-source implementation of MapReduce), shows the increasing importance of processing structured data on MapReduce platforms. MapReduce offers several attractive properties, such as the use of low-cost hardware, fault tolerance, scalability, and elasticity. However, these advantages have so far come with a substantial performance penalty.

In this paper we introduce Clydesdale, a novel system for structured data processing on Hadoop. We show that Clydesdale improves performance by more than an order of magnitude over existing approaches without requiring any changes to the underlying platform. Clydesdale is aimed at workloads where the data fits a star schema. It draws on column-oriented storage, tailored join plans, and multi-core execution strategies, and carefully fits them into the constraints of a typical MapReduce platform. Using the Star Schema Benchmark, we show that Clydesdale is on average 38x faster than Hive. This demonstrates that MapReduce in general, and Hadoop in particular, is a far more compelling platform for structured data processing than previous results suggest.
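To make two of the ideas named in the abstract more concrete (reading only the columns a query needs, and joining a large fact table to small dimension tables), the following is a minimal single-process sketch. It is not Clydesdale's implementation and does not use Hadoop at all; the toy tables, the column names, and the helpers `build_hash_table` and `star_join_revenue_by_year` are all hypothetical and chosen only for illustration. It assumes the dimension tables are small enough to hold in memory as hash tables.

```python
from collections import defaultdict

# Hypothetical toy star schema: one fact table stored column-wise,
# plus two small dimension tables. All names and values are illustrative.
fact_columns = {
    "date_key": [1, 1, 2, 2, 3],
    "cust_key": [10, 11, 10, 12, 11],
    "revenue": [100, 250, 75, 300, 125],
    "comment": ["a", "b", "c", "d", "e"],  # never read by the query below
}
dim_date = {1: {"year": 1997}, 2: {"year": 1997}, 3: {"year": 1998}}
dim_customer = {
    10: {"region": "ASIA"},
    11: {"region": "EUROPE"},
    12: {"region": "ASIA"},
}


def build_hash_table(dimension, attribute, predicate):
    """Keep only dimension rows that pass the predicate, keyed by primary key."""
    return {
        key: row[attribute]
        for key, row in dimension.items()
        if predicate(row)
    }


def star_join_revenue_by_year(fact, dates, customers):
    """Probe the filtered dimension hash tables per fact row and aggregate."""
    totals = defaultdict(int)
    # Column projection: only the three columns the query needs are touched.
    needed = zip(fact["date_key"], fact["cust_key"], fact["revenue"])
    for date_key, cust_key, revenue in needed:
        if cust_key not in customers:
            continue  # customer was filtered out by the dimension predicate
        year = dates.get(date_key)
        if year is None:
            continue  # date was filtered out by the dimension predicate
        totals[year] += revenue
    return dict(totals)


# Query: total revenue per year for customers in ASIA.
date_ht = build_hash_table(dim_date, "year", lambda row: True)
asia_ht = build_hash_table(
    dim_customer, "region", lambda row: row["region"] == "ASIA"
)
result = star_join_revenue_by_year(fact_columns, date_ht, asia_ht)
print(result)  # {1997: 475}
```

In this sketch the fact table is scanned once and the `comment` column is never read, which is the kind of saving column-oriented storage is meant to provide. How such a plan is mapped onto map tasks and multiple cores is described in the paper itself and is not modeled here.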

Index Terms
1. Information systems
  1. Data management systems

Published In

EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology

March 2012

643 pages

ISBN: 9781450307901

DOI: 10.1145/2247596

Editors:
Elke Rundensteiner, Worcester Polytechnic Institute
Volker Markl, Technische Universität Berlin, Germany
Ioana Manolescu, INRIA, France
Sihem Amer-Yahia, QCRI, Doha, Qatar
Felix Naumann, Hasso Plattner Institute, Potsdam, Germany
Ismail Ari, Ozyegin University, Turkey

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

EDBT '12

EDBT '12: 15th International Conference on Extending Database Technology

March 27 - 30, 2012

Berlin, Germany

Acceptance Rates

Overall Acceptance Rate 7 of 10 submissions, 70%

Bibliometrics & Citations

Article Metrics

Total Citations: 30
Total Downloads: 717
Downloads (Last 12 months): 9
Downloads (Last 6 weeks): 1

Reflects downloads up to 11 Aug 2024

Cited By

Barkhordari M, Niamanesh M (2018). Hengam a MapReduce-Based Distributed Data Warehouse for Big Data. International Journal of Artificial Life Research 8:1 (16-35). Online publication date: 1-Jan-2018.
https://dl.acm.org/doi/10.4018/IJALR.2018010102
Wang Y, Zhong Y, Ma Q, Yang G (2018). Handling data skew in joins based on cluster cost partitioning for MapReduce. Multiagent and Grid Systems 14:1 (103-123). Online publication date: 16-Apr-2018.
https://doi.org/10.3233/MGS-180283
Barkhordari M, Niamanesh M (2018). Chabok: a Map-Reduce based method to solve data warehouse problems. Journal of Big Data 5:1. Online publication date: 26-Oct-2018.
https://doi.org/10.1186/s40537-018-0144-5
Barkhordari M, Niamanesh M (2017). Aras. International Journal of Distributed Systems and Technologies 8:2 (47-60). Online publication date: 1-Apr-2017.
https://dl.acm.org/doi/10.4018/IJDST.2017040104
Wang H, Zhang J, Zhang D, Pumma S, Feng W (2017). PaPar: A Parallel Data Partitioning Framework for Big Data Applications. 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (605-614). Online publication date: May-2017.
https://doi.org/10.1109/IPDPS.2017.119
Singh H, Bawa S (2017). A MapReduce-based scalable discovery and indexing of structured big data. Future Generation Computer Systems 73:C (32-43). Online publication date: 1-Aug-2017.
https://dl.acm.org/doi/10.1016/j.future.2017.03.028
Barkhordari M, Niamanesh M (2017). Atrak. The Journal of Supercomputing 73:10 (4596-4610). Online publication date: 1-Oct-2017.
https://dl.acm.org/doi/10.1007/s11227-017-2037-3
Seera N, Taruna S (2017). Leveraging MapReduce with Column-Oriented Stores: Study of Solutions and Benefits. Big Data Analytics (39-46). Online publication date: 4-Oct-2017.
https://doi.org/10.1007/978-981-10-6620-7_5
Eltabakh M (2017). Data Organization and Curation in Big Data. Handbook of Big Data Technologies (143-178). Online publication date: 26-Feb-2017.
https://doi.org/10.1007/978-3-319-49340-4_5
Dede E, Sendir B, Kuzlu P, Weachock J, Govindaraju M, Ramakrishnan L (2016). Processing Cassandra Datasets with Hadoop-Streaming Based Approaches. IEEE Transactions on Services Computing 9:1 (46-58). Online publication date: 1-Jan-2016.
https://doi.org/10.1109/TSC.2015.2444838