DOI: 10.1145/2132876.2132888

Riding the elephant: managing ensembles with Hadoop

Published: 14 November 2011

Abstract

    Many important scientific applications do not fit the traditional model of a monolithic simulation running on thousands of nodes. Scientific workflows -- such as the Materials Genome project, the Energy Frontiers Research Center for Gas Separations Relevant to Clean Energy Technologies, climate simulations, and Uncertainty Quantification in fluid and solid dynamics -- all run large numbers of parallel analyses, which we call scientific ensembles. These scientific ensembles have a large number of tasks with control and data dependencies. Current tools for creating and managing these ensembles in HPC environments are limited and difficult to use; this is proving to be a limiting factor in running scientific ensembles at the large scale enabled by these HPC environments. MapReduce and its open-source implementation, Hadoop, are an attractive paradigm due to the simplicity of the programming model and intrinsic mechanisms for handling scalability and fault tolerance. In this paper, we evaluate the programmability of MapReduce and Hadoop for scientific workflow ensembles.
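    As a concrete illustration of the MapReduce pattern the abstract refers to, the sketch below expresses a small parameter-sweep ensemble as a map phase (one task per ensemble member) followed by a reduce phase that aggregates results by key. This is a minimal, self-contained sketch in plain Python, not the paper's implementation: the member analysis, the parameter names, and the grouping key are hypothetical stand-ins, and a real deployment would run the same two phases on Hadoop, which also supplies the scalability and fault-tolerance mechanisms mentioned above.

```python
# Illustrative sketch (not from the paper): an ensemble of independent
# analyses expressed in MapReduce terms. Each map task runs one ensemble
# member; the reduce step aggregates results that share a key.
from collections import defaultdict

def run_member(params):
    """Hypothetical per-member analysis: a toy stand-in computation."""
    temperature, pressure = params
    return temperature * pressure  # placeholder for a simulation result

def map_phase(ensemble):
    # Emit (key, value) pairs; here results are keyed by temperature.
    for params in ensemble:
        yield params[0], run_member(params)

def reduce_phase(pairs):
    # Group values by key and aggregate, as a Hadoop reducer would.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

# A 2x2 parameter sweep: four independent ensemble members.
ensemble = [(t, p) for t in (280, 300) for p in (1, 2)]
results = reduce_phase(map_phase(ensemble))
print(results)  # {280: 840, 300: 900}
```

    In Hadoop itself, each map invocation would run as an independent task attempt on a cluster node, so a failed ensemble member is retried automatically rather than aborting the whole run.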




    Published In

    MTAGS '11: Proceedings of the 2011 ACM international workshop on Many task computing on grids and supercomputers
    November 2011
    76 pages
    ISBN: 9781450311458
    DOI:10.1145/2132876
    General Chairs: Ioan Raicu, Ian Foster, Yong Zhao
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. MapReduce
    2. data-intensive
    3. hadoop
    4. scientific ensembles
    5. workflows

    Qualifiers

    • Research-article

    Conference

    SC '11


    Cited By

    • (2019) Toward High-Performance Computing and Big Data Analytics Convergence: The Case of Spark-DIY. IEEE Access, 7:156929-156955. DOI: 10.1109/ACCESS.2019.2949836
    • (2016) Tigres workflow library. Proceedings of the 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, 146-155. DOI: 10.1109/CCGrid.2016.54
    • (2015) Parallel Programming Models and Systems for High Performance Computing. Emerging Research in Cloud Distributed Computing Systems, 254-292. DOI: 10.4018/978-1-4666-8213-9.ch008
    • (2014) Big Data solutions on a small scale: Evaluating accessible high-performance computing for social research. Big Data & Society, 1(2). DOI: 10.1177/2053951714559105
    • (2014) Experiences with User-Centered Design for the Tigres Workflow API. Proceedings of the 2014 IEEE 10th International Conference on e-Science - Volume 01, 290-297. DOI: 10.1109/eScience.2014.56
    • (2014) Combining workflow templates with a shared space-based execution model. Proceedings of the 9th Workshop on Workflows in Support of Large-Scale Science, 50-58. DOI: 10.1109/WORKS.2014.14
    • (2013) SIDR. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 1-12. DOI: 10.1145/2503210.2503241
    • (2013) SciFlow: A dataflow-driven model architecture for scientific computing using Hadoop. 2013 IEEE International Conference on Big Data, 36-44. DOI: 10.1109/BigData.2013.6691725
    • (2012) FRIEDA. Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, 1096-1105. DOI: 10.1109/SC.Companion.2012.132
