DOI: 10.1145/2132876.2132888

Riding the elephant: managing ensembles with Hadoop

Published: 14 November 2011

Abstract

    Many important scientific applications do not fit the traditional model of a monolithic simulation running on thousands of nodes. Scientific workflows -- such as the Materials Genome project, the Energy Frontiers Research Center for Gas Separations Relevant to Clean Energy Technologies, climate simulations, and Uncertainty Quantification in fluid and solid dynamics -- all run large numbers of parallel analyses, which we call scientific ensembles. These scientific ensembles have a large number of tasks with control and data dependencies. Current tools for creating and managing these ensembles in HPC environments are limited and difficult to use; this is proving to be a limiting factor in running scientific ensembles at the large scale enabled by these HPC environments. MapReduce and its open-source implementation, Hadoop, are an attractive paradigm due to the simplicity of the programming model and intrinsic mechanisms for handling scalability and fault tolerance. In this paper, we evaluate the programmability of MapReduce and Hadoop for scientific workflow ensembles.
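    As a concrete illustration of the MapReduce pattern the abstract refers to, the sketch below expresses a small parameter-sweep ensemble as a map phase (one task per ensemble member) followed by a reduce phase that aggregates results by key. This is a minimal, self-contained sketch in plain Python, not the paper's implementation: the member analysis, the parameter names, and the grouping key are hypothetical stand-ins, and a real deployment would run the same two phases on Hadoop, which also supplies the scalability and fault-tolerance mechanisms mentioned above.

```python
# Illustrative sketch (not from the paper): an ensemble of independent
# analyses expressed in MapReduce terms. Each map task runs one ensemble
# member; the reduce step aggregates results that share a key.
from collections import defaultdict

def run_member(params):
    """Hypothetical per-member analysis: a toy stand-in computation."""
    temperature, pressure = params
    return temperature * pressure  # placeholder for a simulation result

def map_phase(ensemble):
    # Emit (key, value) pairs; here results are keyed by temperature.
    for params in ensemble:
        yield params[0], run_member(params)

def reduce_phase(pairs):
    # Group values by key and aggregate, as a Hadoop reducer would.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

# A 2x2 parameter sweep: four independent ensemble members.
ensemble = [(t, p) for t in (280, 300) for p in (1, 2)]
results = reduce_phase(map_phase(ensemble))
print(results)  # {280: 840, 300: 900}
```

    In Hadoop itself, each map invocation would run as an independent task attempt on a cluster node, so a failed ensemble member is retried automatically rather than aborting the whole run.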




    Published In

    MTAGS '11: Proceedings of the 2011 ACM international workshop on Many task computing on grids and supercomputers
    November 2011
    76 pages
    ISBN: 9781450311458
    DOI:10.1145/2132876
    General Chairs: Ioan Raicu, Ian Foster, Yong Zhao
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. MapReduce
    2. data-intensive
    3. hadoop
    4. scientific ensembles
    5. workflows

    Qualifiers

    • Research-article

    Conference

    SC '11


    Cited By

    • (2019) Toward High-Performance Computing and Big Data Analytics Convergence: The Case of Spark-DIY. IEEE Access, 7:156929-156955. DOI: 10.1109/ACCESS.2019.2949836
    • (2016) Tigres workflow library. Proceedings of the 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, 146-155. DOI: 10.1109/CCGrid.2016.54
    • (2015) Parallel Programming Models and Systems for High Performance Computing. Emerging Research in Cloud Distributed Computing Systems, 254-292. DOI: 10.4018/978-1-4666-8213-9.ch008
    • (2014) Big Data solutions on a small scale: Evaluating accessible high-performance computing for social research. Big Data & Society, 1(2). DOI: 10.1177/2053951714559105
    • (2014) Experiences with User-Centered Design for the Tigres Workflow API. Proceedings of the 2014 IEEE 10th International Conference on e-Science - Volume 01, 290-297. DOI: 10.1109/eScience.2014.56
    • (2014) Combining workflow templates with a shared space-based execution model. Proceedings of the 9th Workshop on Workflows in Support of Large-Scale Science, 50-58. DOI: 10.1109/WORKS.2014.14
    • (2013) SIDR. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 1-12. DOI: 10.1145/2503210.2503241
    • (2013) SciFlow: A dataflow-driven model architecture for scientific computing using Hadoop. 2013 IEEE International Conference on Big Data, 36-44. DOI: 10.1109/BigData.2013.6691725
    • (2012) FRIEDA. Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, 1096-1105. DOI: 10.1109/SC.Companion.2012.132
