DOI: 10.1145/2087522.2087526
research-article

Evaluating the suitability of MapReduce for surface temperature analysis codes

Published: 14 November 2011

Abstract

Processing large volumes of scientific data requires an efficient and scalable parallel computing framework to obtain meaningful information quickly. In this paper, we evaluate the suitability of the MapReduce framework for a scientific application from the environmental sciences. We consider ccc-gistemp, a Python reimplementation of the original NASA GISS model for estimating global temperature change, which takes land and ocean temperature records from different sites, removes duplicate records, and adjusts for urbanisation effects before calculating the 12-month running mean global temperature. The application consists of several stages, each with different characteristics; three of these stages have been ported to Hadoop using the mrjob library. We note performance bottlenecks encountered while porting and suggest possible solutions, including modifying data access patterns to overcome uneven distribution of input data.
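
As a rough illustration of the kind of processing the abstract describes, the following is a minimal sketch in plain Python, not the ccc-gistemp code or its mrjob port. It mimics a map step that keys records by month, a reduce step that drops duplicate station records and averages the rest, and a 12-month running mean over the result. The record layout `(station, year, month, temp)`, all function names, and the simple unweighted average are assumptions made for illustration; the real analysis is considerably more involved, and the urbanisation adjustment is not shown.

```python
from collections import defaultdict


def map_record(record):
    """Map step (hypothetical layout): key each record by (year, month)."""
    station, year, month, temp = record
    yield (year, month), (station, temp)


def reduce_month(key, values):
    """Reduce step: keep one reading per station, then average them.

    Assumption: a plain unweighted mean stands in for the real analysis.
    """
    first_seen = {}
    for station, temp in values:
        first_seen.setdefault(station, temp)  # ignore duplicate records
    return key, sum(first_seen.values()) / len(first_seen)


def run_job(records):
    """Simulate the shuffle phase locally by grouping mapped values by key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    return [reduce_month(key, groups[key]) for key in sorted(groups)]


def running_mean(values, window=12):
    """Trailing running mean; yields one value per full window."""
    means = []
    total = 0.0
    for i, value in enumerate(values):
        total += value
        if i >= window:
            total -= values[i - window]  # drop the value leaving the window
        if i >= window - 1:
            means.append(total / window)
    return means
```

In a Hadoop setting, the grouping done by `run_job` would be handled by the framework's shuffle, and a few very large keys would leave some reducers with far more work than others, which is the kind of uneven input distribution the abstract mentions.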


Published In

DataCloud-SC '11: Proceedings of the second international workshop on Data intensive computing in the clouds
November 2011
98 pages
ISBN:9781450311441
DOI:10.1145/2087522
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. data-intensive
  2. environmental sciences
  3. hadoop
  4. mapreduce

Qualifiers

  • Research-article

Conference

SC '11