Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/1851476.1851593acmconferencesArticle/Chapter ViewAbstractPublication PageshpdcConference Proceedingsconference-collections
research-article

Twister: a runtime for iterative MapReduce

Published: 21 June 2010 Publication History

Abstract

MapReduce programming model has simplified the implementation of many data parallel applications. The simplicity of the programming model and the quality of services provided by many implementations of MapReduce attract a lot of enthusiasm among distributed computing communities. From the years of experience in applying MapReduce to various scientific applications we identified a set of extensions to the programming model and improvements to its architecture that will expand the applicability of MapReduce to more classes of applications. In this paper, we present the programming model and the architecture of Twister an enhanced MapReduce runtime that supports iterative MapReduce computations efficiently. We also show performance comparisons of Twister with other similar runtimes such as Hadoop and DryadLINQ for large scale data parallel applications.

References

[1]
}}J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, pp. 107--113, 2008.
[2]
}}M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," presented at the Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, Lisbon, Portugal, 2007.
[3]
}}J. Ekanayake, A. Balkir, T. Gunarathne, G. Fox, C. Poulain, N. Araujo, and R. Barga, "DryadLINQ for Scientific Analyses," presented at the 5th IEEE International Conference on e-Science, Oxford UK, 2009.
[4]
}}J. Ekanayake, S. Pallickara, and G. Fox, "MapReduce for Data Intensive Scientific Analyses," presented at the Proceedings of the 2008 Fourth IEEE International Conference on eScience, 2008.
[5]
}}J. Ekanayake, X. Qiu, T. Gunarathne, S. Beason, and G. Fox, "High Performance Parallel Computing with Clouds and Cloud Technologies," in Cloud Computing and Software Services: Theory and Techniques, ed: CRC Press (Taylor and Francis).
[6]
}}G. Fox, S.-H. Bae, J. Ekanayake, X. Qiu, and H. Yuan, "Parallel Data Mining from Multicore to Cloudy Grids," presented at the International Advanced Research Workshop on High Performance Computing and Grids (HPC2008), Cetraro, Italy, 2008.
[7]
}}MPI (Message Passing Interface). Available: http://www-unix.mcs.anl.gov/mpi/
[8]
}}PVM (Parallel Virtual Machine). Available: http://www.csm.ornl.gov/pvm/
[9]
}}C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun, "Map-Reduce for Machine Learning on Multicore," in NIPS, ed: MIT Press, 2006, pp. 281--288.
[10]
}}Apache Hadoop. Available: http://hadoop.apache.org/
[11]
}}Y. Gu and R. L. Grossman, "Sector and Sphere: the design and implementation of a high-performance data cloud," Philosophical transactions. Series A, Mathematical, physical, and engineering sciences, vol. 367, pp. 2429--2445, 2009.
[12]
}}Twister: A Runtime for Iterative MapReduce. Available: http://www.iterativemapreduce.org/
[13]
}}Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and C. J., "DryadLINQ: A System for GeneralPurpose Distributed Data-Parallel Computing Using a HighLevel Language," in Symposium on Operating System Design and Implementation (OSDI), 2008.
[14]
}}Disco project. Available: http://discoproject.org/
[15]
}}S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," SIGOPS Oper. Syst. Rev., vol. 37, pp. 29--43, 2003.
[16]
}}J. B. MacQueen, "Some Methods for Classification and Analysis of MultiVariate Observations," in Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability. vol. 1, L. M. L. Cam and J. Neyman, Eds., ed: University of California Press, 1967.
[17]
}}K. Rose, E. Gurewwitz, and G. Fox, "A deterministic annealing approach to clustering," Pattern Recogn. Lett., vol. 11, pp. 589--594, 1990.
[18]
}}S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Available: http://infolab.stanford.edu/~backrub/google.html
[19]
}}J. de Leeuw, "Applications of convex analysis to multidimensional scaling," Recent Developments in Statistics, pp. 133--145, 1977.
[20]
}}S. Pallickara and G. Fox, "NaradaBrokering: A Distributed Middleware Framework and Architecture for Enabling Durable Peer-to-Peer Grids," presented at the Middleware 2003, 2003.
[21]
}}ActiveMQ. Available: http://activemq.apache.org/
[22]
}}C. Moretti, H. Bui, K. Hollingsworth, B. Rich, P. Flynn, and D. Thain, "All-Pairs: An Abstraction for Data Intensive Computing on Campus Grids," in IEEE Transactions on Parallel and Distributed Systems, 2010, pp. 33--46.
[23]
}}O. Gotoh, "An improved algorithm for matching biological sequences," Journal of Molecular Biology vol. 162, pp. 705--708, 1982.
[24]
}}Source Code. Smith Waterman Software. Available: http://jaligner.sourceforge.net/
[25]
}}J. Qiu, J. Ekanayake, T. Gunarathne, J. Y. Choi, S.-H. Bae, Y. Ruan, S. Ekanayake, S. Wu, S. Beason, G. Fox, M. Rho, and H. Tang, "Data Intensive Computing for Bioinformatics," in Data Intensive Distributed Computing, ed: IGI Publishers, 2010.
[26]
}}A. F. A. Smit, R. Hubley, and P. Green. (2004, Repeatmasker. Available: http://www.repeatmasker.org
[27]
}}J. Jurka, "Repbase Update: a database and an electronic journal of repetitive elements," Trends in Genetics, vol. 6, pp. 418--420, 2000.
[28]
}}J. Kruskal, "Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis," Psychometrika, vol. 29, pp. 1--27, 1964.
[29]
}}Y. Takane, Young, F. W., & de Leeuw, J., "Nonmetric individual differences multidimensional scaling: an alternating least squares method with optimal scaling features," Psychometrika, vol. 42, pp. 7--67, 1977.
[30]
}}I. Borg, & Groenen, P. J., Modern Multidimensional Scaling: Theory and Applications: Springer, 2005.
[31]
}}Y. Zhu, S. Ye, and X. Li, "Distributed PageRank computation based on iterative aggregation-disaggregation methods," presented at the Proceedings of the 14th ACM international conference on Information and knowledge management, Bremen, Germany, 2005.
[32]
}}S. Kamvar, T. Haveliwala, C. Manning, and G. Golub, "Exploiting the Block Structure of the Web for Computing PageRank," Stanford InfoLab, Technical Report 2003.
[33]
}}The Power Method. Available: http://en.wikipedia.org/wiki/Pagerank#Power_Method
[34]
}}(2009, The ClueWeb09 Dataset. Available: http://boston.lti.cs.cmu.edu/Data/clueweb09/
[35]
}}C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for multi-core and multiprocessor systems," in 13th International Symposium on High-Performance Computer Architecture, 2007, pp. 13--24.
[36]
}}C. Team. (2009, Condor DAGMan. Available: http://www.cs.wisc.edu/condor/dagman/.
[37]
}}LINQ Language-Integrated Query. Available: http://msdn.microsoft.com/en-us/netframework/aa904594.aspx
[38]
}}Y. Zhao, M. Hategan, B. Clifford, I. Foster, G. v. Laszewski, V. Nefedova, I. Raicu, T. Stef-Praun, and M. Wilde, "Swift: Fast, Reliable, Loosely Coupled Parallel Computation," in IEEE Congress on Services, 2007, pp. 199--206.

Cited By

View all
  • (2024)A method of test case set generation in the commutativity test of reduce functionsScience of Computer Programming10.1016/j.scico.2023.103006231:COnline publication date: 1-Jan-2024
  • (2024)Methods for concept analysis and multi-relational data mining: a systematic literature reviewKnowledge and Information Systems10.1007/s10115-024-02139-x66:9(5113-5150)Online publication date: 30-May-2024
  • (2023)Trends in High-Performance Data Engineering for Data AnalyticsNew Trends and Challenges in Open Data 10.5772/intechopen.1001458Online publication date: 3-May-2023
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
June 2010
911 pages
ISBN:9781605589428
DOI:10.1145/1851476
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 June 2010

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. MapReduce
  2. cloud technologies
  3. iterative algorithms

Qualifiers

  • Research-article

Conference

HPDC '10
Sponsor:

Acceptance Rates

Overall Acceptance Rate 166 of 966 submissions, 17%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)41
  • Downloads (Last 6 weeks)1
Reflects downloads up to 22 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2024)A method of test case set generation in the commutativity test of reduce functionsScience of Computer Programming10.1016/j.scico.2023.103006231:COnline publication date: 1-Jan-2024
  • (2024)Methods for concept analysis and multi-relational data mining: a systematic literature reviewKnowledge and Information Systems10.1007/s10115-024-02139-x66:9(5113-5150)Online publication date: 30-May-2024
  • (2023)Trends in High-Performance Data Engineering for Data AnalyticsNew Trends and Challenges in Open Data 10.5772/intechopen.1001458Online publication date: 3-May-2023
  • (2023)Paradise: Real-Time, Generalized, and Distributed Provenance-Based Intrusion DetectionIEEE Transactions on Dependable and Secure Computing10.1109/TDSC.2022.316087920:2(1624-1640)Online publication date: 1-Mar-2023
  • (2023)Map-Reduce (Hadoop) based Data Clustering for BigData: A Survey2023 1st International Conference on Cognitive Computing and Engineering Education (ICCCEE)10.1109/ICCCEE55951.2023.10424474(1-4)Online publication date: 27-Apr-2023
  • (2023)RD-FCA: A resilient distributed framework for formal concept analysisJournal of Parallel and Distributed Computing10.1016/j.jpdc.2023.04.011179(104710)Online publication date: Sep-2023
  • (2023)Selected Aspects of Interactive Feature ExtractionTransactions on Rough Sets XXIII10.1007/978-3-662-66544-2_8(121-287)Online publication date: 1-Jan-2023
  • (2022)Data Locality in High Performance Computing, Big Data, and Converged Systems: An Analysis of the Cutting Edge and a Future System ArchitectureElectronics10.3390/electronics1201005312:1(53)Online publication date: 23-Dec-2022
  • (2022)Imperative or Functional Control Flow HandlingACM SIGMOD Record10.1145/3542700.354271551:1(60-67)Online publication date: 1-Jun-2022
  • (2022)PushBox: Making Use of Every Bit of Time to Accelerate Completion of Data-Parallel JobsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2022.318203733:12(4256-4269)Online publication date: 1-Dec-2022
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media