HaLoop: efficient iterative data processing on large clusters

Published: 01 September 2010

Abstract

The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce and Dryad are two popular platforms in which the dataflow takes the form of a directed acyclic graph of operators. These platforms lack built-in support for iterative programs, which arise naturally in many applications, including data mining, web ranking, graph analysis, and model fitting. This paper presents HaLoop, a modified version of the Hadoop MapReduce framework that is designed to serve these applications. HaLoop not only extends MapReduce with programming support for iterative applications but also dramatically improves their efficiency by making the task scheduler loop-aware and by adding various caching mechanisms. We evaluated HaLoop on real queries and real datasets. Compared with Hadoop, on average, HaLoop reduces query runtimes by a factor of 1.85 and shuffles only 4% of the data between mappers and reducers.
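
To make the idea of loop-invariant caching concrete, the following is a minimal single-process sketch in Python. It is not HaLoop's API or implementation: the toy graph `LINKS`, the functions `rank_step` and `run`, and the counter `invariant_records_moved` are all hypothetical names introduced for illustration. The sketch assumes a PageRank-style iteration in which the link structure never changes between iterations, and it simply counts how many invariant records would be re-read on each pass with and without a cache.

```python
from collections import defaultdict

# Hypothetical toy graph: the loop-invariant input (identical in every iteration).
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}


def rank_step(ranks, links, damping=0.85):
    """One map/reduce-style pass: each page sends rank shares to its out-links."""
    contrib = defaultdict(float)
    for page, outs in links.items():          # "map": emit a share per out-link
        for dest in outs:
            contrib[dest] += ranks[page] / len(outs)
    n = len(links)
    # "reduce": combine the shares received by each page
    return {p: (1 - damping) / n + damping * contrib[p] for p in links}


def run(iterations, cache_invariant):
    """Run the loop and count how many invariant records are (re)loaded."""
    ranks = {p: 1.0 / len(LINKS) for p in LINKS}
    invariant_records_moved = 0
    cached = None
    for _ in range(iterations):
        if cache_invariant and cached is not None:
            links = cached                    # reuse the copy from iteration 1
        else:
            links = dict(LINKS)               # stands in for re-reading/re-shuffling
            invariant_records_moved += len(links)
            cached = links
        ranks = rank_step(ranks, links)
    return ranks, invariant_records_moved


if __name__ == "__main__":
    ranks_plain, moved_plain = run(5, cache_invariant=False)
    ranks_cached, moved_cached = run(5, cache_invariant=True)
    print("records moved without cache:", moved_plain)
    print("records moved with cache:   ", moved_cached)
    print("same result:", ranks_plain == ranks_cached)
```

In this toy setting, five iterations over three invariant records move 15 records without the cache and 3 with it, while producing the same ranks. The figures reported in the abstract come from the authors' evaluation on real clusters and are unrelated to the counts in this sketch.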

Published In

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
September 2010
1658 pages
ISSN: 2150-8097
Editors: Elisa Bertino, Paolo Atzeni, Kian Lee Tan, Yi Chen, Y. C. Tay

Publisher

VLDB Endowment
