research-article

HaLoop: efficient iterative data processing on large clusters

Editors: Elisa Bertino, Paolo Atzeni, Kian Lee Tan, Yi Chen, Y. C. Tay

Authors: Magdalena Balazinska, Michael D. Ernst

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2

Pages 285 - 296

https://doi.org/10.14778/1920841.1920881

Published: 01 September 2010

Abstract

The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce and Dryad are two popular platforms in which the dataflow takes the form of a directed acyclic graph of operators. These platforms lack built-in support for iterative programs, which arise naturally in many applications, including data mining, web ranking, graph analysis, and model fitting. This paper presents HaLoop, a modified version of the Hadoop MapReduce framework that is designed to serve these applications. HaLoop not only extends MapReduce with programming support for iterative applications, but also dramatically improves their efficiency by making the task scheduler loop-aware and by adding various caching mechanisms. We evaluated HaLoop on real queries and real datasets. Compared with Hadoop, on average, HaLoop reduces query runtimes by a factor of 1.85, and shuffles only 4% of the data between mappers and reducers.
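
To make the notion of an iterative program concrete, the following is a minimal single-process sketch of a web-ranking computation written as repeated map/reduce rounds driven by an outer loop. It is not HaLoop's or Hadoop's API: the names `map_reduce` and `pagerank`, the damping value, and the convergence tolerance are all assumptions chosen for illustration. The point it illustrates is that the link table is identical in every round (loop-invariant) while only the ranks change, which is the kind of repeated work that loop-aware scheduling and caching aim to avoid.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one in-memory MapReduce round: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def pagerank(links, damping=0.85, tol=1e-6, max_iters=100):
    """Repeat a rank-update round until the ranks stop changing.

    `links` maps each node to its list of out-links. It is loop-invariant:
    every round reads it unchanged, and only `ranks` varies between rounds.
    Returns the final ranks and the number of rounds executed.
    """
    nodes = set(links) | {d for dests in links.values() for d in dests}
    ranks = {n: 1.0 / len(nodes) for n in nodes}
    iteration = 0
    for iteration in range(1, max_iters + 1):

        def mapper(node):
            # Emit a zero contribution so nodes without in-links are kept.
            yield node, 0.0
            dests = links.get(node, [])
            for dest in dests:
                yield dest, ranks[node] / len(dests)

        def reducer(node, contributions):
            return (1 - damping) / len(nodes) + damping * sum(contributions)

        new_ranks = map_reduce(nodes, mapper, reducer)
        # Termination test: total change in rank across all nodes.
        delta = sum(abs(new_ranks[n] - ranks[n]) for n in nodes)
        ranks = new_ranks
        if delta < tol:
            break
    return ranks, iteration


# Hypothetical three-page graph used only for illustration.
example_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
example_ranks, example_rounds = pagerank(example_links)
```

In a plain MapReduce deployment, each pass through the loop in this sketch would correspond to a separately submitted job, with the driver program responsible for the termination test and with the invariant link data re-read and re-shuffled on every round.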



Published In

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
September 2010, 1658 pages
ISSN: 2150-8097
Publisher: VLDB Endowment
