Article

MapReduce online

Authors:

Joseph M. Hellerstein,

Khaled Elmeleegy,

Russell Sears

NSDI'10: Proceedings of the 7th USENIX conference on Networked systems design and implementation

Page 21

Published: 28 April 2010

Abstract

MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, many implementations of MapReduce materialize the entire output of each map and reduce task before it can be consumed. In this paper, we propose a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We present a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see "early returns" from a job as it is being computed. Our Hadoop Online Prototype (HOP) also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing. HOP retains the fault tolerance properties of Hadoop and can run unmodified user-defined MapReduce programs.
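The idea of "early returns" from a pipelined job can be illustrated with a toy, single-process sketch. This is not HOP's implementation or Hadoop's API: the names `map_words`, `pipelined_word_count`, and `snapshot_every` are hypothetical, and the example assumes a simple word-count job in which the reducer consumes each map output record as soon as it is produced, rather than waiting for the whole map output to be materialized.

```python
from collections import Counter
from typing import Dict, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map function: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def pipelined_word_count(
    lines: List[str], snapshot_every: int
) -> Iterator[Tuple[float, Dict[str, int]]]:
    """Push map output into the reducer state record by record.

    Yields (fraction_of_input_processed, running_counts) snapshots, so a
    caller can inspect partial results before the job has finished.
    """
    counts: Counter = Counter()
    total = len(lines)
    for i, line in enumerate(lines, start=1):
        # Pipelining: each map record is reduced as soon as it is emitted.
        for key, value in map_words(line):
            counts[key] += value
        # Online aggregation: publish a snapshot at fixed progress points
        # and once more when the input is exhausted.
        if i % snapshot_every == 0 or i == total:
            yield i / total, dict(counts)


if __name__ == "__main__":
    data = ["a b", "b c", "a a", "c"]
    for progress, snapshot in pipelined_word_count(data, snapshot_every=2):
        print(f"{progress:.0%} of input processed: {snapshot}")
```

Each snapshot is only an estimate of the final answer; the last one, at 100% progress, equals the result a batch job would produce. A real system would also have to address the fault tolerance concerns the abstract mentions, which this sketch ignores.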


Sponsors

USENIX Association

Publisher

USENIX Association

United States
