DOI: 10.5555/1855711.1855732
Article

MapReduce online

Published: 28 April 2010

Abstract

MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, many implementations of MapReduce materialize the entire output of each map and reduce task before it can be consumed. In this paper, we propose a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We present a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see "early returns" from a job as it is being computed. Our Hadoop Online Prototype (HOP) also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing. HOP retains the fault tolerance properties of Hadoop and can run unmodified user-defined MapReduce programs.
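
To make the idea of pipelining and "early returns" concrete, the following is a minimal single-process sketch, not HOP's implementation and not the Hadoop API. It assumes a toy word-count job; the names `map_words`, `online_word_count`, and `snapshot_every` are hypothetical and chosen only for illustration. The map step hands records to the reduce step one at a time instead of materializing its whole output first, and the reduce step publishes a snapshot of its running aggregate at intervals.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def map_words(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map step: lazily emit (word, 1) pairs, one record at a time."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def online_word_count(
    lines: Iterable[str], snapshot_every: int = 4
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Reduce step that consumes map output as it is produced.

    Yields (records_consumed, counts_so_far) after every
    `snapshot_every` records, then once more at the end if the
    last records were not already covered by a snapshot.
    """
    counts: Counter = Counter()
    consumed = 0
    for word, n in map_words(lines):  # pipelined: no intermediate list
        counts[word] += n
        consumed += 1
        if consumed % snapshot_every == 0:
            yield consumed, dict(counts)  # early, approximate answer
    if consumed % snapshot_every != 0:
        yield consumed, dict(counts)  # final, exact answer


if __name__ == "__main__":
    for seen, snapshot in online_word_count(["a b a", "b c a"]):
        print(seen, snapshot)
```

Each intermediate snapshot is only an estimate based on the input seen so far; the last snapshot matches what a batch run would return. A real system would also have to handle partitioning across machines and task failures, which this sketch ignores.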

Published In

NSDI'10: Proceedings of the 7th USENIX conference on Networked systems design and implementation
April 2010
29 pages

Sponsors

  • USENIX Assoc

Publisher

USENIX Association

United States
