
Themis: an I/O-efficient MapReduce

Published: 14 October 2012

Abstract

"Big Data" computing increasingly relies on the MapReduce programming model for scalable processing of large data collections. Many MapReduce jobs are I/O-bound, so minimizing the number of I/O operations is critical to their performance. In this work, we present Themis, a MapReduce implementation that reads and writes data records to disk exactly twice, the minimum possible for data sets that cannot fit in memory.
To minimize I/O, Themis makes fundamentally different design decisions from previous MapReduce implementations. Themis performs a wide variety of MapReduce jobs -- including click-log analysis, DNA read sequence alignment, and PageRank -- at nearly the speed of TritonSort's record-setting sort performance [29].
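As background, the map/shuffle/reduce structure that the abstract refers to can be sketched with a minimal in-memory illustration. This is a generic word-count job, not Themis's disk-based pipeline; all function names here are illustrative, and the shuffle is a simple in-memory stand-in for the disk-heavy phase whose I/O cost Themis minimizes:

```python
from collections import defaultdict

def map_phase(records, map_fn):
    # Apply the user's map function to each input record,
    # emitting intermediate (key, value) pairs.
    for record in records:
        yield from map_fn(record)

def shuffle_phase(pairs):
    # Group intermediate values by key. In a real MapReduce system this
    # phase is disk- and network-bound; here it is purely in-memory.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    # Apply the user's reduce function to each key group.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Illustrative word-count job.
def wc_map(line):
    for word in line.split():
        yield (word, 1)

def wc_reduce(key, values):
    return sum(values)

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle_phase(map_phase(lines, wc_map)), wc_reduce)
print(counts["the"])  # 2
```

The point of contrast with the paper: in a conventional implementation, each record may be read and written several times across the map, spill, shuffle, and reduce stages, whereas Themis arranges the pipeline so each record touches disk exactly twice.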

References

[1]
A. Aggarwal and J. Vitter. The Input/Output Complexity of Sorting and Related Problems. CACM, 31(9), Sept. 1988.
[2]
E. Anderson and J. Tucek. Efficiency Matters! In HotStorage, 2009.
[3]
E. Bauer, X. Zhang, and D. Kimber. Practical System Reliability (pg. 226). Wiley-IEEE Press, 2009.
[4]
Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient Iterative Data Processing on Large Clusters. In VLDB, 2010.
[5]
G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, and A. Fox. Microreboot -- A Technique for Cheap Recovery. In OSDI, 2004.
[6]
B. Chattopadhyay, L. Lin, W. Liu, S. Mittal, P. Aragonda, V. Lychagina, Y. Kwon, and M. Wong. Tenzing: A SQL Implementation On The MapReduce Framework. In Proc. VLDB Endowment, 2011.
[7]
Dell and Cloudera Hadoop Platform. http://www.cloudera.com/company/press-center/releases/dell-and-cloudera-collaborate-to-enable-large-scale-data-analysis-and-modeling-through-open-source/.
[8]
J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, 2004.
[9]
D. DeWitt and J. Gray. Parallel Database Systems: The Future of High Performance Database Systems. CACM, 35(6), June 1992.
[10]
D. DeWitt, J. Naughton, and D. Schneider. Parallel Sorting on a Shared-Nothing Architecture Using Probabilistic Splitting. In PDIS, 1991.
[11]
E. N. M. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson. A Survey of Rollback-Recovery Protocols in Message-Passing Systems. ACM CSUR, 34(3), Sept. 2002.
[12]
D. Ford, F. Labelle, F. I. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, and S. Quinlan. Availability in Globally Distributed Storage Systems. In OSDI, 2010.
[13]
M. Hadjieleftheriou, J. Byers, and G. Kollios. Robust Sketching and Aggregation of Distributed Data Streams. Technical Report 2005--011, Boston University, 2005.
[14]
Hadoop PoweredBy Index. http://wiki.apache.org/hadoop/PoweredBy.
[15]
B. Howe. lakewash_combined_v2.genes.nucleotide. https://dada.cs.washington.edu/research/projects/db-data-L1_bu/escience_datasets/seq_alignment/.
[16]
Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions. In SoCC, 2010.
[17]
Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. SkewTune: Mitigating Skew in MapReduce Applications. In SIGMOD, 2012.
[18]
D. Logothetis, C. Olston, B. Reed, K. C. Webb, and K. Yocum. Stateful Bulk Processing for Incremental Analytics. In SoCC, 2010.
[19]
G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets. In SIGMOD, 1999.
[20]
J. P. McDermott, G. J. Babu, J. C. Liechty, and D. K. Lin. Data Skeletons: Simultaneous Estimation of Multiple Quantiles for Massive Streaming Datasets with Applications to Density Estimation. Statistics and Computing, 17(4), Dec. 2007.
[21]
M. C. Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 25(11):1363--9, 2009.
[22]
J. C. Mogul and K. K. Ramakrishnan. Eliminating Receive Livelock in an Interrupt-Driven Kernel. ACM TOCS, 15(3), Aug. 1997.
[23]
C. Monash. Petabyte-Scale Hadoop Clusters (Dozens of Them). http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/.
[24]
W. A. Najjar, E. A. Lee, and G. R. Gao. Advances in the Dataflow Computational Model. Parallel Computing, 25(13): 1907--1929, 1999.
[25]
S. Nath, H. Yu, P. B. Gibbons, and S. Seshan. Subtleties in Tolerating Correlated Failures in Wide-Area Storage Systems. In NSDI, 2006.
[26]
D. Peng and F. Dabek. Large-Scale Incremental Processing Using Distributed Transactions and Notifications. In OSDI, 2010.
[27]
E. Pinheiro, W. Weber, and L. A. Barroso. Failure Trends in a Large Disk Drive Population. In FAST, 2007.
[28]
S. Rao, R. Ramakrishnan, A. Silberstein, M. Ovsiannikov, and D. Reeves. Sailfish: A Framework for Large Scale Data Processing. Technical Report YL-2012-002, Yahoo! Research, 2012.
[29]
A. Rasmussen, G. Porter, M. Conley, H. V. Madhyastha, R. N. Mysore, A. Pucher, and A. Vahdat. TritonSort: A Balanced Large-Scale Sorting System. In NSDI, 2011.
[30]
Recovery-Oriented Computing. http://roc.cs.berkeley.edu/.
[31]
R. H. Arpaci-Dusseau and A. C. Arpaci-Dusseau. Fail-Stutter Fault Tolerance. In HotOS, 2001.
[32]
B. Schroeder and G. Gibson. A Large-Scale Study of Failures in High-Performance Computing Systems. In DSN, 2006.
[33]
B. Schroeder and G. A. Gibson. Understanding Disk Failure Rates: What Does an MTTF of 1,000,000 Hours Mean to You? ACM TOS, 3(3), Oct. 2007.
[34]
M. A. Shah, J. M. Hellerstein, S. Chandrasekaran, and M. J. Franklin. Flux: An Adaptive Partitioning Operator for Continuous Query Systems. In ICDE, 2003.
[35]
A. D. Smith and W. Chung. The RMAP Software for Short-Read Mapping. http://rulai.cshl.edu/rmap/.
[36]
Sort Benchmark. http://sortbenchmark.org/.
[37]
J. S. Vitter. Random Sampling with a Reservoir. ACM TOMS, 11(1), Mar. 1985.
[38]
Freebase Wikipedia Extraction (WEX). http://wiki.freebase.com/wiki/WEX.
[39]
Apache Hadoop. http://hadoop.apache.org/.
[40]
Scaling Hadoop to 4000 Nodes at Yahoo! http://developer.yahoo.net/blogs/hadoop/2008/09/scaling_hadoop_to_4000_nodes_a.html.
[41]
M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In NSDI, 2012.
[42]
M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. In OSDI, 2008.


Published In

SoCC '12: Proceedings of the Third ACM Symposium on Cloud Computing
October 2012, 325 pages
ISBN: 9781450317610
DOI: 10.1145/2391229

Publisher

Association for Computing Machinery, New York, NY, United States


Conference

SOCC '12: ACM Symposium on Cloud Computing
October 14-17, 2012
San Jose, California

Acceptance Rates

Overall acceptance rate: 169 of 722 submissions (23%)


Cited By

  • (2023) Exoshuffle: An Extensible Shuffle Architecture. In Proceedings of the ACM SIGCOMM 2023 Conference, pp. 564-577. DOI: 10.1145/3603269.3604848. Online publication date: 10-Sep-2023.
  • (2022) Shadow: Exploiting the Power of Choice for Efficient Shuffling in MapReduce. IEEE Transactions on Big Data, 8(1):253-267. DOI: 10.1109/TBDATA.2019.2943473. Online publication date: 1-Feb-2022.
  • (2022) VeloxDFS: Streaming Access to Distributed Datasets to Reduce Disk Seeks. In 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid), pp. 31-40. DOI: 10.1109/CCGrid54584.2022.00012. Online publication date: May-2022.
  • (2022) SMART: Speedup Job Completion Time by Scheduling Reduce Tasks. Journal of Computer Science and Technology, 37(4):763-778. DOI: 10.1007/s11390-022-2118-5. Online publication date: 30-Jul-2022.
  • (2021) Apache Nemo: A Framework for Optimizing Distributed Data Processing. ACM Transactions on Computer Systems, 38(3-4):1-31. DOI: 10.1145/3468144. Online publication date: 15-Oct-2021.
  • (2021) Jumpgate. In Proceedings of the 14th ACM International Conference on Systems and Storage, pp. 1-12. DOI: 10.1145/3456727.3463770. Online publication date: 14-Jun-2021.
  • (2021) Bucket MapReduce: Relieving the Disk I/O Intensity of Data-Intensive Applications in MapReduce Frameworks. In 2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 18-25. DOI: 10.1109/PDP52278.2021.00013. Online publication date: Mar-2021.
  • (2021) Data-driven Performance Tuning for Big Data Analytics Platforms. Big Data Research, 100206. DOI: 10.1016/j.bdr.2021.100206. Online publication date: Jan-2021.
  • (2020) Improved Intermediate Data Management for MapReduce Frameworks. In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 536-545. DOI: 10.1109/IPDPS47924.2020.00062. Online publication date: May-2020.
  • (2020) Plumb: Efficient Stream Processing of Multi-user Pipelines. Software: Practice and Experience, 51(2):385-408. DOI: 10.1002/spe.2909. Online publication date: 11-Oct-2020.
