DOI: 10.1145/2391229.2391233
research-article

Sailfish: a framework for large scale data processing

Published: 14 October 2012

Abstract

In this paper, we present Sailfish, a new Map-Reduce framework for large scale data processing. The Sailfish design is centered around aggregating intermediate data, specifically data produced by map tasks and consumed later by reduce tasks, to improve performance by batching disk I/O. We introduce an abstraction called I-files for supporting data aggregation, and describe how we implemented it as an extension of the distributed filesystem to efficiently batch data written by multiple writers and read by multiple readers. Sailfish adapts the Map-Reduce layer in Hadoop to use I-files for transporting data from map tasks to reduce tasks. We present experimental results demonstrating that Sailfish improves on the performance of standard Hadoop; in particular, we show speedups ranging from 20% to a factor of 5 on a representative mix of real jobs and datasets at Yahoo!. We also demonstrate that the Sailfish design enables auto-tuning functionality that handles changes in data volume and skewed distributions effectively, thereby addressing an important practical drawback of Hadoop, which in contrast relies on programmers to configure system parameters appropriately for each job and each input dataset. Our Sailfish implementation and the other software components developed as part of this work have been released as open source.
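
The aggregation idea in the abstract can be illustrated with a toy, single-process sketch: several map tasks append records to a shared per-partition file, and each reduce task reads one aggregated file. Everything here is an assumption made for illustration, including the `IFileStore` class, the partitioner, and the read-side sort; it is not the Sailfish API and does not model the distributed filesystem extension.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, int]


class IFileStore:
    """Toy in-memory stand-in for a set of I-files, one per partition (hypothetical)."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        self._files: Dict[int, List[Record]] = defaultdict(list)

    def partition_of(self, key: str) -> int:
        # Deterministic partitioner chosen for illustration only.
        return sum(key.encode("utf-8")) % self.num_partitions

    def append(self, record: Record) -> None:
        # Any writer may append; records from different map tasks interleave.
        self._files[self.partition_of(record[0])].append(record)

    def read(self, partition: int) -> List[Record]:
        return list(self._files[partition])


def run_map(task_input: Iterable[str], store: IFileStore) -> None:
    # Each map task emits (word, 1) into the shared store.
    for word in task_input:
        store.append((word, 1))


def run_reduce(partition: int, store: IFileStore) -> Dict[str, int]:
    totals: Dict[str, int] = {}
    # Sorting on the read side is a simplification assumed for this sketch.
    for key, value in sorted(store.read(partition)):
        totals[key] = totals.get(key, 0) + value
    return totals


def word_count(splits: List[List[str]], num_partitions: int) -> Dict[str, int]:
    store = IFileStore(num_partitions)
    for split in splits:
        run_map(split, store)
    result: Dict[str, int] = {}
    for p in range(num_partitions):
        result.update(run_reduce(p, store))
    return result
```

With M map tasks and R partitions, this layout leaves R aggregated files rather than one output per map task, which is the property that makes batched I/O possible in the first place.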



Published In

SoCC '12: Proceedings of the Third ACM Symposium on Cloud Computing
October 2012
325 pages
ISBN: 9781450317610
DOI: 10.1145/2391229

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

SoCC '12: ACM Symposium on Cloud Computing
October 14-17, 2012
San Jose, California

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%

