research-article

Sailfish: a framework for large scale data processing

Authors:

Raghu Ramakrishnan,

Adam Silberstein,

Mike Ovsiannikov,

Damian Reeves

SoCC '12: Proceedings of the Third ACM Symposium on Cloud Computing

Article No.: 4, Pages 1 - 14

https://doi.org/10.1145/2391229.2391233

Published: 14 October 2012

Abstract

In this paper, we present Sailfish, a new Map-Reduce framework for large scale data processing. The Sailfish design is centered around aggregating intermediate data, specifically data produced by map tasks and consumed later by reduce tasks, to improve performance by batching disk I/O. We introduce an abstraction called I-files for supporting data aggregation, and describe how we implemented it as an extension of the distributed filesystem to efficiently batch data written by multiple writers and read by multiple readers. Sailfish adapts the Map-Reduce layer in Hadoop to use I-files for transporting data from map tasks to reduce tasks. We present experimental results demonstrating that Sailfish improves on the performance of standard Hadoop; in particular, we show results ranging from 20% faster to 5 times faster on a representative mix of real jobs and datasets at Yahoo!. We also demonstrate that the Sailfish design enables auto-tuning functionality that handles changes in data volume and skewed distributions effectively, thereby addressing an important practical drawback of Hadoop, which in contrast relies on programmers to configure system parameters appropriately for each job and each input dataset. Our Sailfish implementation and the other software components developed as part of this work have been released as open source.
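The aggregation idea in the abstract can be illustrated with a toy, in-memory sketch. It is not the Sailfish implementation: the `IFile` class, the `partition_of` function, and the word-count job are hypothetical names chosen for illustration, and the real system stores I-files in a distributed filesystem rather than in Python lists. The sketch only shows the data flow: many map tasks append records for the same partition to one shared file, and each reduce step then reads a single aggregated file.

```python
class IFile:
    """Hypothetical stand-in for one aggregated intermediate file (one per partition)."""

    def __init__(self):
        self.records = []

    def append(self, key, value):
        # Any number of map tasks may append to the same file (multiple writers).
        self.records.append((key, value))

    def read_sorted(self):
        # A reduce task reads the whole aggregated file in key order.
        return sorted(self.records)


def partition_of(key, num_partitions):
    # Deterministic toy partitioner (an assumption): code point of the first character.
    return ord(key[0]) % num_partitions


def run_word_count(splits, num_partitions):
    """Run a toy word count whose intermediate data is aggregated per partition."""
    ifiles = [IFile() for _ in range(num_partitions)]

    # Map phase: each input split plays the role of one map task.
    for split in splits:
        for word in split.split():
            ifiles[partition_of(word, num_partitions)].append(word, 1)

    # Reduce phase: one pass over each aggregated file.
    counts = {}
    for ifile in ifiles:
        for key, value in ifile.read_sorted():
            counts[key] = counts.get(key, 0) + value
    return counts


if __name__ == "__main__":
    print(run_word_count(["a b a", "b c"], num_partitions=2))
```

In this toy model each reduce step touches one file regardless of how many map tasks ran, whereas in stock Hadoop each reduce task fetches a segment from every map task's output.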

Index Terms

Sailfish: a framework for large scale data processing
1. Software and its engineering
  1. Software organization and properties
    1. Software system structures
      1. Distributed systems organizing principles
        Organizing principles for web applications

Published In

SoCC '12: Proceedings of the Third ACM Symposium on Cloud Computing

October 2012

325 pages

ISBN:9781450317610

DOI:10.1145/2391229

Program Chairs: Michael Carey (UC Irvine), Steven Hand (University of Cambridge)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SoCC '12: ACM Symposium on Cloud Computing

October 14 - 17, 2012

San Jose, California

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%


