Riffle: optimized shuffle service for large-scale data analytics

Authors:

Michael J. Freedman

EuroSys '18: Proceedings of the Thirteenth EuroSys Conference

Article No.: 43, Pages 1 - 15

https://doi.org/10.1145/3190508.3190534

Published: 23 April 2018

Abstract

The rapidly growing size of data and complexity of analytics present new challenges for large-scale data processing systems. Modern systems keep data partitions in memory for pipelined operators, and persist data across stages with wide dependencies on disks for fault tolerance. While processing can often scale well by splitting jobs into smaller tasks for better parallelism, all-to-all data transfers, called shuffle operations, become the scaling bottleneck when running many small tasks in multi-stage data analytics jobs. Our key observation is that this bottleneck is due to the superlinear increase in disk I/O operations as data volume increases.
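To make the superlinear growth concrete, the following is a minimal sketch rather than the authors' model. It assumes that every reduce task fetches one block from every map task's output, and that both stages size their tasks at the same fixed number of bytes; the function name and the task size are hypothetical and chosen only for illustration.

```python
def shuffle_fetch_requests(data_bytes: int, task_input_bytes: int) -> int:
    """Count block fetches when each reducer reads one block per map output."""
    # Assumption: map and reduce tasks both handle task_input_bytes of data,
    # so the number of tasks in each stage grows linearly with the data size.
    num_map_tasks = -(-data_bytes // task_input_bytes)  # ceiling division
    num_reduce_tasks = num_map_tasks
    # All-to-all transfer: one fetch per (map task, reduce task) pair.
    return num_map_tasks * num_reduce_tasks


GiB = 1024 ** 3
small = shuffle_fetch_requests(100 * GiB, 1 * GiB)
large = shuffle_fetch_requests(1000 * GiB, 1 * GiB)
print(small, large, large // small)
```

Under these assumptions, ten times more data produces one hundred times more fetch requests, while the average block shrinks, which is one way the I/O count can outpace the data volume.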

We present Riffle, an optimized shuffle service for big-data analytics frameworks that significantly improves I/O efficiency and scales to process petabytes of data. To do so, Riffle efficiently merges fragmented intermediate shuffle files into larger block files, and thus converts small, random disk I/O requests into large, sequential ones. Riffle further improves performance and fault tolerance by mixing both merged and unmerged block files to minimize merge operation overhead. Using Riffle, Facebook production jobs on Spark clusters with over 1,000 executors experience up to a 10x reduction in the number of shuffle I/O requests and 40% improvement in the end-to-end job completion time.
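The merging idea can be sketched as follows. This is an illustrative toy, not Riffle's implementation: in-memory lists of byte strings stand in for on-disk shuffle files, the fixed `merge_factor` is a hypothetical policy, and the names `merge_map_outputs` and `fetch_requests` are invented for the example.

```python
from typing import List

# One map output file: a list holding one block per reduce partition.
MapOutput = List[bytes]


def merge_map_outputs(outputs: List[MapOutput], merge_factor: int) -> List[MapOutput]:
    """Merge every `merge_factor` map outputs into one larger file."""
    merged = []
    for start in range(0, len(outputs), merge_factor):
        group = outputs[start:start + merge_factor]
        num_partitions = len(group[0])
        # Concatenate the blocks that belong to the same reduce partition,
        # so a reducer reads one large contiguous block per merged file.
        merged.append(
            [b"".join(out[p] for out in group) for p in range(num_partitions)]
        )
    return merged


def fetch_requests(files: List[MapOutput]) -> int:
    """Each reducer issues one read per file, so count all blocks."""
    return sum(len(f) for f in files)


num_maps, num_reducers = 20, 4
outputs = [
    [f"m{m}p{p};".encode() for p in range(num_reducers)]
    for m in range(num_maps)
]
merged = merge_map_outputs(outputs, merge_factor=10)
print(fetch_requests(outputs), fetch_requests(merged))
```

In this toy setup, merging ten map outputs at a time cuts the number of reads by a factor of ten while the total bytes stay the same, so each read becomes correspondingly larger. A real service would also have to decide when merging is worth its overhead, which the abstract addresses by serving a mix of merged and unmerged files.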

Published In

EuroSys '18: Proceedings of the Thirteenth EuroSys Conference

April 2018

631 pages

ISBN: 9781450355841

DOI: 10.1145/3190508

General Chair: Rui Oliveira

Program Chairs: Pascal Felber, Y. Charlie Hu

Copyright © 2018 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

National Science Foundation

Conference

EuroSys '18: Thirteenth EuroSys Conference 2018

April 23 - 26, 2018

Porto, Portugal

Acceptance Rates

EuroSys '18 Paper Acceptance Rate 43 of 262 submissions, 16%;

Overall Acceptance Rate 241 of 1,308 submissions, 18%
