DOI: 10.1145/3190508.3190534
research-article
Open access

Riffle: optimized shuffle service for large-scale data analytics

Published: 23 April 2018

Abstract

The rapidly growing size of data and complexity of analytics present new challenges for large-scale data processing systems. Modern systems keep data partitions in memory for pipelined operators, and persist data to disk across stages with wide dependencies for fault tolerance. While processing can often scale well by splitting jobs into smaller tasks for better parallelism, all-to-all data transfers, called shuffle operations, become the scaling bottleneck when running many small tasks in multi-stage data analytics jobs. Our key observation is that this bottleneck is due to the superlinear increase in disk I/O operations as data volume increases.
We present Riffle, an optimized shuffle service for big-data analytics frameworks that significantly improves I/O efficiency and scales to process petabytes of data. To do so, Riffle efficiently merges fragmented intermediate shuffle files into larger block files, and thus converts small, random disk I/O requests into large, sequential ones. Riffle further improves performance and fault tolerance by mixing both merged and unmerged block files to minimize merge operation overhead. Using Riffle, Facebook production jobs on Spark clusters with over 1,000 executors experience up to a 10x reduction in the number of shuffle I/O requests and 40% improvement in the end-to-end job completion time.
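The merging idea described above can be illustrated with a minimal sketch. This is a toy in-memory model, not Riffle's implementation: it assumes each map task writes one shuffle file holding one block per reduce partition, and that a reducer issues one read request per (file, partition) pair. The names `merge_shuffle_files`, `count_fetch_requests`, `reducer_input`, and the `merge_factor` parameter are hypothetical and chosen only for illustration.

```python
from typing import List

# Hypothetical model: a shuffle file is a list of R partition blocks (bytes).
# Reducer r needs block r from every shuffle file.
ShuffleFile = List[bytes]


def merge_shuffle_files(files: List[ShuffleFile], merge_factor: int) -> List[ShuffleFile]:
    """Merge every `merge_factor` consecutive files into one larger file.

    Blocks that belong to the same reduce partition are concatenated, so a
    reducer later reads one large block instead of several small ones.
    """
    if merge_factor < 1:
        raise ValueError("merge_factor must be >= 1")
    merged: List[ShuffleFile] = []
    for start in range(0, len(files), merge_factor):
        group = files[start:start + merge_factor]
        num_partitions = len(group[0])
        merged.append(
            [b"".join(f[r] for f in group) for r in range(num_partitions)]
        )
    return merged


def count_fetch_requests(files: List[ShuffleFile]) -> int:
    """Assumed cost model: one read request per (file, partition) pair."""
    return sum(len(f) for f in files)


def reducer_input(files: List[ShuffleFile], r: int) -> bytes:
    """All bytes that reducer `r` reads, in file order."""
    return b"".join(f[r] for f in files)


if __name__ == "__main__":
    num_maps, num_reduces = 100, 20
    files = [
        [f"{m}-{r};".encode() for r in range(num_reduces)]
        for m in range(num_maps)
    ]
    merged = merge_shuffle_files(files, merge_factor=10)
    print("requests before merge:", count_fetch_requests(files))
    print("requests after merge: ", count_fetch_requests(merged))
```

Under this model, 100 map outputs and 20 reducers need 2,000 read requests before merging and 200 after merging groups of 10, while every reducer still receives the same bytes. The request count is the product of the number of files and the number of partitions, so if both grow with the data volume the count grows faster than linearly; merging shrinks the first factor.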


Published In

EuroSys '18: Proceedings of the Thirteenth EuroSys Conference
April 2018
631 pages
ISBN: 9781450355841
DOI: 10.1145/3190508
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. I/O optimization
  2. big-data analytics frameworks
  3. shuffle service
  4. storage

Qualifiers

  • Research-article

Conference

EuroSys '18
EuroSys '18: Thirteenth EuroSys Conference 2018
April 23 - 26, 2018
Porto, Portugal

Acceptance Rates

EuroSys '18 Paper Acceptance Rate 43 of 262 submissions, 16%;
Overall Acceptance Rate 241 of 1,308 submissions, 18%

Cited By

  • (2024) Towards Resource Efficiency: Practical Insights into Large-Scale Spark Workloads at ByteDance. Proceedings of the VLDB Endowment 17:12 (3759-3771). DOI: 10.14778/3685800.3685804. Online publication date: 1-Aug-2024
  • (2024) A Seer knows best. Journal of Parallel and Distributed Computing 183:C. DOI: 10.1016/j.jpdc.2023.104763. Online publication date: 1-Jan-2024
  • (2023) Exoshuffle: An Extensible Shuffle Architecture. Proceedings of the ACM SIGCOMM 2023 Conference (564-577). DOI: 10.1145/3603269.3604848. Online publication date: 10-Sep-2023
  • (2023) Out of Hand for Hardware? Within Reach for Software! Proceedings of the 19th Workshop on Hot Topics in Operating Systems (30-37). DOI: 10.1145/3593856.3595898. Online publication date: 22-Jun-2023
  • (2023) Towards Accelerating Data Intensive Application's Shuffle Process Using SmartNICs. Proceedings of the ACM on Measurement and Analysis of Computing Systems 7:2 (1-23). DOI: 10.1145/3589980. Online publication date: 22-May-2023
  • (2023) A Generic Service to Provide In-Network Aggregation for Key-Value Streams. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (33-47). DOI: 10.1145/3575693.3575708. Online publication date: 27-Jan-2023
  • (2023) A Utility-Based Distributed Pattern Mining Algorithm With Reduced Shuffle Overhead. IEEE Transactions on Parallel and Distributed Systems 34:1 (416-428). DOI: 10.1109/TPDS.2022.3221210. Online publication date: 1-Jan-2023
  • (2023) Architecture of Industrial Internet-Centric BPM. Intelligent Industrial Internet Systems (25-37). DOI: 10.1007/978-981-99-5732-3_2. Online publication date: 21-Nov-2023
  • (2022) New query optimization techniques in the Spark engine of Azure synapse. Proceedings of the VLDB Endowment 15:4 (936-948). DOI: 10.14778/3503585.3503601. Online publication date: 14-Apr-2022
  • (2022) Think before you shuffle. Proceedings of the International Workshop on Big Data in Emergent Distributed Environments (1-6). DOI: 10.1145/3530050.3532922. Online publication date: 12-Jun-2022
