Magnet: push-based shuffle service for large-scale data processing

Published: 01 August 2020

Abstract

Over the past decade, Apache Spark has become a popular compute engine for large-scale data processing. As in other compute engines based on the MapReduce paradigm, the shuffle operation, namely the all-to-all transfer of intermediate data, plays an important role in Spark. At LinkedIn, with the rapid growth of data size and the scale of the Spark deployment, the shuffle operation has become a bottleneck to further scaling the infrastructure, leading to overall job slowness and even failures for long-running jobs. This hurts the productivity of developers, who must address such slowness and failures, and drives up the operational cost of the infrastructure.
In this work, we describe the main bottlenecks impacting shuffle scalability. We propose Magnet, a novel shuffle mechanism that scales to petabytes of daily shuffled data and clusters with thousands of nodes. Magnet is designed to work with both on-prem and cloud-based cluster deployments. It addresses a key shuffle scalability bottleneck by merging fragmented intermediate shuffle data into large blocks, and provides further improvements by co-locating merged blocks with the reduce tasks. Our benchmarks show that Magnet significantly improves shuffle performance independent of the underlying hardware. Magnet reduces the end-to-end runtime of LinkedIn's production Spark jobs by nearly 30%. Furthermore, Magnet improves user productivity by removing the shuffle-related tuning burden from users.



Published In

Proceedings of the VLDB Endowment, Volume 13, Issue 12, August 2020
ISSN: 2150-8097

Publisher

VLDB Endowment


View Options

Get Access

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media