Magnet: push-based shuffle service for large-scale data processing

Published: 01 August 2020

Abstract

Over the past decade, Apache Spark has become a popular compute engine for large-scale data processing. As in other compute engines based on the MapReduce paradigm, the shuffle operation, namely the all-to-all transfer of intermediate data, plays an important role in Spark. At LinkedIn, with the rapid growth of data size and the scale of the Spark deployment, the shuffle operation has become a bottleneck to further scaling the infrastructure, leading to overall job slowness and even failures for long-running jobs. This hurts the productivity of developers, who must address such slowness and failures, and drives up the operational cost of the infrastructure.
In this work, we describe the main bottlenecks impacting shuffle scalability. We propose Magnet, a novel shuffle mechanism that scales to petabytes of daily shuffled data and clusters with thousands of nodes. Magnet is designed to work with both on-prem and cloud-based cluster deployments. It addresses a key shuffle scalability bottleneck by merging fragmented intermediate shuffle data into large blocks, and provides further improvements by co-locating merged blocks with the reduce tasks. Our benchmarks show that Magnet significantly improves shuffle performance independent of the underlying hardware. Magnet reduces the end-to-end runtime of LinkedIn's production Spark jobs by nearly 30%. Furthermore, Magnet improves user productivity by removing the shuffle-related tuning burden from users.



Published In

Proceedings of the VLDB Endowment, Volume 13, Issue 12, August 2020
ISSN: 2150-8097

Publisher

VLDB Endowment


View Options

Get Access

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media