DOI: 10.5555/3433701.3433748

CAB-MPI: exploring interprocess work-stealing towards balanced MPI communication

Published: 09 November 2020

Abstract

Load balance is essential for high-performance applications. Unbalanced communication can cause severe performance degradation, even in computation-balanced BSP applications. Designing communication-balanced applications is challenging, however, because of the diverse communication implementations in the underlying runtime system. In this paper, we address this challenge through an interprocess work-stealing scheme based on process-memory-sharing techniques. We present CAB-MPI, an MPI implementation that can identify idle processes inside MPI and use these idle resources to dynamically balance the communication workload on the node. We design throughput-optimized strategies to ensure efficient stealing of data movement tasks. We demonstrate the benefit of work stealing for several internal operations in MPI, including intranode data transfer, pack/unpack for noncontiguous communication, and computation in one-sided accumulates. The implementation is evaluated with a set of microbenchmarks and proxy applications on Intel Xeon and Xeon Phi platforms.
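
To make the scheme concrete, the sketch below (plain C, POSIX shared memory) illustrates the general pattern of interprocess work stealing for intranode data movement: a busy "victim" process posts copy tasks into a queue that lives in memory shared with its peers, and an idle "thief" process steals and executes them. This is a minimal illustration under simplifying assumptions, not CAB-MPI's implementation; the names copy_task and task_pool, the single mutex-protected queue, and the fork()-based setup are hypothetical, whereas the paper builds on process-memory-sharing techniques inside MPI and throughput-optimized stealing strategies.

/*
 * Hedged sketch, not CAB-MPI itself: a victim process posts intranode copy
 * tasks into a queue kept in process-shared memory, and an otherwise idle
 * thief process steals and executes them with memcpy.
 * Build (Linux): gcc -o steal steal.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NTASKS 4
#define CHUNK  (1 << 20)              /* 1 MiB moved per task */

typedef struct {                      /* one data-movement task */
    size_t src_off, dst_off, len;
} copy_task;

typedef struct {
    pthread_mutex_t lock;             /* process-shared queue lock */
    int head, tail;                   /* tasks posted vs. consumed */
    copy_task tasks[NTASKS];
    char data[2 * NTASKS * CHUNK];    /* source and destination buffers */
} task_pool;

int main(void)
{
    /* Anonymous shared mapping: visible to both processes after fork(). */
    task_pool *pool = mmap(NULL, sizeof *pool, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&pool->lock, &attr);

    if (fork() == 0) {                /* thief: an idle process */
        for (int done = 0; done < NTASKS; ) {
            pthread_mutex_lock(&pool->lock);
            if (pool->tail < pool->head) {        /* steal a pending task */
                copy_task t = pool->tasks[pool->tail++];
                pthread_mutex_unlock(&pool->lock);
                memcpy(pool->data + t.dst_off, pool->data + t.src_off, t.len);
                done++;
            } else {
                pthread_mutex_unlock(&pool->lock);
            }
        }
        _exit(0);
    }

    /* Victim: a busy process posts its copies instead of doing them inline,
     * freeing itself to continue with other communication work. */
    for (int i = 0; i < NTASKS; i++) {
        memset(pool->data + (size_t)i * CHUNK, i + 1, CHUNK);
        pthread_mutex_lock(&pool->lock);
        pool->tasks[pool->head++] = (copy_task){
            .src_off = (size_t)i * CHUNK,
            .dst_off = (size_t)(NTASKS + i) * CHUNK,
            .len     = CHUNK,
        };
        pthread_mutex_unlock(&pool->lock);
    }
    wait(NULL);                       /* all tasks were executed by the thief */
    printf("byte copied by thief: %d\n", pool->data[(size_t)NTASKS * CHUNK]);
    return 0;
}

The stolen tasks need not be plain copies: as the abstract notes, pack/unpack for noncontiguous datatypes and the computation in one-sided accumulates can be queued and offloaded to idle processes in the same way.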


Published In

SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2020
1454 pages
ISBN: 9781728199986

In-Cooperation

  • IEEE CS

Publisher

IEEE Press


Author Tags

  1. MPI
  2. communication
  3. load balance
  4. work stealing

Qualifiers

  • Research-article

Conference

SC '20

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
