DOI: 10.5555/3433701.3433748

CAB-MPI: exploring interprocess work-stealing towards balanced MPI communication

Published: 09 November 2020

Abstract

Load balance is essential for high-performance applications. Unbalanced communication can cause severe performance degradation, even in computation-balanced BSP applications. Designing communication-balanced applications is challenging, however, because of the diverse communication implementations in the underlying runtime system. In this paper, we address this challenge through an interprocess work-stealing scheme based on process-memory-sharing techniques. We present CAB-MPI, an MPI implementation that can identify idle processes inside MPI and use these idle resources to dynamically balance the communication workload on the node. We design throughput-optimized strategies to ensure efficient stealing of data movement tasks. We demonstrate the benefit of work stealing for several internal operations in MPI, including intranode data transfer, pack/unpack for noncontiguous communication, and computation in one-sided accumulates. The implementation is evaluated with a set of microbenchmarks and proxy applications on Intel Xeon and Xeon Phi platforms.
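
To make the scheme concrete, the sketch below (plain C, POSIX shared memory) illustrates the general pattern of interprocess work stealing for intranode data movement: a busy "victim" process posts copy tasks into a queue that lives in memory shared with its peers, and an idle "thief" process steals and executes them. This is a minimal illustration under simplifying assumptions, not CAB-MPI's implementation; the names copy_task and task_pool, the single mutex-protected queue, and the fork()-based setup are hypothetical, whereas the paper builds on process-memory-sharing techniques inside MPI and throughput-optimized stealing strategies.

/*
 * Hedged sketch, not CAB-MPI itself: a victim process posts intranode copy
 * tasks into a queue kept in process-shared memory, and an otherwise idle
 * thief process steals and executes them with memcpy.
 * Build (Linux): gcc -o steal steal.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NTASKS 4
#define CHUNK  (1 << 20)              /* 1 MiB moved per task */

typedef struct {                      /* one data-movement task */
    size_t src_off, dst_off, len;
} copy_task;

typedef struct {
    pthread_mutex_t lock;             /* process-shared queue lock */
    int head, tail;                   /* tasks posted vs. consumed */
    copy_task tasks[NTASKS];
    char data[2 * NTASKS * CHUNK];    /* source and destination buffers */
} task_pool;

int main(void)
{
    /* Anonymous shared mapping: visible to both processes after fork(). */
    task_pool *pool = mmap(NULL, sizeof *pool, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&pool->lock, &attr);

    if (fork() == 0) {                /* thief: an idle process */
        for (int done = 0; done < NTASKS; ) {
            pthread_mutex_lock(&pool->lock);
            if (pool->tail < pool->head) {        /* steal a pending task */
                copy_task t = pool->tasks[pool->tail++];
                pthread_mutex_unlock(&pool->lock);
                memcpy(pool->data + t.dst_off, pool->data + t.src_off, t.len);
                done++;
            } else {
                pthread_mutex_unlock(&pool->lock);
            }
        }
        _exit(0);
    }

    /* Victim: a busy process posts its copies instead of doing them inline,
     * freeing itself to continue with other communication work. */
    for (int i = 0; i < NTASKS; i++) {
        memset(pool->data + (size_t)i * CHUNK, i + 1, CHUNK);
        pthread_mutex_lock(&pool->lock);
        pool->tasks[pool->head++] = (copy_task){
            .src_off = (size_t)i * CHUNK,
            .dst_off = (size_t)(NTASKS + i) * CHUNK,
            .len     = CHUNK,
        };
        pthread_mutex_unlock(&pool->lock);
    }
    wait(NULL);                       /* all tasks were executed by the thief */
    printf("byte copied by thief: %d\n", pool->data[(size_t)NTASKS * CHUNK]);
    return 0;
}

The stolen tasks need not be plain copies: as the abstract notes, pack/unpack for noncontiguous datatypes and the computation in one-sided accumulates can be queued and offloaded to idle processes in the same way.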


Published In

SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2020
1454 pages
ISBN: 9781728199986

In-Cooperation

  • IEEE CS

Publisher

IEEE Press


Author Tags

  1. MPI
  2. communication
  3. load balance
  4. work stealing

Qualifiers

  • Research-article

Conference

SC '20

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
