Lock Contention Management in Multithreaded MPI

Published: 08 January 2019

Abstract

In this article, we investigate contention management in lock-based thread-safe MPI libraries. Specifically, we make two assumptions: (1) locks are the only form of synchronization when protecting communication paths; and (2) contention occurs, and thus serialization is unavoidable. Our work classifies lock acquisitions by the work performed inside the critical section: productive versus unproductive. Waiting for message reception without doing anything else inside a critical section is an example of unproductive lock acquisition. We show that the high-throughput nature of modern scalable locking protocols translates into better communication progress for throughput-intensive MPI communication but negatively impacts latency-sensitive communication because of overzealous unproductive lock acquisition. To reduce unproductive lock acquisitions, we devised a method that promotes threads with productive work using a generic two-level priority locking protocol. Our results show that using a high-throughput protocol for productive work and a fair protocol for less productive code paths ensures the best tradeoff for fine-grained communication, whereas a fair protocol is sufficient for more coarse-grained communication. Although these efforts have been rewarding, scalability degradation remains significant. We discuss techniques that diverge from the pure locking model and offer the potential to further improve scalability.
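
To make the idea of a two-level priority lock concrete, the following is a minimal sketch and not the article's implementation. It assumes a single condition variable, an unordered high-priority path standing in for the high-throughput protocol, and a FIFO ticket order on the low-priority path standing in for the fair protocol. The class name `TwoLevelPriorityLock`, its methods, and the helper `demo_order` are hypothetical names chosen for illustration.

```python
import threading
import time


class TwoLevelPriorityLock:
    """Illustrative sketch: high-priority waiters always go before low-priority ones."""

    def __init__(self):
        self._cond = threading.Condition()
        self._held = False
        self._high_waiting = 0      # threads queued on the high-priority path
        self._low_next_ticket = 0   # next ticket handed to a low-priority thread
        self._low_now_serving = 0   # ticket currently allowed to take the lock

    def acquire_high(self):
        # Productive work (assumption: e.g. issuing an operation): no ordering
        # among high-priority threads, whichever wakes first takes the lock.
        with self._cond:
            self._high_waiting += 1
            while self._held:
                self._cond.wait()
            self._high_waiting -= 1
            self._held = True

    def acquire_low(self):
        # Less productive work (assumption: e.g. polling for completion):
        # FIFO among low-priority threads, and only when no high-priority
        # thread is waiting.
        with self._cond:
            ticket = self._low_next_ticket
            self._low_next_ticket += 1
            while (self._held or self._high_waiting > 0
                   or ticket != self._low_now_serving):
                self._cond.wait()
            self._low_now_serving += 1
            self._held = True

    def release(self):
        with self._cond:
            self._held = False
            self._cond.notify_all()

    def queued(self):
        # Returns (high-priority waiters, low-priority waiters); used by the demo.
        with self._cond:
            return (self._high_waiting,
                    self._low_next_ticket - self._low_now_serving)


def demo_order():
    """Queue a low-priority thread first, then a high-priority one, and
    return the order in which they enter the critical section."""
    lock = TwoLevelPriorityLock()
    order = []

    def worker(name, acquire):
        acquire()
        order.append(name)
        lock.release()

    lock.acquire_high()  # hold the lock so both workers have to queue
    low = threading.Thread(target=worker, args=("low", lock.acquire_low))
    low.start()
    while lock.queued()[1] < 1:   # wait until the low-priority thread is queued
        time.sleep(0.001)
    high = threading.Thread(target=worker, args=("high", lock.acquire_high))
    high.start()
    while lock.queued()[0] < 1:   # wait until the high-priority thread is queued
        time.sleep(0.001)
    lock.release()
    low.join()
    high.join()
    return order
```

In this sketch, `demo_order()` returns `["high", "low"]` even though the low-priority thread queued first, which is the promotion of productive work that the abstract describes. A production MPI runtime would more likely build such a lock from spin-based primitives rather than a condition variable; that choice here is purely for brevity.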

Cited By

  • (2024) X-OpenMP — eXtreme fine-grained tasking using lock-less work stealing. Future Generation Computer Systems 159:C (444-458). DOI: 10.1016/j.future.2024.05.019. Online publication date: 1-Oct-2024
  • (2022) A Survey on Minimizing Lock Contention in Shared Resources in Linux Kernel. 2022 13th International Conference on Information and Communication Technology Convergence (ICTC) (1133-1135). DOI: 10.1109/ICTC55196.2022.9952854. Online publication date: 19-Oct-2022
  • (2021) Finer-LRU: A Scalable Page Management Scheme for HPC Manycore Architectures. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (567-576). DOI: 10.1109/IPDPS49936.2021.00065. Online publication date: May-2021

Published In

ACM Transactions on Parallel Computing, Volume 5, Issue 3
September 2018
89 pages
ISSN:2329-4949
EISSN:2329-4957
DOI:10.1145/3305217
  • Editor: David Bader
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 January 2019
Accepted: 01 July 2018
Revised: 01 May 2018
Received: 01 May 2016
Published in TOPC Volume 5, Issue 3

Author Tags

  1. MPI
  2. critical section
  3. runtime contention
  4. threads

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Exascale Computing Project
  • Science Technology and Innovation Committee of Shenzhen Municipality
  • JSPS KAKENHI
  • U.S. Department of Energy Office of Science
  • National Nuclear Security Administration
