Mitigating the NUMA effect on task-based runtime systems

Maroñas, Marcos; Navarro, Antoni; Ayguadé, Eduard; Beltran, Vicenç

doi:10.1007/s11227-023-05164-9

Published: 06 April 2023

The Journal of Supercomputing, Volume 79, pages 14287–14312 (2023)

Marcos Maroñas^1,2, Antoni Navarro^1,2, Eduard Ayguadé^1,2 & Vicenç Beltran^1

Abstract

Processors with multiple sockets or chiplets are becoming increasingly common. Such processors usually expose a single shared address space. However, due to hardware restrictions, they adopt a NUMA organization, in which each processor accesses its local memory faster than remote memory. Reducing data motion is therefore crucial to overall performance, so computations must run as close as possible to where their data resides. We propose a new approach to mitigate the NUMA effect. Our solution is based on OmpSs-2, a task-based parallel programming model similar to OpenMP. We first provide a simple API to allocate memory in NUMA systems using different policies. Then, by combining user-given information that specifies dependences between tasks with information collected in a global directory when data is allocated, we extend our runtime library to perform NUMA-aware work scheduling. Our heuristic considers data location, the distance between NUMA nodes, and the load of each NUMA node to seamlessly minimize data motion costs and load imbalance. Our evaluation shows that our NUMA support can significantly mitigate the NUMA effect by reducing the number of remote accesses, thereby improving performance on most benchmarks, reaching up to 2x speedup on a 2-NUMA machine and up to 7.1x on an 8-NUMA machine.
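As an illustration of the kind of heuristic the abstract describes, the following is a minimal sketch of a scheduler decision that weighs data location, inter-node distance, and node load. It is not the scoring function used by the runtime, which the abstract does not give. The names `Task`, `pick_numa_node`, and `load_weight`, the linear cost formula, and the sample distance matrix are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Task:
    # Hypothetical summary of a task's dependences: how many bytes of its
    # data reside on each NUMA node (as a directory lookup might report).
    bytes_per_node: Sequence[int]


def pick_numa_node(task: Task,
                   distance: Sequence[Sequence[float]],
                   load: Sequence[float],
                   load_weight: float = 1.0) -> int:
    """Return the index of the NUMA node with the lowest estimated cost.

    Assumed cost model (illustrative only):
      cost(c) = sum over home nodes h of bytes[h] * distance[c][h]
                + load_weight * load[c]
    """
    num_nodes = len(distance)
    best_node, best_cost = 0, float("inf")
    for candidate in range(num_nodes):
        # Data-motion term: bytes weighted by how far they are from the candidate.
        motion = sum(task.bytes_per_node[home] * distance[candidate][home]
                     for home in range(num_nodes))
        # Load term: penalize nodes that already have a lot of queued work.
        cost = motion + load_weight * load[candidate]
        if cost < best_cost:  # strict comparison: ties go to the lowest index
            best_node, best_cost = candidate, cost
    return best_node


if __name__ == "__main__":
    # Two-node example with made-up distances (10 = local, 21 = remote).
    distance = [[10, 21], [21, 10]]
    task = Task(bytes_per_node=[0, 4096])
    print(pick_numa_node(task, distance, load=[0, 0]))
    print(pick_numa_node(task, distance, load=[0, 100], load_weight=1000.0))
```

In this sketch an idle system sends the task to the node that holds its data, while a sufficiently loaded home node makes a remote node cheaper. This mirrors the trade-off between data motion and load imbalance mentioned above. How `load_weight` should be chosen is left open here.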

Data Availability

The software used in this work is publicly accessible. The results/data/figures in this manuscript have not been published elsewhere, nor are they under consideration by another publisher.

Notes

N is the number of NUMA nodes available in the system.

Funding

This research has received funding from the European Union’s Horizon 2020/EuroHPC research and innovation programme under grant agreement No 955606 (DEEP-SEA), project PCI2021-121958 financed by the Spanish State Research Agency - Ministry of Science and Innovation, Generalitat de Catalunya (contract 2021-SGR-01007), the Spanish Ministry of Science and Technology (contract PID2019-107255GB), and Severo Ochoa (CEX2021-001148-S / MCIN/AEI/10.13039/501100011033).

Author information

Authors and Affiliations

Barcelona Supercomputing Center, Barcelona, Spain
Marcos Maroñas, Antoni Navarro, Eduard Ayguadé & Vicenç Beltran
Universitat Politècnica de Catalunya, Barcelona, Spain
Marcos Maroñas, Antoni Navarro & Eduard Ayguadé

Contributions

All of the material is owned by the authors and/or no permissions are required.

Corresponding author

Correspondence to Marcos Maroñas.

Ethics declarations

Conflict of interest

I declare that the authors have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Consent for Publication

I confirm that I understand that The Journal of Supercomputing is a transformative journal. When research is accepted for publication, there is a choice to publish using either immediate gold open access or the traditional publishing route.

Ethics Approval and Consent to Participate

I have read the Springer journal policies on author responsibilities and submit this manuscript in accordance with those policies.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Maroñas, M., Navarro, A., Ayguadé, E. et al. Mitigating the NUMA effect on task-based runtime systems. J Supercomput 79, 14287–14312 (2023). https://doi.org/10.1007/s11227-023-05164-9

Accepted: 04 March 2023
Published: 06 April 2023
Issue Date: September 2023
DOI: https://doi.org/10.1007/s11227-023-05164-9
