Abstract
Processors with multiple sockets or chiplets are becoming increasingly common. These processors usually expose a single shared address space. However, due to hardware constraints, they follow a non-uniform memory access (NUMA) design, in which each processor accesses its local memory faster than remote memory. Reducing data motion is therefore crucial to overall performance, so computations should run as close as possible to where their data resides. We propose a new approach that mitigates the NUMA effect in task-based runtime systems. Our solution is based on OmpSs-2, a task-based parallel programming model similar to OpenMP. We first provide a simple API to allocate memory in NUMA systems using different policies. Then, by combining user-provided information that specifies dependences between tasks with information collected in a global directory when data is allocated, we extend our runtime library to perform NUMA-aware work scheduling. Our heuristic considers data location, the distance between NUMA nodes, and the load of each NUMA node to transparently minimize data motion costs and load imbalance. Our evaluation shows that our NUMA support significantly mitigates the NUMA effect by reducing the number of remote accesses, improving performance on most benchmarks and reaching up to 2x speedup on a 2-NUMA machine and up to 7.1x on an 8-NUMA machine.
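The interplay between the allocation directory and the scheduling heuristic can be illustrated with a minimal Python sketch. It is not the OmpSs-2 or Nanos6 implementation: the `Directory` class, the policy names `"interleaved"` and `"blocked"`, the `pick_node` function, the cost formula, and the `load_weight` parameter are all hypothetical and chosen only for illustration. The sketch assumes a distance matrix in the style reported by Linux (for example 10 for local and 21 for remote accesses).

```python
from dataclasses import dataclass, field


@dataclass
class Directory:
    """Hypothetical global directory: records the home NUMA node of each block."""
    home: dict = field(default_factory=dict)  # (region name, block index) -> node id

    def allocate(self, name, num_blocks, num_nodes, policy="interleaved"):
        """Register num_blocks blocks of a region under an assumed placement policy."""
        for b in range(num_blocks):
            if policy == "interleaved":
                node = b % num_nodes                 # round-robin across nodes
            elif policy == "blocked":
                node = b * num_nodes // num_blocks   # contiguous chunk per node
            else:
                raise ValueError(f"unknown policy: {policy}")
            self.home[(name, b)] = node


def pick_node(task_blocks, block_bytes, directory, distance, load, load_weight):
    """Return the NUMA node with the lowest estimated cost for a task.

    The cost is an assumed combination of distance-weighted remote traffic
    and the current load of the candidate node.
    """
    num_nodes = len(distance)

    # Data location: bytes the task touches on each home node.
    bytes_on = [0] * num_nodes
    for blk in task_blocks:
        bytes_on[directory.home[blk]] += block_bytes

    def cost(node):
        # Distance: extra cost of a remote access relative to a local one.
        traffic = sum(
            bytes_on[h] * (distance[node][h] - distance[node][node])
            for h in range(num_nodes)
        )
        # Load: penalize nodes that already have many pending tasks.
        return traffic + load_weight * load[node]

    return min(range(num_nodes), key=cost)


if __name__ == "__main__":
    directory = Directory()
    directory.allocate("A", num_blocks=4, num_nodes=2, policy="blocked")
    distance = [[10, 21], [21, 10]]
    task = [("A", 2), ("A", 3)]  # dependences declared by the user
    print(pick_node(task, 1, directory, distance, load=[0, 0], load_weight=5))
    print(pick_node(task, 1, directory, distance, load=[0, 10], load_weight=5))
```

In this sketch, the task is placed on the node that holds its data when both nodes are idle, and it moves to the other node once the load penalty outweighs the estimated cost of the remote accesses. How these factors are actually weighted in the runtime is not specified in the abstract.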
Data Availability
The software used in this work is publicly accessible. The results/data/figures in this manuscript have not been published elsewhere, nor are they under consideration by another publisher.
Notes
N is the number of NUMA nodes available in the system.
Funding
This research has received funding from the European Union’s Horizon 2020/EuroHPC research and innovation programme under grant agreement No 955606 (DEEP-SEA), project PCI2021-121958 financed by the Spanish State Research Agency - Ministry of Science and Innovation, Generalitat de Catalunya (contract 2021-SGR-01007), the Spanish Ministry of Science and Technology (contract PID2019-107255GB), and Severo Ochoa (CEX2021-001148-S / MCIN/AEI/10.13039/501100011033).
Author information
Contributions
All of the material is owned by the authors and/or no permissions are required.
Ethics declarations
Conflict of interest
I declare that the authors have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.
Consent for Publication
I confirm that I understand that The Journal of Supercomputing is a transformative journal. When research is accepted for publication, there is a choice to publish using either immediate gold open access or the traditional publishing route.
Ethics Approval and Consent to Participate
I have read the Springer journal policies on author responsibilities and submit this manuscript in accordance with those policies.
About this article
Cite this article
Maroñas, M., Navarro, A., Ayguadé, E. et al. Mitigating the NUMA effect on task-based runtime systems. J Supercomput 79, 14287–14312 (2023). https://doi.org/10.1007/s11227-023-05164-9
DOI: https://doi.org/10.1007/s11227-023-05164-9