Mitigating the NUMA effect on task-based runtime systems

Authors:

Marcos Maroñas,

Antoni Navarro,

Eduard Ayguadé, and

Vicenç Beltran

The Journal of Supercomputing, Volume 79, Issue 13

Pages 14287 - 14312

https://doi.org/10.1007/s11227-023-05164-9

Published: 06 April 2023

Abstract

Processors with multiple sockets or chiplets are becoming increasingly common. Such processors usually expose a single shared address space. However, due to hardware constraints, they adopt a non-uniform memory access (NUMA) design, in which each processor accesses its local memory faster than remote memory. Reducing data motion is crucial to overall performance, so computations must run as close as possible to where their data resides. We propose a new approach that mitigates the NUMA effect on such systems. Our solution is based on OmpSs-2, a task-based parallel programming model similar to OpenMP. We first provide a simple API to allocate memory on NUMA systems using different policies. Then, combining user-provided information that specifies dependences between tasks with information collected in a global directory when data is allocated, we extend our runtime library to perform NUMA-aware work scheduling. Our heuristic considers data location, the distance between NUMA nodes, and the load of each NUMA node to minimize data motion costs and load imbalance transparently. Our evaluation shows that our NUMA support significantly mitigates the NUMA effect by reducing the number of remote accesses, thereby improving performance on most benchmarks, with speedups of up to 2x on a 2-NUMA machine and up to 7.1x on an 8-NUMA machine.
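The abstract names three inputs to the scheduling heuristic (data location, distance between NUMA nodes, and per-node load) but does not give its formula. The following is a minimal sketch of how such inputs could be combined, not the authors' implementation or the Nanos6 API. The function `pick_numa_node`, its parameters, the distance-weighted cost, and the `max_imbalance` bound are all assumptions made for illustration.

```python
from typing import Dict, List


def pick_numa_node(
    bytes_per_node: Dict[int, int],
    distance: List[List[int]],
    load: List[int],
    max_imbalance: float = 1.5,
) -> int:
    """Pick a NUMA node for one task (hypothetical heuristic).

    bytes_per_node: bytes of the task's data homed on each node, as a
                    directory lookup might report them (assumed input).
    distance:       square matrix, distance[a][b] = cost of node a
                    reaching memory on node b.
    load:           number of ready tasks already queued on each node.
    max_imbalance:  a node is skipped when its load exceeds this factor
                    times the mean load (assumed tuning knob).
    """
    num_nodes = len(distance)
    mean_load = sum(load) / num_nodes

    def motion_cost(candidate: int) -> int:
        # Each byte is weighted by how far it lives from the candidate.
        return sum(
            size * distance[candidate][home]
            for home, size in bytes_per_node.items()
        )

    # Cheapest data motion first; ties go to the less loaded node.
    ranked = sorted(range(num_nodes), key=lambda n: (motion_cost(n), load[n]))

    bound = max_imbalance * max(mean_load, 1.0)
    for node in ranked:
        if load[node] <= bound:
            return node

    # Only reachable when max_imbalance < 1: take the least loaded node.
    return min(range(num_nodes), key=lambda n: load[n])


if __name__ == "__main__":
    # Two nodes with made-up distances (local 10, remote 21).
    dist = [[10, 21], [21, 10]]
    data = {0: 4096, 1: 1024}  # most of the task's data is on node 0
    print(pick_numa_node(data, dist, load=[1, 1]))
    print(pick_numa_node(data, dist, load=[12, 0]))
```

With balanced queues the sketch follows the data to node 0. When node 0 is far busier than the mean, it gives up locality and returns node 1, which is the trade-off between data motion and load imbalance that the abstract describes.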

Published In

The Journal of Supercomputing Volume 79, Issue 13

Sep 2023

1305 pages

ISSN:0920-8542

© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 06 April 2023

Accepted: 04 March 2023

Qualifiers

Research-article

Funding Sources

DEEP-SEA
