Mitigating the NUMA effect on task-based runtime systems

The Journal of Supercomputing

Abstract

Processors with multiple sockets or chiplets are becoming increasingly common. These processors usually expose a single shared address space. However, due to hardware restrictions, they adopt a non-uniform memory access (NUMA) design, in which each processor accesses its local memory faster than remote memory. Reducing data motion is crucial to improving overall performance, so computations must run as close as possible to where their data resides. We propose a new approach that mitigates the NUMA effect on such systems. Our solution is based on OmpSs-2, a task-based parallel programming model similar to OpenMP. We first provide a simple API to allocate memory in NUMA systems using different policies. Then, combining user-given information that specifies dependences between tasks with information collected in a global directory when allocating data, we extend our runtime library to perform NUMA-aware work scheduling. Our heuristic considers data location, the distance between NUMA nodes, and the load of each NUMA node to seamlessly minimize data motion costs and load imbalance. Our evaluation shows that our NUMA support can significantly mitigate the NUMA effect by reducing the number of remote accesses, thereby improving performance on most benchmarks, reaching up to 2x speedup on a 2-NUMA machine and up to 7.1x on an 8-NUMA machine.
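The abstract names three inputs to the scheduling heuristic (data location, distance between NUMA nodes, and per-node load) but does not give its formula. The following is a minimal illustrative sketch of how such inputs could be combined, not the authors' heuristic: the function name pick_numa_node, the linear cost function, and the load_weight parameter are all assumptions introduced here for illustration. The distance values follow the convention used by tools such as numactl, where a local access is reported as 10.

```python
from typing import Sequence


def pick_numa_node(
    bytes_per_node: Sequence[int],
    distance: Sequence[Sequence[int]],
    load: Sequence[int],
    load_weight: float = 1.0,
) -> int:
    """Return the NUMA node with the lowest estimated cost for one task.

    Hypothetical cost model (an assumption, not taken from the paper):
        cost(n) = sum_m bytes_per_node[m] * distance[n][m]
                  + load_weight * load[n]
    bytes_per_node[m] is how much of the task's data lives on node m,
    distance[n][m] is the relative access cost from node n to node m,
    and load[n] is the number of tasks already queued on node n.
    """
    n_nodes = len(distance)
    best_node, best_cost = 0, float("inf")
    for node in range(n_nodes):
        # Estimated cost of touching all of the task's data from this node.
        traffic = sum(
            bytes_per_node[m] * distance[node][m] for m in range(n_nodes)
        )
        # Penalize nodes that already have many queued tasks.
        cost = traffic + load_weight * load[node]
        if cost < best_cost:
            best_node, best_cost = node, cost
    return best_node


# Two-node example: local distance 10, remote distance 21.
DISTANCE = [[10, 21], [21, 10]]

# Most of the data is on node 0 and both nodes are idle.
idle_choice = pick_numa_node([4096, 1024], DISTANCE, [0, 0])

# Same data placement, but node 0 already has 8 queued tasks and
# load is weighted heavily.
busy_choice = pick_numa_node([4096, 1024], DISTANCE, [8, 0], load_weight=10000.0)
```

With both nodes idle, the sketch places the task on node 0, where most of its data resides. Once node 0 is heavily loaded and load is weighted strongly enough, it prefers node 1, trading some remote accesses for better load balance. How the real runtime weighs these terms is described in the full article, not in this preview.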

Data Availability

The software used in this work is publicly accessible. The results/data/figures in this manuscript have not been published elsewhere, nor are they under consideration by another publisher.

Notes

  1. N is the number of NUMA nodes available in the system.

Funding

This research has received funding from the European Union’s Horizon 2020/EuroHPC research and innovation programme under grant agreement No 955606 (DEEP-SEA), project PCI2021-121958 financed by the Spanish State Research Agency - Ministry of Science and Innovation, Generalitat de Catalunya (contract 2021-SGR-01007), the Spanish Ministry of Science and Technology (contract PID2019-107255GB), and Severo Ochoa (CEX2021-001148-S / MCIN/AEI/10.13039/501100011033).

Author information

Contributions

All of the material is owned by the authors and/or no permissions are required.

Corresponding author

Correspondence to Marcos Maroñas.

Ethics declarations

Conflict of interest

I declare that the authors have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Consent for Publication

I confirm that I understand that The Journal of Supercomputing is a transformative journal. When research is accepted for publication, there is a choice to publish using either immediate gold open access or the traditional publishing route.

Ethics Approval and Consent to Participate

I have read the Springer journal policies on author responsibilities and submit this manuscript in accordance with those policies.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Maroñas, M., Navarro, A., Ayguadé, E. et al. Mitigating the NUMA effect on task-based runtime systems. J Supercomput 79, 14287–14312 (2023). https://doi.org/10.1007/s11227-023-05164-9

  • DOI: https://doi.org/10.1007/s11227-023-05164-9
