Mitigating the NUMA effect on task-based runtime systems

Published: 06 April 2023

    Abstract

    Processors with multiple sockets or chiplets are becoming increasingly common. These processors usually expose a single shared address space. However, due to hardware restrictions, they adopt a NUMA (non-uniform memory access) design, in which each processor accesses its local memory faster than remote memory. Reducing data motion is crucial to improving overall performance, so computations must run as close as possible to where their data resides. We propose a new approach that mitigates the NUMA effect on such systems. Our solution is based on OmpSs-2, a task-based parallel programming model similar to OpenMP. We first provide a simple API to allocate memory in NUMA systems using different policies. Then, combining user-provided information that specifies dependences between tasks with information collected in a global directory when data is allocated, we extend our runtime library to perform NUMA-aware work scheduling. Our heuristic considers data location, the distance between NUMA nodes, and the load of each NUMA node to seamlessly minimize data motion costs and load imbalance. Our evaluation shows that our NUMA support can significantly mitigate the NUMA effect by reducing the number of remote accesses, thereby improving performance on most benchmarks, reaching up to 2x speedup on a 2-NUMA machine and up to 7.1x on an 8-NUMA machine.
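
The abstract names three inputs to the scheduling heuristic (data location, distance between NUMA nodes, and per-node load) but does not give its formula. The following is only an illustrative sketch of how such inputs could be combined, not the heuristic implemented in the paper or in Nanos6. The function `pick_numa_node`, its parameters, the additive cost model, and `load_weight` are all hypothetical names and assumptions introduced here. The distance values follow the convention of the Linux NUMA distance table, where a node's distance to itself is 10.

```python
from typing import Sequence


def pick_numa_node(
    bytes_per_node: Sequence[int],
    distance: Sequence[Sequence[int]],
    load: Sequence[int],
    load_weight: float = 1.0,
) -> int:
    """Return the NUMA node with the lowest estimated cost for one task.

    bytes_per_node[n]: bytes of the task's data resident on node n
                       (assumed to come from a directory filled at allocation).
    distance[a][b]:    relative access cost from node a to memory on node b.
    load[n]:           number of tasks already queued on node n.
    load_weight:       hypothetical tuning knob trading locality for balance.
    """
    num_nodes = len(load)
    best_node, best_cost = 0, float("inf")
    for candidate in range(num_nodes):
        # Data-motion term: every byte is weighted by how far it sits
        # from the candidate node.
        motion = sum(
            bytes_per_node[src] * distance[candidate][src]
            for src in range(num_nodes)
        )
        # Load term: penalize nodes that already have many queued tasks.
        cost = motion + load_weight * load[candidate]
        if cost < best_cost:  # strict comparison: ties go to the lowest index
            best_node, best_cost = candidate, cost
    return best_node
```

With a two-node distance table `[[10, 21], [21, 10]]` and most of a task's data on node 0, this sketch picks node 0 when both nodes are idle. It moves the task to node 1 only once node 0's queue is long enough for the weighted load term to exceed the extra data-motion cost.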

    Information

    Published In

    The Journal of Supercomputing, Volume 79, Issue 13
    Sep 2023
    1305 pages

    Publisher

    Kluwer Academic Publishers

    United States

    Publication History

    Published: 06 April 2023
    Accepted: 04 March 2023

    Author Tags

    1. NUMA-awareness
    2. OmpSs-2
    3. Parallel programming model
    4. Scheduling
    5. Task-aware

    Qualifiers

    • Research-article

    Funding Sources

    • DEEP-SEA
