NumaPerf: predictive NUMA profiling

Authors:

Xu Liu, and Tongping Liu

ICS '21: Proceedings of the 35th ACM International Conference on Supercomputing

June 2021

Pages 52 - 62

https://doi.org/10.1145/3447818.3460361

Published: 04 June 2021

Abstract

Achieving optimal performance for parallel applications on a NUMA architecture is extremely challenging and calls for the assistance of profiling tools. However, existing NUMA profiling tools share similar shortcomings in portability, effectiveness, and helpfulness. This paper proposes a novel profiling tool, NumaPerf, that overcomes these issues. NumaPerf aims to identify potential performance issues for any NUMA architecture, rather than only for the hardware it runs on. To achieve this, NumaPerf focuses on memory sharing patterns between threads instead of actual remote accesses. NumaPerf further detects potential thread migrations and load imbalance issues, which can significantly affect performance but are omitted by existing profilers. NumaPerf also separately identifies cache coherence issues, which may require different fix strategies. Based on our extensive evaluation, NumaPerf identifies more performance issues than any existing tool, and fixing them leads to significant speedups.
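To make the idea of looking at sharing patterns rather than actual remote accesses more concrete, the following is a minimal sketch and not NumaPerf's implementation. It assumes a hypothetical access trace of `(thread_id, address, is_write)` tuples, a 4096-byte page, and a 64-byte cache line; the function name `find_sharing` and all constants are illustrative assumptions. Pages touched by more than one thread are reported as potential remote-access candidates on any topology, and cache lines touched by more than one thread with at least one write are reported separately as potential coherence issues.

```python
from collections import defaultdict

PAGE_SIZE = 4096       # assumed page size in bytes (illustrative)
CACHE_LINE_SIZE = 64   # assumed cache line size in bytes (illustrative)


def find_sharing(trace):
    """Classify sharing from a trace of (thread_id, address, is_write) tuples.

    Returns (shared_pages, contended_lines):
      shared_pages    - page numbers touched by more than one thread
      contended_lines - cache line numbers touched by more than one thread
                        with at least one write
    This is a hypothetical helper, not part of any real tool's API.
    """
    page_threads = defaultdict(set)   # page number -> threads that touched it
    line_threads = defaultdict(set)   # cache line number -> threads that touched it
    line_written = set()              # cache lines with at least one write

    for thread_id, address, is_write in trace:
        page_threads[address // PAGE_SIZE].add(thread_id)
        line = address // CACHE_LINE_SIZE
        line_threads[line].add(thread_id)
        if is_write:
            line_written.add(line)

    # Sharing is independent of where threads currently run, so the result
    # does not depend on the NUMA topology of the profiling machine.
    shared_pages = sorted(p for p, t in page_threads.items() if len(t) > 1)
    contended_lines = sorted(
        l for l, t in line_threads.items() if len(t) > 1 and l in line_written
    )
    return shared_pages, contended_lines


# Made-up trace for illustration only.
trace = [
    (0, 0x1000, True),   # thread 0 writes to page 1
    (1, 0x1008, False),  # thread 1 reads the same cache line
    (1, 0x1800, False),  # thread 1 reads a different line of page 1
    (2, 0x5000, True),   # thread 2 is the only user of page 5
]
shared_pages, contended_lines = find_sharing(trace)
print(shared_pages)
print(contended_lines)
```

Keeping the page-level and line-level results apart mirrors the abstract's point that cache coherence issues may need a different fix (for example padding) than page-level sharing (for example changing data placement).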

Index Terms

NumaPerf: predictive NUMA profiling
1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management
        Multiprocessing / multiprogramming / multitasking

Information

Published In

ICS '21: Proceedings of the 35th ACM International Conference on Supercomputing

June 2021

506 pages

ISBN:9781450383356

DOI:10.1145/3447818

General Chairs: Huiyang Zhou (North Carolina State University), Jose Moreira (IBM Research)

Program Chairs: Frank Mueller (North Carolina State University), Yoav Etsion (Technion)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

NSF (National Science Foundation)

Conference

ICS '21

Sponsor:

SIGARCH

ICS '21: 2021 International Conference on Supercomputing

June 14 - 17, 2021

Virtual Event, USA

Acceptance Rates

ICS '21 Paper Acceptance Rate 39 of 157 submissions, 25%;

Overall Acceptance Rate 629 of 2,180 submissions, 29%
