DOI: 10.1145/3447818.3460361

NumaPerf: predictive NUMA profiling

Published: 04 June 2021

Abstract

    Achieving optimal performance for parallel applications on a NUMA architecture is extremely challenging, which makes profiling tools necessary. However, existing NUMA profiling tools share similar shortcomings in portability, effectiveness, and helpfulness. This paper proposes a novel profiling tool, NumaPerf, that overcomes these issues. NumaPerf aims to identify potential performance issues for any NUMA architecture, not only for the current hardware. To achieve this, NumaPerf focuses on memory sharing patterns between threads rather than on actual remote accesses. NumaPerf further detects potential thread migrations and load imbalance issues that could significantly affect performance but are omitted by existing profilers. NumaPerf also separately identifies cache coherence issues, which may require different fix strategies. Based on our extensive evaluation, NumaPerf can identify more performance issues than any existing tool, and fixing them leads to significant performance speedup.
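
The idea of examining sharing patterns between threads, rather than remote accesses observed on one machine, can be illustrated with a small sketch. This is not NumaPerf's implementation: the trace format, the function name `find_sharing`, and the page and cache-line sizes are all assumptions chosen for illustration. The sketch flags pages touched by more than one thread (which could become remote accesses on some NUMA topology) and, separately, cache lines shared by several threads with at least one writer (a possible cache coherence issue).

```python
from collections import defaultdict

PAGE_SIZE = 4096       # assumed page size in bytes
CACHE_LINE_SIZE = 64   # assumed cache-line size in bytes


def find_sharing(trace):
    """Hypothetical helper: trace is an iterable of
    (thread_id, address, is_write) tuples."""
    page_threads = defaultdict(set)   # page number -> threads that touched it
    line_threads = defaultdict(set)   # cache-line number -> threads that touched it
    line_writers = defaultdict(set)   # cache-line number -> threads that wrote it

    for thread_id, address, is_write in trace:
        page_threads[address // PAGE_SIZE].add(thread_id)
        line = address // CACHE_LINE_SIZE
        line_threads[line].add(thread_id)
        if is_write:
            line_writers[line].add(thread_id)

    # Pages touched by several threads may incur remote accesses whenever
    # those threads run on different nodes, independent of the current machine.
    shared_pages = sorted(p for p, t in page_threads.items() if len(t) > 1)

    # Lines touched by several threads with at least one writer may generate
    # coherence traffic; they are reported separately from page sharing.
    contended_lines = sorted(
        l for l, t in line_threads.items()
        if len(t) > 1 and line_writers.get(l)
    )
    return shared_pages, contended_lines


# Made-up access trace for illustration only.
trace = [
    (0, 0x1000, True),    # thread 0 writes page 1, line 64
    (1, 0x1008, False),   # thread 1 reads the same line
    (0, 0x3000, True),    # private to thread 0
    (1, 0x5000, False),   # page 5, read by thread 1
    (1, 0x5040, False),
    (2, 0x5080, False),   # page 5, read by thread 2 on a different line
]
shared_pages, contended_lines = find_sharing(trace)
print(shared_pages, contended_lines)
```

In this toy trace, page 5 is shared by two threads but has no contended cache line, while page 1 has both kinds of sharing; keeping the two reports apart reflects the abstract's point that they may call for different fixes.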



    Published In

    ICS '21: Proceedings of the 35th ACM International Conference on Supercomputing
    June 2021
    506 pages
    ISBN: 9781450383356
    DOI: 10.1145/3447818
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. NUMA profiling
    2. performance optimization
    3. profiler

    Qualifiers

    • Research-article

    Conference

    ICS '21

    Acceptance Rates

    ICS '21 Paper Acceptance Rate 39 of 157 submissions, 25%;
    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Article Metrics

    • Downloads (last 12 months): 103
    • Downloads (last 6 weeks): 13

    Cited By

    • (2023) NUMAlloc: A Faster NUMA Memory Allocator. Proceedings of the 2023 ACM SIGPLAN International Symposium on Memory Management, pages 97-110. DOI: 10.1145/3591195.3595276. Online publication date: 6-Jun-2023
    • (2023) Scaling Up Performance of Managed Applications on NUMA Systems. Proceedings of the 2023 ACM SIGPLAN International Symposium on Memory Management, pages 1-14. DOI: 10.1145/3591195.3595270. Online publication date: 6-Jun-2023
    • (2023) VClinic: A Portable and Efficient Framework for Fine-Grained Value Profilers. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 892-904. DOI: 10.1145/3575693.3576934. Online publication date: 27-Jan-2023
    • (2023) Divide&Content: A Fair OS-Level Resource Manager for Contention Balancing on NUMA Multicores. IEEE Transactions on Parallel and Distributed Systems, 34:11, pages 2928-2945. DOI: 10.1109/TPDS.2023.3309999. Online publication date: 1-Nov-2023
    • (2023) Mitigating the NUMA effect on task-based runtime systems. The Journal of Supercomputing, 79:13, pages 14287-14312. DOI: 10.1007/s11227-023-05164-9. Online publication date: 6-Apr-2023
    • (2022) MemGaze: Rapid and Effective Load-Level Memory Trace Analysis. 2022 IEEE International Conference on Cluster Computing (CLUSTER), pages 484-495. DOI: 10.1109/CLUSTER51413.2022.00058. Online publication date: Sep-2022
    • (2021) Improving GHC Haskell NUMA profiling. Proceedings of the 9th ACM SIGPLAN International Workshop on Functional High-Performance and Numerical Computing, pages 1-12. DOI: 10.1145/3471873.3472974. Online publication date: 22-Aug-2021
