Article

Profiling Directed NUMA Optimization on Linux Systems: A Case Study of the Gaussian Computational Chemistry Code

Author: Peter Strazdins

IPDPS '11: Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium

Pages 1046 - 1057

https://doi.org/10.1109/IPDPS.2011.100

Published: 16 May 2011

Abstract

The parallel performance of applications running on Non-Uniform Memory Access (NUMA) platforms is strongly influenced by the placement of memory pages relative to the threads that access them. Linux therefore provides application programming interfaces (APIs) to control this placement. For large parallel codes it can, however, be difficult to determine how and when to use these APIs. In this paper we introduce the NUMAgrind profiling tool, which can be used to simplify this process. It extends the Valgrind binary translation framework to include a model which incorporates cache coherency, memory locality domains and interconnect traffic for arbitrary NUMA topologies. Using NUMAgrind, cache misses can be mapped to memory locality domains, page access modes determined, and pages that are referenced by multiple threads quickly identified. We show how the NUMAgrind tool can be used to guide the use of Linux memory and thread placement APIs in the Gaussian computational chemistry code. The performance of the code before and after use of these APIs is also presented for three different commodity NUMA platforms.
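To make the kind of per-page analysis described in the abstract concrete, the following is a minimal toy sketch, not NUMAgrind's implementation. It assumes a hypothetical access trace of `(thread_id, address, is_write)` tuples, a fixed 4096-byte page size, a hypothetical `thread_node` map from threads to memory locality domains, and first-touch page placement. The names `PageStats` and `profile` are illustrative only.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page size in bytes


@dataclass
class PageStats:
    home_node: int                      # locality domain the page is placed on
    threads: set = field(default_factory=set)
    reads: int = 0
    writes: int = 0
    remote: int = 0                     # accesses from a node other than home_node

    @property
    def mode(self):
        # Access mode of the page over the whole trace
        return "read-write" if self.writes else "read-only"

    @property
    def shared(self):
        # True if more than one thread referenced the page
        return len(self.threads) > 1


def profile(trace, thread_node):
    """Aggregate a trace of (thread_id, address, is_write) per page.

    thread_node maps each thread id to the locality domain it runs on.
    Pages are assumed to be homed by first touch: the node of the first
    thread in the trace to access a page becomes its home node.
    """
    pages = {}
    for tid, addr, is_write in trace:
        page = addr // PAGE_SIZE
        node = thread_node[tid]
        stats = pages.setdefault(page, PageStats(home_node=node))
        stats.threads.add(tid)
        if is_write:
            stats.writes += 1
        else:
            stats.reads += 1
        if node != stats.home_node:
            stats.remote += 1
    return pages


if __name__ == "__main__":
    # Hypothetical trace: thread 0 runs on node 0, thread 1 on node 1.
    trace = [
        (0, 0x1000, True),   # thread 0 first-touches page 1
        (1, 0x1008, False),  # thread 1 reads page 1 remotely
        (1, 0x2000, False),  # thread 1 first-touches page 2
        (1, 0x2010, False),
    ]
    for page, s in sorted(profile(trace, {0: 0, 1: 1}).items()):
        print(page, s.home_node, s.mode, "shared" if s.shared else "private", s.remote)
```

In this toy model, a page that is shared and has many remote accesses is a candidate for explicit placement or for moving the accessing threads, which is the sort of decision a NUMA profile is meant to inform.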

Published In

IPDPS '11: Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium

May 2011, 1285 pages

ISBN: 9780769543857

Publisher: IEEE Computer Society, United States
