Pinpointing data locality bottlenecks with low overhead
X Liu, J Mellor-Crummey - 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2013 - ieeexplore.ieee.org
A wide gap exists between the speed of modern processors and memory subsystems. As a result, long latencies associated with fetching data from memory often significantly degrade execution performance. To aid program tuning, application developers need tools that analyze memory access patterns and guide them in reusing data in the fastest levels of a system's memory hierarchy. In this paper, we describe a novel, efficient, and effective tool for data locality measurement and analysis. Unlike other tools, our tool uses both statistical PMU sampling to quantify the cost of data locality bottlenecks and cache simulation to compute reuse distance to diagnose the causes of locality problems. This approach enables us to collect rich information that provides insight into a program's data locality problems. Our tool attributes quantitative measurements of observed memory latency to program variables and dynamically allocated data, as well as to code. Our tool identifies the data touched by reuse pairs and the accesses involved, each attributed to its full calling context. Finally, our tool employs both sampling and parallelization to accelerate the computation of representative reuse distance information. Experiments show that with an overhead of only about 13%, our tool provides detailed insights that enabled us to make non-trivial improvements to memory-bound HPC benchmarks.
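To make the notion of reuse distance concrete, the sketch below computes, for each access in an address trace, the number of distinct addresses touched since the previous access to the same address. This is a minimal, naive quadratic-time illustration under assumed conditions, not the authors' implementation, which combines PMU sampling with online cache simulation; all names here are illustrative.

```cpp
// Minimal sketch of reuse-distance computation over an address trace.
// Naive O(N*M) scan for illustration only; production tools use
// tree-based (Olken-style) algorithms or cache simulation.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// For each access, return the number of distinct addresses touched since
// the previous access to the same address (-1 marks a first access).
std::vector<long> reuse_distances(const std::vector<uint64_t>& trace) {
    std::unordered_map<uint64_t, size_t> last_access;  // address -> last trace index
    std::vector<long> dist(trace.size(), -1);
    for (size_t i = 0; i < trace.size(); ++i) {
        auto it = last_access.find(trace[i]);
        if (it != last_access.end()) {
            std::unordered_set<uint64_t> distinct;  // unique addresses between the reuse pair
            for (size_t j = it->second + 1; j < i; ++j)
                distinct.insert(trace[j]);
            dist[i] = static_cast<long>(distinct.size());
        }
        last_access[trace[i]] = i;
    }
    return dist;
}

int main() {
    // Trace A B C A: the second access to A has reuse distance 2 (B and C in between).
    std::vector<uint64_t> trace = {0xA0, 0xB0, 0xC0, 0xA0};
    for (long d : reuse_distances(trace)) std::cout << d << ' ';
    std::cout << '\n';  // prints: -1 -1 -1 2
    return 0;
}
```

A reuse pair whose distance exceeds the number of blocks a cache level can hold indicates a likely miss at that level, which is why attributing such pairs to their full calling contexts helps locate locality bottlenecks in source code.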