A hybrid hardware/software approach to efficiently determine cache coherence bottlenecks
J Marathe, F Mueller, B de Supinski - Proceedings of the 19th annual …, 2005 - dl.acm.org
J Marathe, F Mueller, B de Supinski
Proceedings of the 19th annual international conference on Supercomputing, 2005 • dl.acm.org
High-end computing increasingly relies on shared-memory multiprocessors (SMPs), such as clusters of SMPs, nodes of chip-multiprocessors (CMPs) or large-scale single-system image (SSI) SMPs. In such systems, performance is often affected by the sharing pattern of data within applications and its impact on cache coherence. Sharing patterns that result in frequent invalidations followed by subsequent coherence misses create cache coherence bottlenecks with significant performance penalties. Past work on identifying coherence bottlenecks based on tracing memory accesses incurs considerable runtime overhead and does not scale well with increasing problem sizes, which makes it infeasible to use with real-world programs.

In this paper, we introduce a novel low-cost, hardware-assisted approach to determine coherence bottlenecks in shared-memory OpenMP applications. We assess the merits of our approach on a contemporary SMP platform. Specifically, we assess the feasibility of lossy tracing to pinpoint coherence problems in applications. We evaluate the qualitative and quantitative trade-offs between tracing overhead and accuracy of the generated coherence traffic metrics, correlated to memory access points at the program source level.

Our lossy tracing mechanism closely approximates the degree of accuracy of determining coherence misses in full traces for most of the benchmarks we study while reducing run-time execution overhead and trace sizes by one to two orders of magnitude. To the best of our knowledge, this novel method significantly outperforms any of the prior approaches and, for the first time, makes cache coherence analysis feasible for long-running applications.
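
To make the terms above concrete, the following is a minimal illustrative sketch and not the implementation described in the paper. It assumes a toy write-invalidate protocol with unbounded caches, a 64-byte cache line, and a hypothetical trace record of the form (cpu, op, addr, site), where site stands for a source-level access point. The helper `lossy_trace`, which keeps short bursts of accesses and drops the rest, is likewise an assumption about what a lossy trace could look like; the abstract does not specify the sampling scheme.

```python
from collections import defaultdict

LINE_SIZE = 64  # assumed cache-line size in bytes


def coherence_misses(trace):
    """Count coherence misses per source site under a toy write-invalidate model.

    trace: iterable of (cpu, op, addr, site), op is "R" or "W".
    Caches are unbounded, so every miss after an invalidation is a coherence miss.
    """
    holders = defaultdict(set)      # line -> CPUs holding a valid copy
    invalidated = defaultdict(set)  # line -> CPUs whose copy was invalidated
    misses = defaultdict(int)       # site -> number of coherence misses
    for cpu, op, addr, site in trace:
        line = addr // LINE_SIZE
        if cpu in invalidated[line]:
            # This CPU lost the line to a remote write and touches it again.
            misses[site] += 1
            invalidated[line].discard(cpu)
        holders[line].add(cpu)
        if op == "W":
            # A write invalidates every other valid copy of the line.
            invalidated[line] |= holders[line] - {cpu}
            holders[line] = {cpu}
    return dict(misses)


def lossy_trace(trace, burst, gap):
    """Hypothetical lossy tracing: keep `burst` accesses, then drop `gap`."""
    period = burst + gap
    return [rec for i, rec in enumerate(trace) if i % period < burst]


# Two CPUs repeatedly write different words of the same cache line
# (false sharing); the site labels are made up for illustration.
trace = []
for _ in range(1000):
    trace.append((0, "W", 0, "a.c:10"))
    trace.append((1, "W", 8, "a.c:20"))

full = coherence_misses(trace)
sampled = lossy_trace(trace, burst=100, gap=900)
lossy = coherence_misses(sampled)

print("full trace :", len(trace), "records,", full)
print("lossy trace:", len(sampled), "records,", lossy)
```

In this toy case the lossy trace holds one tenth of the records yet attributes misses to the same two source sites as the full trace, which is the kind of overhead-versus-accuracy trade-off the abstract describes. Real workloads with irregular sharing would not necessarily sample this cleanly.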
