We evaluate spatial affinity metrics using a set of representative HPC application benchmarks that vary in data structures, storage formats, data layouts, and algorithms. We describe the novel insights gained from affinity analysis. As a baseline, we compare against reuse distance (\(\textsf {RD}\)) and show the effectiveness of affinity metrics in distinguishing application performance.
Each trace consists of samples of access sequences (~250) and includes the instruction address, memory address, and sample ID for each access. The length of the sequences in a sample, as well as the trace size, depends on the sampling rate. The sampling rate varies between benchmarks: miniVite uses 5M, HiParTI-HICOO uses 4M, and XSBench uses 5M. Trace collection imposes overheads of 2.5 × to 6 ×. Trace files for the HiParTI-HICOO matrix variants, with 220 K to 4.7 M access sequences, have sizes from 11 MB to 233 MB.
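For illustration, a per-access trace record can be sketched as follows (a minimal C++ sketch; the field names and types are ours and not the exact trace format):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical layout of one sampled access; field names are illustrative only.
struct AccessRecord {
    uint64_t instruction_addr;  // PC of the sampled load/store
    uint64_t memory_addr;       // effective address that was accessed
    uint32_t sample_id;         // identifies the sample this access belongs to
};

// A sample groups a short sequence of consecutive accesses.
using Sample = std::vector<AccessRecord>;
```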
7.1 HiParTI-HICOO
To study sparse matrix storage formats and tensor data layouts, we choose the HiParTI suite [
22]. In HiParTI, SpMM kernels incorporate matrix storage formats such as
compressed sparse row (CSR),
coordinate (COO), and HICOO [
23]. It also includes multiple tensor reordering variants [
24]. HICOO is a compressed
block storage format for sparse tensors and matrices; it derives from and improves upon the COO format.
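For reference, the COO and CSR layouts can be sketched as follows (a simplified illustration, not HiParTI's actual data structures; HICOO additionally compresses indices within small blocks to shrink index storage):

```cpp
#include <cstdint>
#include <vector>

// Coordinate (COO) format: one (row, col, value) triple per non-zero.
struct CooMatrix {
    std::vector<uint32_t> row;   // row index of each non-zero
    std::vector<uint32_t> col;   // column index of each non-zero
    std::vector<double>   val;   // value of each non-zero
};

// Compressed sparse row (CSR) format: non-zeros grouped by row.
struct CsrMatrix {
    std::vector<uint64_t> rowptr;  // rowptr[i]..rowptr[i+1] delimit row i's non-zeros
    std::vector<uint32_t> col;     // column index of each non-zero
    std::vector<double>   val;     // value of each non-zero
};
```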
7.1.1 Tensor reordering variants.
For computational efficiency, sparse tensor data is generally reordered (indices relabeled) to improve data access locality. We analyze locality patterns of tensor reordering variants: Default (no reorder), Random (random order), BFS (breadth first search-like heuristic approach) and Lexi (lexicographical order) [
All variants are integrated in an MTTKRP kernel implementation with the HICOO storage format. The benchmark is run with nell-2 [35], a third-order tensor with 77 M nonzeros and a density of 2.4 × 10\(^{-5}\).
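As a rough illustration of what a reordering variant does, the sketch below relabels the indices of one tensor mode with a precomputed permutation (our own simplified code, not the HiParTI implementation; BFS and Lexi differ only in how the permutation is derived):

```cpp
#include <cstdint>
#include <vector>

// Third-order tensor in COO form: parallel index arrays plus values.
struct CooTensor3 {
    std::vector<uint32_t> i, j, k;  // indices per mode
    std::vector<double>   val;
};

// Relabel the indices of mode 0 using perm, where perm[old_index] = new_index.
// Reordering changes which non-zeros end up close together, and therefore the
// locality of the MTTKRP accesses into the factor matrices.
void relabel_mode0(CooTensor3& t, const std::vector<uint32_t>& perm) {
    for (auto& idx : t.i) idx = perm[idx];
}
```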
Aff. Heatmap. Figure
4 shows affinity heatmaps for the hottest memory region. The region includes two objects: the factor matrices and the output. The heatmaps show different affinity patterns across reference locations (horizontal axis) for each variant. We make four important observations from the heatmaps.
First, consider the
\(\textsf {SD}\) metric. Figures
4a and
4b show Default and Random variants. In these signatures notice that (a) the shaded box in the
\(\textsf {SD}\) heatmap shows sparse and irregular affinity, (b) the shaded box in the
\(\textsf {SI}\) heatmap for the same locations shows low (good) values, and (c) ① shows a large range of contiguous affinity locations (offsets on the vertical axis, +61 to -7 for Default and +42 to -15 for Random) that extends beyond the shaded box. These observations indicate that distant affinity locations are accessed at close intervals and that there is minimal correlation between accesses to a block and its neighbors. Hence the two variants have no spatial-temporal locality. Figures
4c and
4d show BFS and Lexi variants. In these signatures, observe that (a) the shaded box in the
\(\textsf {SD}\) heatmap shows high affinity to closely located offset locations, (b) the shaded box in the
\(\textsf {SI}\) heatmap for the same locations shows increasing
\(\textsf {SI}\) values with increasing offset distance, and (c) the affinity neighborhood in ① shows a smaller range (+5 to -4 for BFS, and +7 to -7 for Lexi). This pattern, with high affinity to adjacent locations and good intervals, shows that BFS and Lexi exhibit good spatial-temporal locality and comply with guideline §
2.1 for
good performance.
Second, consider \(\textsf {SD}^{*}\) for self. ② combines values from \(\textsf {SD}\) and \(\textsf {SI}\) heatmaps, and shows that \(\textsf {SD}^{*}\) is low across all variants. Though BFS and Lexi have slightly better \(\textsf {SD}^{*}\), the improvement is small. This suggests that although temporal reuse increases (along with caching opportunities) in BFS and Lexi, the impact will be minor.
Third, consider the
\(\textsf {SA}\) metric. Figures
4a and
4b show Default and Random variants. In these signatures notice that (a) the shaded box in the
\(\textsf {SA}\) heatmap shows sparse
\(\textsf {SA}\) values, (b)
\(\textsf {SI}\) heatmap for the same locations shows low (good) values, and (c) ① shows that both
\(\textsf {SA}\) and
\(\textsf {SI}\) extend to a wider range of affinity locations.
\(\textsf {SA}\) values are scattered between affinity locations with minimal access proximity between neighbors, suggesting that the access pattern is irregular and spatially sparse in Default and Random. In contrast, in Figs.
4c and
4d for BFS and Lexi notice that (a) the shaded box in the
\(\textsf {SA}\) heatmap shows better
\(\textsf {SA}\) and high access proximity to adjacent locations, and (b) as discussed in
\(\textsf {SD}\), both
\(\textsf {SI}\) values and the affinity neighborhood range are consistent with good performance. For BFS and Lexi, these observations point to irregular access but with good spatial-temporal locality.
Fourth, consider \(\textsf {SA}^{*}\) for the + 1 offset location. ③ combines values from the \(\textsf {SA}\) and \(\textsf {SI}\) heatmaps and shows that \(\textsf {SA}^{*}\) is high for both BFS and Lexi, whereas Default and Random have no measurable value. This highlights that BFS and Lexi have high anticipatory spatial-temporal locality to the + 1 offset location and will benefit greatly from hardware prefetching.
We conclude that (a) affinity metrics distinguish the performance of the variants and (b) they explain that the tensor reorderings primarily improve \(\textsf {SA}\) rather than \(\textsf {SD}\). The latter means that the reorderings primarily depend upon hardware prefetching for impact. Our experiments with disabled prefetchers confirm that runtime degrades by 8% to 28% for Lexi and BFS, over a range of large tensors (
nell-2 and
nell-1 [
35],
freebase-music [
17]).
Aff. Signal. We study affinity over the
memory address space for BFS and Lexi with affinity signal plots (Figs.
5a and
5b) that show
\(\textsf {SA}^{*}\) and
\(\textsf {SD}^{*}\) for offset locations (− 1 to + 1). We focus on
\(\textsf {SA}^{*}\) values in the plots, as
\(\textsf {SD}^{*}\) is low for all blocks in both variants.
\(\textsf {SA}^{*}\) for + 1 offset locations in BFS is stagnant at 0.25, whereas Lexi has a significant number of blocks at 0.5. This shows that Lexi has more blocks with higher affinity to adjacent locations than BFS, and these blocks leverage the hardware prefetcher to decrease load latency. The access frequency plots at the bottom show that Lexi has less access frequency variation between blocks than BFS. This indicates that memory locations in Lexi are accessed more frequently, with more use of the same location, consistent with more reuse and perhaps more memory-level parallelism. Again, we see that
\(\textsf {SA}^{*}\) clearly explains the better performance in Lexi.
Aff. Histogram. Figure
6a and Fig.
6b show the distribution of affinity pairs for
\(\textsf {SA}^{*}\) and
\(\textsf {SD}^{*}\) for all variants. In these plots, a distribution is better when it has more area that is skewed “up and to the right”. Figure
6a shows that Lexi has more affinity pairs (upper section) and affinity pairs at all score levels, including high values (right section). Though Default and Random have a few affinity pairs at high score values, their
weighted affinity (section
3.5) is low. The distribution in Fig.
6b shows that pairs in all variants are concentrated towards the lower
\(\textsf {SD}^{*}\) values (left section). Though Lexi has a better
\(\textsf {SD}^{*}\) distribution, the plot does not identify a clear winner.
Thus we see that affinity heatmaps, signals, and histograms all explain the better performance of BFS and Lexi by focusing attention on the impact on \(\textsf {SA}^{*}\).
Aff. Vector. First, consider the baseline locality metric \(\textsf {RD}\): table 1 shows the best (lowest) value for the Default and Random variants, and the worst value for BFS. We observe that
\(\textsf {RD}\) can be highly misleading as an indicator of application performance.
Second, consider
realized scores. Recall that
the realized score quantifies affinity to adjacent locations in the current layout.
\(\textsf {SA}_{r}^{*}\) in table
1 shows high values for Lexi and BFS, indicating the benefits of reordering. Lower
realized \(\textsf {SA}_{r}^{*}\) values for Default and Random represent the sparse layout and access pattern. Though
realized \(\textsf {SD}_{r}^{*}\) is higher for Lexi, the minimal range of values does not distinguish between the variants.
Third, consider
potential scores. Recall that
potential scores represent a possible value for affinity under the assumption that the layout can be changed dynamically at no cost; in practice, such re-layout introduces overheads.
Potential vectors in table
1 also show high values for Lexi. In Lexi, the increase in
potential score for
\(\textsf {SA}_{r}^{*}\) is attributed to the − 1 offset location. To explain, recall that the − 1 offset location has high temporal proximity (lower
\(\textsf {SI}\), high
\(\textsf {SA}\) in Fig.
4d and high
\(\textsf {SA}^{*}\) in Fig.
5b). It is possible that the − 1 offset location is already present in the cache from a prior access, so the actual realized affinity is higher.
Potential \(\textsf {SD}_{r}^{*}\) values are similar to the realized ones, as \(\textsf {SD}\) values are concentrated towards hot-lines; as with the realized \(\textsf {SD}_{r}^{*}\), the values are low and do not distinguish between variants.
7.1.2 Matrix variants.
We analyze the memory locality effects of the sparse matrix storage formats CSR, COO, and HICOO in an SpMM kernel. The SpMM kernel computes C = A × B, where A is a sparse matrix and B and C are dense matrices. In our configuration, A (blockqp1 [6]) has a density of 1.77 × 10\(^{-4}\) and its block structure is shown in table 2; the number of columns in B is set to 4096. We focus on efficient parallel implementations and select the CSR, COO-Reduce, and HICOO-S variants for analysis. CSR parallelizes over rows of A. COO-Reduce parallelizes over the non-zeros and uses a buffer for C. HICOO-S parallelizes over compressed blocks.
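For concreteness, a minimal row-parallel CSR SpMM sketch is shown below (our own simplified code using an OpenMP pragma, not the HiParTI kernel); each row of A is processed independently, and the inner loop streams through full rows of B, which is the access pattern analyzed next:

```cpp
#include <cstdint>
#include <vector>

// C (m x n, row-major, dense) += A (m x k, CSR, sparse) * B (k x n, row-major, dense).
void spmm_csr(int m, int n,
              const std::vector<uint64_t>& rowptr,
              const std::vector<uint32_t>& colidx,
              const std::vector<double>& aval,
              const std::vector<double>& B,
              std::vector<double>& C) {
    #pragma omp parallel for
    for (int i = 0; i < m; ++i) {
        double* crow = &C[(size_t)i * n];
        for (uint64_t p = rowptr[i]; p < rowptr[i + 1]; ++p) {
            const double  a    = aval[p];
            const double* brow = &B[(size_t)colidx[p] * n];
            for (int j = 0; j < n; ++j)       // streams through a full row of B
                crow[j] += a * brow[j];
        }
    }
}
```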
Aff. Heatmap. Figure 7 shows heatmaps for the hottest memory region, matrix B, in all variants. It shows uniform locality patterns where the affinity pattern remains the same across all reference locations (horizontal axis). We make four observations from the heatmaps.
First, consider the \(\textsf {SD}\) metric. In all three variants, (a) the yellow box in the \(\textsf {SD}\) heatmap shows that the variants have high \(\textsf {SD}\) values for the self location only, (b) the yellow box in the \(\textsf {SI}\) heatmap shows low (best) \(\textsf {SI}\) values for the same affinity location self, and (c) ① shows that the affinity range contains the self location only. These observations indicate that no other affinity locations are accessed within the lifetime of each reference block, and blocks are reused within a short interval.
Second, consider \(\textsf {SD}^{*}\) for self. ② combines \(\textsf {SD}\) and \(\textsf {SI}\) heatmaps and shows high values in all variants, indicating that reuse is high, with the best values in COO-Reduce.
Third, consider the
\(\textsf {SA}\) metric. For CSR in Fig.
7a and HICOO-S in Fig.
7c,
\(\textsf {SA}\) is high and similar across contiguous locations, and
\(\textsf {SI}\) remains low (good) for all reference blocks. For COO-Reduce in Fig.
7b,
\(\textsf {SA}\) is not uniform and its values are insignificant, as it incorporates the highest access counts of the reference blocks; note, however, that it is accompanied by good
\(\textsf {SI}\) values. Though
\(\textsf {SA}\) for COO-Reduce differs from the other two,
\(\textsf {SI}\) remains similar. These observations point to regular, forward traversal access patterns in all three variants.
Fourth, consider \(\textsf {SA}^{*}\) for the + 1 offset location. ③ combines the \(\textsf {SA}\) and \(\textsf {SI}\) heatmaps and shows high values in the CSR and HICOO-S variants, indicating that they effectively leverage the prefetcher. Though \(\textsf {SA}^{*}\) for the + 1 offset location is low in COO-Reduce, that location is still accessed in a temporally forward direction.
To summarize, for all variants (a) the range of affinity blocks is
self, (b) blocks have high reuse, and (c)
\(\textsf {SI}\) heatmaps show effective prefetching. These observations point to regular, strided access with high spatial-temporal locality and suggest that all variants follow guideline section
2.1 behavior for
good performance, but there are differences in the rate of reuse. We conclude that same-line locality is most important for the matrix variants. Although this type of locality does not strictly correspond to either caching or prefetch schemes, it is important and captured by our affinity metrics.
Aff. Vector. Baseline
\(\textsf {RD}\) values for all variants in table
2 are equal (best value of ~1) and indicate similar behavior, whereas distinct
\(\textsf {SD}^{*}\) values show the trade-off between storage format and locality. Note that COO-Reduce with the highest
\(\textsf {SD}^{*}\) value (high reuse) is not the best-performing variant, because it requires additional buffering, as shown by its very high access counts (
\(\textsf {A}\)). Though CSR and HICOO-S have smaller
\(\textsf {SD}^{*}\) values, they are efficient because of effective data reuse based on instruction level parallelism (
\(\textsf {A}\) values in table 2 are 4 × to account for SSE).
It is worth noting that higher reuse, measured by \(\textsf {SD}^{*}\) values, translates to superior performance within the bounds of access frequency. So, spatial affinity metrics combined with other characteristics such as access frequency provide complete information about performance. At the same time, spatial affinity metrics provide a more precise spatial-temporal locality measure than reuse distance (even when the latter has the best value, ~1).
7.2 miniVite
miniVite [
10] is a graph benchmark for community detection that uses Louvain optimization. Variants in our analysis use different hash table implementations for the hottest
map object. v1 uses a C++
unordered_map, an open hash table with an array of buckets, each holding a linked list of items, and hence irregular accesses. v2 and v3 use TSL
hopscotch [
15,
38], a closed hash table that stores items in a contiguous array and replaces irregular accesses with strided ones. v2 uses the default table size and does dynamic resizing. v3 avoids resizing by specifying the right size for each instance.
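The difference between the variants can be sketched as follows (illustrative declarations only; the key/value types, sizes, and the make_map_v3 helper are ours, not miniVite's code). tsl::hopscotch_map stores its elements in a contiguous bucket array, so probing is strided rather than pointer-chasing:

```cpp
#include <cstddef>
#include <unordered_map>
#include <tsl/hopscotch_map.h>   // TSL hopscotch-map library

using Key = long long;
using Val = double;

// v1: node-based open hashing; lookups may chase list pointers (irregular accesses).
std::unordered_map<Key, Val> map_v1;

// v2: closed hashing over a contiguous array; default size, grows (and rehashes) on demand.
tsl::hopscotch_map<Key, Val> map_v2;

// v3: same structure as v2, but sized up front so no resizing occurs in the hot loop.
tsl::hopscotch_map<Key, Val> make_map_v3(std::size_t expected_elems) {
    tsl::hopscotch_map<Key, Val> m;
    m.reserve(expected_elems);   // assumes the per-instance size is known in advance
    return m;
}
```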
Aff. Heatmap. Figure
8 shows heatmaps for the
map object, with varying affinity patterns among reference blocks (horizontal axis) in each variant. We begin with five observations.
First, consider the
\(\textsf {SD}\) metric. In Fig.
8a for v1 observe that (a) the shaded box in the
\(\textsf {SD}\) heatmap shows sparse and scattered affinity to contiguous locations, (b) the shaded box in the
\(\textsf {SI}\) heatmap shows low (good)
\(\textsf {SI}\) values for these locations, and (c) ① shows that the range of affinity locations extends to a wider neighborhood (+14 to -10 offset locations on the vertical axis). Scattered
\(\textsf {SD}\) values within the range, along with good
\(\textsf {SI}\) indicate that distant locations are accessed in close temporal proximity. This pattern points to negligible spatial and temporal affinity to adjacent locations, reflective of irregular accesses in the linked list. For v2 and v3 in Fig.
8b and Fig.
8c, note that (a) the shaded box in the
\(\textsf {SD}\) heatmap shows
\(\textsf {SD}\) values that are congregated towards same-line and adjacent locations, (b) the shaded box in the
\(\textsf {SI}\) heatmap shows a trend of increasing
\(\textsf {SI}\) with offset distance, and (c) ① for v2 and v3 shows affinity to a small range of adjacent locations (v2: +4 to -7, v3: +7 to -8). Concentrated
\(\textsf {SD}\) along with a good trend in
\(\textsf {SI}\) values indicates better affinity to adjacent locations, evidence of the strided traversal of
map.
Second, consider \(\textsf {SD}^{*}\) for self. ② combines \(\textsf {SD}\) and \(\textsf {SI}\) heatmaps and shows low \(\textsf {SD}^{*}\) values in all variants. v1’s \(\textsf {SD}^{*}\) values are very low. Both v3 and v2 have better (slightly higher) \(\textsf {SD}^{*}\) values, as the traversal limits the reuse of blocks. We infer that cache-friendly and same-line reuse improves in v2 and v3, but it is still relatively low in all variants.
Third, consider the
\(\textsf {SA}\) metric. In Fig.
8a for v1 notice that (a) the shaded box in the
\(\textsf {SA}\) heatmap shows non-uniform
\(\textsf {SA}\) values that are spread over contiguous locations, (b) as noted in
\(\textsf {SD}\) discussions,
\(\textsf {SI}\) values also favor distant locations, and (c) ① shows v1’s anticipation extends to a wide range of contiguous locations.
\(\textsf {SA}\) correlation to distant locations shows that there is no anticipatory access to spatially adjacent locations. In Fig.
8b and Fig.
8c for v2 and v3, (a) the shaded box in the
\(\textsf {SA}\) heatmap shows high
\(\textsf {SA}\) values for closer neighbors, especially in v3, (b) as noted in
\(\textsf {SD}\) discussions,
\(\textsf {SI}\) values show a preference for adjacent locations, and (c) ① for v2 and v3 shows anticipation to adjacent locations. High
\(\textsf {SA}\) values along with good
\(\textsf {SI}\) within a much smaller range of adjacent locations indicate anticipatory access with better spatial locality in v3.
Fourth, consider
\(\textsf {SA}^{*}\) to + 1 offset location. For v1 in Fig.
8a ③ (combines
\(\textsf {SA}\) and
\(\textsf {SI}\) heatmaps) shows negligible values, indicating that v1 has no prefetching advantage. From ③ in Fig.
8b and Fig.
8c, we note that v3 has more reference locations with impactful
\(\textsf {SA}^{*}\) to the + 1 offset location, pointing to more beneficial prefetching than in v2.
Finally, temporal locality to hot-lines (④, the upper band in all heatmaps) remains significant and similar across all variants.
Interestingly, no variant has affinity relations that follow guideline section
2.1 for
good performance, but the
\(\textsf {SA}\) and
\(\textsf {SI}\) heatmaps show improving spatial locality from v1 to v3.
In summary, changing map’s data structure from an open to a closed hash table (a) improves \(\textsf {SA}\) more than \(\textsf {SD}\) for adjacent locations, and (b) though caching (\(\textsf {SD}\)) for adjacent locations remains low, it is high for hot-lines in all variants. We conclude that the primary explanation of v2’s and v3’s performance is a data structure that exploits hardware prefetching.
Aff. Histogram. Figure
9 shows histogram plots; recall that distribution is better when it has more area that is skewed “up and to the right”. Figure
9a for
\(\textsf {SA}^{*}\) shows that v3 has high distribution of affinity pairs (upper section), with a greater number of pairs at the higher score values (right section). v3 has far more impactful affinity as the weighted value of the area under the curve increases towards high
\(\textsf {SA}^{*}\) values. Figure
9b for
\(\textsf {SD}^{*}\) shows that both v3 and v2 have a better distribution of affinity pairs than v1. Though v2 has a distribution almost comparable to v3’s, a significant portion of the accesses in v2 are unnecessary (resizing) and hinder performance. Again,
\(\textsf {SD}^{*}\) shows an improving trend from v1 to v3, but
\(\textsf {SA}^{*}\) shows better variation and explains the better performance of v3.
Aff. Vector. First, similar values for baseline
\(\textsf {RD}\) in table
3 for v1 and v3 fail to differentiate the performance of these variants.
Second, consider realized scores for quantifying the current layout. \(\textsf {SA}_{r}^{*}\) shows a higher value for v3 when compared to v1, reflecting the beneficial prefetching. Realized \(\textsf {SD}_{r}^{*}\) shows a higher value for v3 than for v1, indicating better reuse. In both cases, though v2 has the highest scores, the impact of the affinity score is reduced as the accesses include added copying along with traversal through the array.
Third, consider potential scores for possible layout changes and their impact on affinity. Potential \(\textsf {SA}_{r}^{*}\) also shows a higher value for v3. Possible options for the potential score are spread over a small neighborhood of adjacent locations within the range, indicating that re-layout can lead to better locality. For all variants, \(\textsf {SD}_{r}^{*}\) does not show a difference between realized and potential scores, as \(\textsf {SD}\) values are concentrated among the hot-lines.
7.3 Alpaca.cpp
Alpaca.cpp [
9] is an LLM inferencing application implemented in C++, parallelized using Pthreads. It combines efforts such as LLaMA [
39] and Alpaca [
37]. Inferencing in our evaluation uses the LLaMA 7B parameter model quantized with 4-bit integers (
ggml-alpaca-7b-q4.bin) with seed value 1685480810, and default values for other parameters.
Among the 12K lines of code (7 MB binary), location analysis highlights two regions, x and y, in the hotspot function ggml_vec_dot_q4_0. Region x corresponds to src0_row, with spatially sparse accesses identified by hot-access analysis; region y corresponds to src1_col and is a hot-contiguous region.
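To make the two access patterns concrete, the sketch below shows a simplified 4-bit block-quantized dot product (our own illustration; it is not the actual ggml_vec_dot_q4_0 code and the block layout is simplified): the quantized operand is read block by block, while the other operand is streamed strictly forward:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative 4-bit block: one scale plus 32 packed 4-bit quants (not the exact ggml layout).
struct BlockQ4 {
    float   d;        // per-block scale
    uint8_t qs[16];   // 32 quantized values, two per byte
};

// Dot product of a quantized row (blocks) with a contiguous float vector.
// The float operand is streamed strictly forward, a prefetch-friendly pattern.
float dot_q4_f32(const BlockQ4* row, const float* col, size_t nblocks) {
    float sum = 0.0f;
    for (size_t b = 0; b < nblocks; ++b) {
        const BlockQ4& blk = row[b];
        float bsum = 0.0f;
        for (int i = 0; i < 16; ++i) {
            const int lo = (blk.qs[i] & 0x0F) - 8;   // low nibble, centered
            const int hi = (blk.qs[i] >> 4)   - 8;   // high nibble, centered
            bsum += lo * col[b * 32 + 2 * i] + hi * col[b * 32 + 2 * i + 1];
        }
        sum += blk.d * bsum;
    }
    return sum;
}
```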
Aff. Heatmap. Heatmaps in Fig. 10 show different locality patterns, with differing affinity patterns across reference locations (horizontal axis) for each region. We use signatures to differentiate the access patterns in the two regions.
src0_row. We start with
src0_row object and its signature in Fig.
10a. First, consider the
\(\textsf {SD}\) metric.
\(\textsf {SD}\) heatmap shows high values for
self, and ① shows that affinity is limited to + 1, −1 offset locations and
self. This indicates that only adjacent locations are accessed after access to a reference block.
Second, consider \(\textsf {SD}^{*}\) for self. ② combines the \(\textsf {SD}\) and \(\textsf {SI}\) heatmaps and shows high values of \(\textsf {SD}^{*}\) (~0.3) for all blocks, indicating that these blocks are highly reused.
Third, consider the \(\textsf {SA}\) metric. \(\textsf {SA}\) heatmaps show high values for + 1 offset than − 1 offset. \(\textsf {SI}\) heatmap also shows low (good) \(\textsf {SI}\) values for + 1 offset and high (bad) values for − 1 offset. This shows that the + 1 offset location has a high correlation and the access pattern is regular, strided access with mostly forward traversal.
Fourth, consider \(\textsf {SA}^{*}\) for the + 1 offset location. ③ shows a combination of low \(\textsf {SI}\) values and high \(\textsf {SA}\) values, implying that + 1 offset locations are accessed in close temporal proximity.
The observed pattern suggests that access behavior for
src0_row has good spatial-temporal locality as in guideline §
2.1.
src1_col. Now, we discuss
src1_col object’s signature in Fig.
10b. First, consider the
\(\textsf {SD}\) metric. In Fig.
10b observe that (a) the shaded box in the
\(\textsf {SD}\) heatmap shows scattered and sparse
\(\textsf {SD}\) values, (b) the shaded box in the
\(\textsf {SI}\) heatmap points to increasing
\(\textsf {SI}\) for positive offset locations, and high values for
self and negative offsets, and (c) ① shows a wider affinity range (+6 to -25 offset locations). Scattered affinity to locations across the range, with irregular
\(\textsf {SI}\) values, shows that the access pattern in
src1_col traverses a wide range of adjacent locations within each reference block’s lifetime.
Second, consider
\(\textsf {SD}^{*}\) for
self. ② shows high
\(\textsf {SI}\) and low
\(\textsf {SD}\) (~0.1) for
self locations, indicating that reuse remains low; this is an example of guideline section 2.5.
Third, consider the \(\textsf {SA}\) metric. Notice that (a) the shaded box in the \(\textsf {SA}\) heatmap shows uniform \(\textsf {SA}\) values for all locations, and (b) as noted in \(\textsf {SD}\) discussions, \(\textsf {SI}\) values show a preference for positive offset locations. We observe that the access pattern in src1_col has beneficial anticipation towards positive offset locations.
Fourth, consider \(\textsf {SA}^{*}\) for the + 1 offset location. ③, with its high \(\textsf {SA}^{*}\) value, shows high anticipatory spatial-temporal locality to the + 1 offset location and an access pattern that leverages the hardware prefetcher.
We validated that the decoder implementation in Alpaca.cpp uses a key-value cache for efficient inferencing. We are exploring other ways to improve locality for src1_col.
Aff. Vector. First, baseline
\(\textsf {RD}\) in table 4 shows a low value, and hence better locality, for src0_row, reflecting its good temporal locality. Second,
realized scores show high
\(\textsf {SA}_{r}^{*}\) value for
src1_col and low values for
src0_row highlighting the benefits of spatial locality and prefetching in
src1_col. Also,
realized \(\textsf {SD}_{r}^{*}\) shows high values for
src0_row, and low values for
src1_col, pointing to low reuse in
src1_col. Third,
potential score
\(\textsf {SA}_{r}^{*}\) shows that efforts to reorder accesses for improved prefetching should focus on
src1_col.
Potential \(\textsf {SD}_{r}^{*}\) does not differ from the realized score, suggesting that efforts to improve reuse might need higher levels of refactoring.
Thus, we observe that spatial affinity metrics provide more insight into data layout, access patterns, and their implications for memory performance than reuse distance analysis does.
7.4 XSBench
XSBench [
40,
41] is a proxy application that models the Monte Carlo neutron transport algorithm, specifically the calculation of macroscopic neutron cross sections. We evaluate two variants of the event-based transport model, baseline k0 and optimized k1. Baseline k0 parallelizes cross-section lookups for all particles with varying materials and follows a random access pattern. Optimized k1 sorts the particles by material and energy, facilitating efficient memory access.
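A rough sketch of the k1 optimization (our own illustration, not the XSBench source): particles are sorted by material and energy before the lookup loop, so consecutive lookups touch nearby grid data:

```cpp
#include <algorithm>
#include <vector>

struct Particle {
    int    material;  // material of the next cross-section lookup
    double energy;    // particle energy used to index the energy grid
};

// k0 processes particles in arrival order (effectively random lookups).
// k1 sorts by (material, energy) first, so neighboring iterations access
// neighboring portions of the nuclide and energy-grid data.
void sort_for_k1(std::vector<Particle>& particles) {
    std::sort(particles.begin(), particles.end(),
              [](const Particle& a, const Particle& b) {
                  return a.material != b.material ? a.material < b.material
                                                  : a.energy < b.energy;
              });
}
```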
Locality analysis identifies three regions: the hot-contiguous region material and nuclide, and two hot-access regions with sparse accesses, energy grid and grid cross section.
First, we describe signatures for the hot-contiguous region material and nuclide.
Aff. Histogram. We exclude heatmaps for
hot-contiguous regions, as both variants use the same data structure and the affinity patterns are similar. Histograms of affinity pairs for
material and
nuclide in Fig.
11a and Fig.
11b show a larger number of affinity pairs (skewed “up and to the right”) and better affinity in k1.
Aff. Vector. First, baseline
\(\textsf {RD}\) shows better (lower) values for k0, and misleads about performance. Second,
realized scores in table
5 show better
\(\textsf {SA}_{r}^{*}\) and
\(\textsf {SD}_{r}^{*}\) values for k1, reflecting the optimal accesses in k1. Third, consider
potential scores. The similarity between
realized and
potential scores suggests that the
material and
nuclide region has the best possible layout, and optimizations should focus on other regions.
Now, we focus on sparsely accessed regions (energy grid and grid cross section). We use signals to compare against baseline \(\textsf {RD}\), as the scores are not applicable due to highly sparse accesses.
Aff. Signal. Baseline
\(\textsf {RD}\) in table
6 for the sparsely accessed regions (
energy grid and
grid cross section) shows better (lower) values for k0, and misleads about the performance.
For the same regions, affinity signals are shown in Fig.
12. k0’s signal plots are shown in Fig.
12a. The bottom subplot for access frequency shows sparsely accessed blocks. The middle subplot for
self shows varying
\(\textsf {SA}^{*}\) and
\(\textsf {SD}^{*}\). The top subplot for + 1 offset shows no
\(\textsf {SA}^{*}\) to + 1 offset locations. These observations indicate worse temporal as well as spatial locality among all blocks.
k1’s signal plots are shown in Fig.
12b. The bottom subplot for access frequency shows more blocks with higher access frequency. The middle subplot for
self shows varying
\(\textsf {SA}^{*}\) and
\(\textsf {SD}^{*}\), but they are effective as the blocks have high access frequency. The top subplot for + 1 offset shows considerable
\(\textsf {SA}^{*}\) to + 1 locations. These observations indicate that k1 has better spatial-temporal locality. These signatures, with better spatial-temporal affinity patterns, reflect k1’s optimized memory accesses.
We conclude that spatial affinity signatures capture differences in access patterns for regions with vastly different footprints, and the performance of k1 depends on both caching and prefetching.