DOI: 10.1145/3695794.3695820

MemFriend: Understanding Memory Performance with Spatial-Temporal Affinity

Published: 11 December 2024

Abstract

In HPC applications, memory access behavior is one of the main factors affecting performance. Improving an application’s memory access behavior requires studying spatial-temporal data locality. Existing data locality analyses focus on single locations. We introduce locality metrics between pairs of memory locations that quantify three dimensions of spatial-temporal affinity: temporal access proximity, forward access correlation, and nearby access correlation. We describe methods for distinguishing between potential vs. realized affinity and for reasoning about affinity (or friendship) at multiple resolutions (4D, 3D, 2D, 1D). Finally, we construct spatial-temporal affinity signatures that classify memory behavior and are used to reason about changes in software (data relayout, code refactoring) or hardware (caching, prefetching). We describe methods for visualizing, interpreting, and quantitatively comparing signatures. We evaluate our methodology using applications with variants that contrast data structures, data layouts, and algorithms. We show that spatial-temporal affinity analysis provides novel insights and enables predictive reasoning about application performance.

1 Introduction

In current HPC applications, the memory system is frequently a significant source of performance bottlenecks [29, 44] that affects nearly all platforms, whether CPUs, GPUs, or heterogeneous combinations [8]. Addressing memory system bottlenecks requires tools for diagnosing them, characterizing application-platform suitability, and evaluating memory systems and designs. An open problem in data locality analysis is how to concisely represent locality in a way that both predicts performance and distinguishes between classes of behavior such as access patterns, cache utilization, and hardware prefetching.
Prior work on data locality includes metrics such as access frequency, reuse distance and footprint [28, 45, 46]. These metrics are defined with respect to a single memory location. Multiple efforts [4, 11, 12, 19, 27] have reported on the limitations of these metrics, such as their focus on temporal locality and their inability to capture access patterns and guide layout optimizations. Others [4, 11, 12, 27] explore capturing spatial-temporal locality to address the limitations of reuse distance by adding analyses at multiple granularities.
We argue that answering the more general question “What location j is likely to be accessed within the vicinity of an access to location i?” provides a precise and comprehensive view of the application’s spatial-temporal locality. Previous efforts attempt to answer this question, but they limit the problem to specific temporal windows in order to pack correlated objects into a cache line. Reference affinity [26, 47, 48, 49] pairs/splits arrays and structures using similarities in their reuse distance values. Miucin and Fedorova [30] classify objects and object fields into communities based on correlation in access patterns. A limitation of these methods is that, with a single metric, it is difficult to distinguish when it is more important to prioritize optimizations that improve caching vs. those that improve prefetching.
We introduce spatial-temporal affinity metrics that quantify spatial-temporal locality between pairs of memory locations. The three metrics — interval, anticipation, and density — highlight different dimensions of pair-wise locality, respectively: temporal access proximity, forward reference correlation (cf. prefetching), and nearby reference correlation (cf. caching). We describe intuitive rules that guide optimizations based on these spatial-temporal affinity metrics. Affinity metrics enable the identification of friendship or affinity clusters of related memory locations or data objects that guide decisions on object allocation, data layouts, code refactoring, caching, or prefetching.
Affinity analysis has similarities to both statistical correlation and market basket analysis used in recommender systems. As a kind of correlation, naive affinity analysis based on pair-wise comparisons requires quadratic space and time. Therefore, narrowing the space of interesting possibilities is important. We develop efficient location-based, multi-resolution zooming to find hot memory regions by access density and access frequency. As a result, our affinity analysis considers a small fraction of the possible combinations. Other forms of pattern analysis [2, 5, 7, 25] can be used to find correlated memory locations, but they use fundamentally different techniques with large time and space requirements.
We evaluate spatial-temporal affinity analysis using several representative applications. We show that affinity metrics are more informative and predictive of memory behavior than several existing locality metrics, and demonstrate that the signatures are predictive of memory performance. Our contributions are as follows:
We introduce novel measures (section 3) of spatial-temporal affinity between pairs of memory locations that characterize the degree of temporal access proximity, forward reference correlation, and nearby reference correlation.
We describe methods (section 4) for distinguishing between potential vs. realized affinity and for reasoning about affinity at multiple resolutions (4D, 3D, 2D, 1D). We also present interpretive guidelines and comparison methods.
We develop the MemFriend tool (memory friendship; sections 5 and 6), a new module for MemGaze  [18, 20]. MemFriend constructs spatial-temporal affinity signatures that classify memory behavior and are more predictive of expected performance than single-location reuse analysis.
We evaluate spatial-temporal affinity (section 7) by comparing against state-of-the-art metrics for several benchmarks whose variants differ only in 1) data structures, 2) data layouts, or 3) algorithms.

2 Motivating Multiple Affinity Relations

In this paper, we develop multiple metrics to answer the question “What location j is likely to be accessed within the vicinity of an access to location i?” As shown in Figure 1, each of the three metrics highlights a different dimension of pair-wise locality: interval, or temporal reference (access) proximity; anticipation, or forward reference correlation (cf. prefetching); and density, or nearby reference correlation (cf. caching). Spatial-temporal affinity can therefore be viewed as a 3-tuple on 3 axes. Spatial interval is the temporal length (in accesses) of a reference interval between the pair. Spatial anticipation and density can be thought of as two different measures of the conditional probability of block j occurring given block i, computed as a ratio of accesses to j relative to i. Thus, we usually adopt the syntax (j|i). We call i the reference location and j the affinity location.
We provide the following guidelines for using spatial-temporal affinity to reason about different situations and performance opportunities.
1. We expect high performance when a block and its spatially contiguous blocks have (a) good temporal proximity, i.e., low values of interval(i, j) and (b) good correlation, i.e., high anticipation(j|i) and/or density(j|i).
2. There are opportunities for data relayout when a block i and its correlated j have unfavorable spatial separation with good temporal proximity, i.e., (a) low interval(i, j) for large positive j or small/large negative j and (b) good reference correlation, i.e., high values of anticipation(j|i) and/or density(j|i). Specifically, these conditions identify when a block pair’s logical spatial locality is not aligned with the actual spatial locality.
3. There is potential for code refactoring, such as bringing two time-separated statements together, to exploit latent locality (long-distance reuse) when there are (a) hot blocks with (b) poor temporal proximity, i.e., high interval(i, j) and (c) poor reference correlation, i.e., low anticipation(j|i) and density(j|i).
4. There is potential to insert impactful software prefetching (to preload j, after access to i) when there is (a) poor temporal proximity, i.e., high interval(i, j) to permit a sufficient interval for the prefetching distance window and (b) high correlation between accesses to i and j, i.e., high anticipation(j|i).
5. There is potential for increasing reuse when a block i has (a) poor temporal proximity to itself, i.e., high interval(i, i) and (b) poor nearby reference correlation, i.e., low value of density(i|i).
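To make the guidelines concrete, the following minimal Python sketch encodes guidelines 1-4 as executable checks. The numeric thresholds are illustrative assumptions only (the guidelines are qualitative); guideline 5 applies the same logic to the self pair (i, i).

```python
def suggest(interval, anticipation, density, adjacent, hot):
    """Classify one (i, j) affinity tuple into a candidate action.

    interval:     SI(i, j), accesses between i and the next j (lower is better)
    anticipation: SA(j|i) in [0, 1) (higher is better)
    density:      SD(j|i) in [0, 1) (higher is better)
    adjacent:     True if j is spatially contiguous with i
    hot:          True if block i is frequently accessed
    """
    GOOD_SI, POOR_SI = 8, 64   # hypothetical proximity cutoffs
    HIGH, LOW = 0.5, 0.1       # hypothetical correlation cutoffs

    correlated = anticipation >= HIGH or density >= HIGH
    if interval <= GOOD_SI and correlated and adjacent:
        return "guideline 1: expect high performance"
    if interval <= GOOD_SI and correlated and not adjacent:
        return "guideline 2: candidate for data relayout"
    if hot and interval >= POOR_SI and anticipation < LOW and density < LOW:
        return "guideline 3: candidate for code refactoring"
    if interval >= POOR_SI and anticipation >= HIGH:
        return "guideline 4: candidate for software prefetching of j"
    return "no clear opportunity"
```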
Figure 1: (Left) Affinity metrics and their interpretation. Both anticipation and density adopt the syntax of conditional probability. (Right) Two examples of 1D affinity signatures; specifically, affinity vectors are shown as radar plots.

3 Spatial-Temporal Affinity

This section defines three location-centric spatial affinity metrics—interval, anticipation, and density—that can be applied at different resolutions.

3.1 Preliminaries

We use the following well-known memory access metrics.
Accesses, A(b). Total number of accesses to block b.
Access intensity, AI(b). Normalized access fraction (in [0,1]) of block b. Given accesses to blocks in a region r, let \(\textsf {A} _{\textsf {max}}\) be the maximum access count over those blocks. Then, \(\textsf {AI} (b) = \frac{ \textsf {A} (b) }{\textsf {A} _{\textsf {max}}}\).
Reuse distance, RD(b). Number of unique blocks accessed between consecutive accesses to block b; we use the average \(\textsf {RD}\) over all reuses of b.
Reuse interval, RI(b). Number of (non-unique) accesses encountered between consecutive accesses to block b; we use the average \(\textsf {RI}\) over all reuses of b.
We introduce and use the following two definitions:
Affinity block pair, (i, j). Affinity is measured between a reference block i and an affinity block j.
Lifetime, L(b). Number of (non-unique) accesses between the first and last access to block b.
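As a concrete illustration, the following minimal Python sketch computes these preliminary metrics over a trace given as a sequence of block ids. It follows the definitions above and is not MemFriend's implementation.

```python
from collections import Counter, defaultdict

def preliminary_metrics(trace):
    """Per-block A, AI, average RD, average RI, and lifetime L."""
    A = Counter(trace)
    a_max = max(A.values())
    AI = {b: A[b] / a_max for b in A}              # access intensity in (0, 1]

    first, last = {}, {}
    rd_sum, ri_sum = defaultdict(int), defaultdict(int)
    for t, b in enumerate(trace):
        if b in last:
            between = trace[last[b] + 1 : t]
            rd_sum[b] += len(set(between))         # unique blocks between reuses
            ri_sum[b] += len(between)              # all accesses between reuses
        else:
            first[b] = t
        last[b] = t

    RD = {b: rd_sum[b] / (A[b] - 1) for b in A if A[b] > 1}
    RI = {b: ri_sum[b] / (A[b] - 1) for b in A if A[b] > 1}
    L = {b: last[b] - first[b] + 1 for b in A}     # first-to-last access span
    return A, AI, RD, RI, L

# Block 'a' is reused twice, with one and two accesses in between.
A, AI, RD, RI, L = preliminary_metrics(list("abacda"))
assert RI["a"] == 1.5 and L["a"] == 6
```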

3.2 Spatial interval

Spatial interval is a generalization of block reuse interval that measures temporal access proximity, where a memory access sequence is an abstract form of time. Spatial interval is important because spatial locality is most beneficial when it occurs within small time windows. Whether it is accessing all elements of a cache line, two distinct cache blocks, or locations within a DRAM module that can be indexed together, the time window must be short enough to be beneficial.
Spatial interval is an integer. Smaller is generally better as it means that block j is accessed more quickly after i.
Definition. For memory block pair (i, j), the spatial interval \(\textsf {SI}_1 (i,j)\) is the number of accesses (≥ 0) between an access to block j and the first previous access to block i. The interval is directional in that accesses progress forward from block i to j according to program execution. (When determining interval size, we exclude the accesses to i and j.)
Usually, it is more convenient to use the average spatial interval.
Definition. For a memory access sequence and a pair of memory blocks, let \(\mathcal {S\!I} (i,j)\) represent the set of access intervals that partition the sequence. That is, each element α = (a, b) in \(\mathcal {S\!I} (i,j)\) is an interval \(\textsf {SI}_1 (i,j)\) with access indices a and b, where a and b access blocks i and j, respectively. Then, the average spatial interval \(\textsf {SI} (i,j)\) is
\begin{equation}\textsf {SI} (i,j) = \frac{ \sum _{\alpha \in \mathcal {S\!I} (i,j)} \textsf {SI}_1 (\alpha) }{ |\mathcal {S\!I} (i,j)| }\end{equation}
(1)
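A minimal sketch of the interval-set construction, assuming the trace is a Python sequence of block ids: pairing each access to j with the nearest previous (unconsumed) access to i yields intervals that partition the sequence, per the definition above.

```python
def spatial_intervals(trace, i, j):
    """The interval set SI(i, j), excluding the endpoint accesses to i and j."""
    if i == j:  # self intervals: gaps between consecutive accesses to i
        pos = [t for t, b in enumerate(trace) if b == i]
        return [q - p - 1 for p, q in zip(pos, pos[1:])]
    intervals, start = [], None
    for t, b in enumerate(trace):
        if b == i:
            start = t                     # "first previous access to i"
        elif b == j and start is not None:
            intervals.append(t - start - 1)
            start = None                  # intervals partition the sequence
    return intervals

def SI(trace, i, j):
    """Average spatial interval (eq. 1); None if the pair never co-occurs."""
    s = spatial_intervals(trace, i, j)
    return sum(s) / len(s) if s else None
```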

3.3 Spatial anticipation

Spatial anticipation \(\textsf {SA} (j|i)\) captures directional spatial locality or the probability that j is accessed after i. It is a real number within [0,1), where higher is better.
Definition. For block pair (i, j), spatial anticipation is the ratio of a) the number of intervals in the spatial interval set \(\mathcal {S\!I} (i,j)\) to b) the access count \(\textsf {A} (i)\).
\begin{equation}\textsf {SA} (j|i) = \frac{ |\mathcal {S\!I} (i,j)| }{ \textsf {A} (i) }\end{equation}
(2)
Since \(\textsf {SA} (j|i)\) defines anticipation of j after i, it intuitively relates to prefetching (both as a measure of prefetch-friendliness and as guidance for preloading).
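Given the interval set from the sketch above, spatial anticipation reduces to a ratio; a one-function sketch:

```python
def SA(trace, i, j):
    """Spatial anticipation (eq. 2): the fraction of accesses to i that begin
    an interval ending at j; reuses spatial_intervals from the sketch above."""
    return len(spatial_intervals(trace, i, j)) / trace.count(i)

# Two of the three accesses to 'a' are followed by a matching 'b'.
assert SA(list("aabab"), "a", "b") == 2 / 3
```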

3.4 Spatial density

Spatial density emphasizes the hotness of j over the lifetime of i. It is the probability of accessing block j between accesses to block i. It is a real number within [0,1), where higher is better.
Definition. For block pair (i, j), spatial density \(\textsf {SD} (j|i)\) is the ratio of a) accesses to j within b) the self-spatial interval of i:
\begin{align}\textsf {SD} (j|i) & = \underset {\alpha \in \mathcal {S\!I} (i,i)}{\textsf {avg}} \frac{ \textsf {A} _\alpha (j) }{ \textsf {SI} _\alpha (i, i) }\end{align}
(3)
\begin{align} & \approx \frac{ \textsf {A} (j) }{ \sum _\alpha \textsf {SI} _\alpha (i,i) } = \frac{ \textsf {A} (j) }{ \sum _\alpha \textsf {L} _\alpha (i)}\end{align}
(4)
\begin{align} & \approx \frac{ \textsf {A} (j) }{ \textsf {L} (i)}\end{align}
(5)
where \(\textsf {A} _\alpha (j)\) is the number of accesses to j within interval α.
Equation (5) approximates eq. (3) in two ways. First, the sum of fractions in eq. (3) is converted into separate sums in the numerator and denominator. Second, it approximates the denominator in eq. (4) with the lifetime of i, which is equal to \(\textsf {A} (i)\) + \(\sum _\alpha \textsf {SI} _\alpha (i,i)\).
\(\textsf {SD} (j|i)\) defines nearby reference correlation of j to i, and is used to measure caching under the assumption of a fully associative cache. Note that footprint [45, 46] adequately analyzes caching and associativity; however, footprint is by definition based on time windows, whereas \(\textsf {SD} (j|i)\) is defined with respect to accesses to location i and captures spatial locality.
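A sketch of spatial density using the lifetime approximation of eq. (5); the exact per-interval average of eq. (3) would additionally require counting A_α(j) within each self-interval of i.

```python
def SD(trace, i, j):
    """Spatial density via the lifetime approximation (eq. 5)."""
    pos = [t for t, b in enumerate(trace) if b == i]
    lifetime = pos[-1] - pos[0] + 1    # L(i) = A(i) + sum of self-intervals of i
    return trace.count(j) / lifetime
```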

3.5 Weighted affinity

In practice, it is most important to understand spatial affinity for memory blocks that are frequently accessed.
Definition. For memory block pair (i, j), we define a weighting factor in (0, 1] based on access intensity \(\textsf {AI} (i)\) (normalized access frequency) within memory region r.
We represent weighted \(\textsf {SA}\) and \(\textsf {SD}\) by scaling the respective metric in an obvious manner: \(\textsf {AI} (i) \cdot \textsf {SA} (j|i)\) or \(\textsf {AI} (i) \cdot \textsf {SD} (j|i)\).
Note that the weighting factor applies to the reference location (i) only.
The resulting weighted affinities have desirable properties. If \(\textsf {SA}\) (or \(\textsf {SD}\)) is large but i is cold, the weighted value will be small, which is correct. Alternatively, if i is hot but \(\textsf {SA}\) (or \(\textsf {SD}\)) is small, the weighted value will again be small, as desired.
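The weighting itself is a one-line scaling; the sketch below simply encodes the property discussed above (the values are illustrative, not measurements).

```python
def weighted(AI_i, value):
    """Weighted SA or SD: scale a pair metric by AI(i) in (0, 1]."""
    return AI_i * value

# A strong metric on a cold block, or a weak metric on a hot block,
# both yield a small weighted affinity, as desired.
assert weighted(0.05, 0.9) < 0.1 and weighted(0.9, 0.05) < 0.1
```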

4 Spatial Affinity Scores

As previously observed (section 2), relating spatial interval with each of spatial anticipation and density yields powerful interpretive insights. For example, good spatial anticipation is more impactful when spatial interval is also good. To quantify these insights, we develop affinity scores \(\textsf {SD}^{*}\) and \(\textsf {SA}^{*}\) that coordinate the strengths of each of these affinity relations. For example, \(\textsf {SD}^{*} (j|i)\) captures pair-wise \(\textsf {SD} (j|i)\) and ‘goodness’ of \(\textsf {SI} (i,j)\) with a single metric.

4.1 Scores

To define scores, we characterize the goodness of spatial intervals with a coefficient γ(i, j), based on the observation that smaller \(\textsf {SI}\) is generally better. The coefficient maps smaller intervals to higher values. We then define affinity scores with respect to an affinity block pair (i, j) and its γ(i, j).
Definition. For each affinity pair (i, j), the spatial anticipation score \(\textsf {SA}^{*} {}(j|i)\) is
\begin{equation}\textsf {SA}^{*} {}(j|i) = \gamma (i,j) \textsf {SA} (j|i)\end{equation}
(6)
Definition. Similarly, for each location i and its affinity block j, \(\textsf {SD}^{*} {}(j|i)\) is
\begin{equation}\textsf {SD}^{*} {}(j|i) = \gamma (i,j) \textsf {SD} (j|i)\end{equation}
(7)
The goodness coefficient is based on the goodness rank g(i, j) for the spatial interval \(\textsf {SI} (i,j)\). The rank ranges from 1 to \(n_\textsf {r}\) (lower is better); we often choose \(n_\textsf {r} = 5\) to cover interval values in each sample of memory access sequences in the trace. Since a smaller \(\textsf {SI}\) is better, we define the rank as:
\begin{equation}g (i,j) = \textsf {min}\left(n_\textsf {r}, \left \lfloor \frac{\textsf {SI} (i,j) }{n_\textsf {si}} + 1 \right \rfloor \right)\end{equation}
(8)
We suggest two ways to set \(n_\textsf {si}\) that are independent of workload. One is relative to the load queue size (loads in flight). Another is the number of loads needed to fill some fraction of cache lines. Our experiments use 1/4 of the load queue size.
The goodness coefficient γ(i, j) maps better ranks to higher values in (0, 1]:
\begin{equation}\gamma (i,j) = \frac{n_\textsf {r} - g (i,j)+1 }{n_\textsf {r}}\end{equation}
(9)
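Putting eqs. (6)-(9) together, a sketch of the score computation that reuses SI and SA from the earlier sketches; the default n_si = 16 is a hypothetical stand-in for 1/4 of a load-queue size, which is hardware-specific.

```python
def gamma(si, n_r=5, n_si=16):
    """Goodness coefficient (eqs. 8-9)."""
    g = min(n_r, int(si // n_si) + 1)   # goodness rank, 1..n_r (lower is better)
    return (n_r - g + 1) / n_r          # better ranks map to values near 1

def SA_star(trace, i, j):
    """Spatial anticipation score (eq. 6); SD* (eq. 7) is analogous."""
    si = SI(trace, i, j)
    return 0.0 if si is None else gamma(si) * SA(trace, i, j)
```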

4.2 Multi-resolution scores

To facilitate top-down analysis using scores, we extend scores to multiple resolutions, enabling both rapid high-level comparisons and detailed understanding. We aggregate the affinities for each individual reference block as a weighted sum, and then aggregate that for a region of blocks.
For each reference block i in region r, the aggregate spatial anticipation and density scores are, respectively, \(\textsf {SA}^{*} {}(i)\) and \(\textsf {SD}^{*} {}(i)\):
\begin{equation}\textsf {SA}^{*} {}(i) = \sum _{j \ne i} \textsf {AI} (i) \cdot \textsf {SA}^{*} {}(j|i)\end{equation}
(10)
\begin{equation}\textsf {SD}^{*} {}(i) = \sum _{j} \textsf {AI} (i) \cdot \textsf {SD}^{*} (j|i)\end{equation}
(11)
where \(\textsf {AI} (i)\) is the affinity weight (section 3.5).
The corresponding scores for region r are:
\begin{equation}\textsf {SA}_{r}^{*} = \sum _{i} \sum _{j} \textsf {AI} (i) \cdot \textsf {SA}^{*} (j|i)\end{equation}
(12)
\begin{equation}\textsf {SD}_{r}^{*} = \sum _{i} \sum _{j} \textsf {AI} (i) \cdot \textsf {SD}^{*} (j|i)\end{equation}
(13)
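A sketch of the aggregation in eqs. (10)-(13), assuming the per-pair scores have already been computed into a dictionary:

```python
from collections import defaultdict

def block_and_region_scores(pair_scores, AI, include_self=False):
    """Per-block and region scores (eqs. 10-13). pair_scores maps
    (i, j) -> SA*(j|i) or SD*(j|i); AI maps i -> access intensity.
    Eq. (10) for SA* excludes the self pair; pass include_self=True
    for SD* (eq. 11)."""
    per_block = defaultdict(float)
    for (i, j), s in pair_scores.items():
        if include_self or i != j:
            per_block[i] += AI[i] * s                    # eqs. (10)-(11)
    return dict(per_block), sum(per_block.values())      # eqs. (12)-(13)
```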

4.3 Potential vs. realized score

We distinguish between potential and realized affinity to focus on possible candidates for data layout optimizations. Realized affinity represents affinity in the current layout; potential affinity indicates what could be realized with changes to access order or data layout. For a reference location i, the realized score includes only a subset of affinity locations j, namely hot-lines and ‘relevant contiguous locations’ (\(\textsf {SA}^{*}\): +1, +2 offsets; \(\textsf {SD}^{*}\): −1, +1 offsets, and ‘self’), whereas the potential score includes all affinity locations.
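As a sketch, the realized score can be obtained by filtering the pair scores before aggregation; the offset sets below follow the text, and hot_lines is assumed to be a precomputed set of hot block ids.

```python
def realized_subset(pair_scores, hot_lines, metric="SA"):
    """Keep hot-lines plus the 'relevant contiguous' offsets:
    +1/+2 for SA*; -1/+1 and self for SD*. Offsets are j - i in
    block units; the potential score instead keeps all pairs."""
    offsets = {1, 2} if metric == "SA" else {-1, 0, 1}
    return {(i, j): s for (i, j), s in pair_scores.items()
            if j in hot_lines or (j - i) in offsets}
```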

5 Affinity Signatures

We develop affinity signatures that concisely represent spatial-temporal affinity in accesses, enable the use of our diagnostic rules to reason about likely performance, and enable comparison between variants and/or different data structures of applications.

5.1 Affinity heatmap

Figure 2: Affinity heatmap visualization and signature. Given (a) a memory layout and affinity matrix, (b) the visualization transforms j indices. Contiguous blocks b0-b3 are converted to relative offsets (+/- from i; b1 is +1 for b0, and b1 is -1 for b2), shown at the bottom, and hot blocks without a spatial relation to reference blocks are shown at the top. MemFriend's heatmaps display sampled offset locations within a range (+/- 256) and hot lines in execution. The affinity signature (c) combines heatmaps for all three metrics \(\textsf {SD}\), \(\textsf {SA}\) and \(\textsf {SI}\). Each (i, j) pair has a fixed (aligned) position in all three heatmaps.
For a given affinity metric, an affinity heatmap is a matrix that represents three affinity dimensions: reference location i, affinity location j, and the affinity value for the pair (i, j). A signature can include heatmaps for multiple metrics and is thus 4D.
Consider the affinity matrix in fig. 2(a). Matrix columns denote reference locations (i), rows denote affinity locations (j), and each (i, j) cell represents a metric value. Matrix rows for j are organized into two bands. The lower band shows contiguous locations that are spatially related to block i and demonstrates locality among contiguous locations. The upper band shows other blocks that highlight opportunities for reorganization, for example, hot blocks in execution, pages, or regions.
Figure 2(b) shows the transformation into a heatmap. Note that in this example, the cell colors represent the transformation rather than actual metric values. The heatmap's affinity locations in the lower band are transformed to relative addresses, i.e., offset (+/-) locations. Blocks b0-b3 that are spatially contiguous are represented by relative offset locations (b1 is +1 for b0, and b1 is -1 for b2). Hot blocks r1 and r2 are shown as absolute locations.
Figure 2(c) shows the signature, a combination of three heatmaps. MemFriend's visualization shows the \(\textsf {SD}\), \(\textsf {SA}\) and \(\textsf {SI}\) heatmaps separately rather than combined scores because this a) allows quick estimation of spatial-temporal locality based on \(\textsf {SI}\) and b) preserves the qualifying distinction implied by \(\textsf {SI}\) for both \(\textsf {SD}^{*}\) and \(\textsf {SA}^{*}\). The signature includes the affinity value dimension, resulting in a 4D figure.
Each of the three heatmaps (\(\textsf {SD}\), \(\textsf {SA}\) and \(\textsf {SI}\)) in fig. 2c shows reference locations that are sampled and ordered by hotness, and affinity locations that are sampled to capture most affinity pairs. Each (i, j) pair has a fixed location across the three heatmaps. The upper band shows hot-lines as absolute locations. Hot-lines are a subset of the hottest (cache-line sized) blocks in the application and highlight temporal locality as well as opportunities for reorganization.
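A sketch of the row transform behind fig. 2(b), assuming block ids are integers so that j - i gives the offset; the +/- 256 sampling range follows the figure caption.

```python
def heatmap_coord(i, j, hot_lines, max_offset=256):
    """Map an affinity location j to its heatmap row: hot lines keep
    absolute addresses (upper band); other locations become signed
    offsets from the reference block i (lower band), sampled within
    +/- max_offset; None means the pair is outside the sampled range."""
    if j in hot_lines:
        return ("hot", j)
    off = j - i
    return ("offset", off) if abs(off) <= max_offset else None
```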

5.2 Comparing affinity signatures

It is important to compare signatures between variants of an application. To compare signatures, we introduce an alternative 3D/4D representation along with methods for condensing signatures into lower dimensional representations.

5.2.1 Affinity signal.

An affinity signal shows the same three dimensions as the affinity matrix but changes the grouping. Figure 5 shows a signal plot with four subplots. The top three subplots show the most important affinity locations (j), i.e., relative offsets that are essential to study reuse and the impact of prefetching. These subplots show scores \(\textsf {SD}^{*}\) and \(\textsf {SA}^{*}\) that integrate \(\textsf {SI}\) for each (i, j) pair (section 4). The fourth subplot shows the access frequency of i to focus on access distribution between blocks. These plots show affinity values for all observed reference locations and are ordered by address locations.

5.2.2 Affinity histogram.

To give a high-level perspective, an affinity histogram shows the distribution of affinity values. To compare impacts, we focus on affinity scores and prefer a continuous probability density function (PDF). Figure 6 shows a histogram as a PDF for each spatial affinity score. The result condenses three dimensions of the signature, namely the range of affinity metrics and both affinity dimensions (the reference and affinity axes).
For example, if the \(\textsf {SD}^{*}\) value for pair (2,3) is 0.3, it is represented in the histogram bin for \(\textsf {SD}^{*}\) = 0.3. To pinpoint critical affinity relations, histograms focus on the hot-cluster section of the heatmap, which covers significant reference locations (a hot subset of i) and all their affinity locations. Both \(\textsf {SD}^{*}\) and \(\textsf {SA}^{*}\) are real numbers, with high values indicating better affinity; hence a higher number of block pairs in high-valued score bins indicates better locality.

5.2.3 Affinity vector.

An affinity vector represents each affinity score as a scalar, where each scalar compresses a single histogram. The scalar includes the importance of i, as defined in weighted affinity (section 3.5).

6 Affinity Analysis and Zooming

As a pair-wise analysis, naive affinity analysis requires quadratic space and time, \(O(A^2)\), where A is the number of unique memory blocks in a trace. This type of analysis, which views addresses at a single resolution, is practical only when address blocks are large, i.e., objects (regions) or pages. For efficient analysis at finer block granularities, i.e., words or cache lines, we use multi-resolution analysis that focuses on smaller but significant (hot) segments within regions of interest identified using location-based zooming.
Figure 3: Affinity analysis methods and location zooming. (a) Single resolution focuses on regions of interest (red boxes) for region-level analysis. (b) Multi-resolution analysis chooses a hot sub-region within the regions of interest (blue box in (a)) for block-level analysis. (c) Location zooming shows recursive filtering for finding regions of interest A1, A2, C1, H1 and H2. The sample access sequence shows expanded access to blocks b1, b2 and b3 in sub-region A21. Single-resolution analysis considers affinity between A1, A2, C1, H1 and H2. Multi-resolution analysis considers affinity between blocks in sub-region A21 and significant locations.

6.1 Multi-resolution analysis & complexity

Figure 3(a) shows single-resolution and fig. 3(b) shows multi-resolution analysis. Within the entire memory region (A blocks), single-resolution analysis is applicable at the region level for the regions of interest shown in red boxes. Within the regions of interest, multi-resolution analysis chooses a hot sub-region (R blocks), marked in a blue box in fig. 3(a), and applies (cache-line sized) block-level resolution.
Multi-resolution analysis significantly reduces analysis complexity in practice. The number of blocks A in the entire memory region is far greater than the blocks R in sub-regions. Blocks in excluded regions are consolidated into segments that cover the region and are considered as significant locations in the analysis. For a sub-region with R blocks, time and space complexity for multi-resolution analysis is O(NR), where N includes R and other significant locations.

6.2 Zooming

To find regions of interest in the memory access trace, we use zooming as shown in fig. 3(c). Zooming uses a recursive tree structure. First, the entire memory region is divided into heap and stack. The heap segment is chosen as the root node for further examination. Zooming proceeds top-down and counts access frequencies at each level. Zooming identifies two types of hot regions: hot-contiguous and hot-access.
To find hot-contiguous regions, each examined segment is divided into fixed-size (sc) chunks. Contiguous chunks with an aggregated access frequency of at least t% of the parent's accesses are identified as sub-regions, e.g., A, B, and C. t is a controllable parameter; we typically use t = 10 based on empirical evidence. Sub-regions are analyzed at further levels using recursively reduced sizes for sc. Zooming ends when sc reaches a specific (configurable) size. The final hot-contiguous regions in fig. 3(c) are A1, A2, and C1.
To find hot-access regions, regions with high access counts are identified by filtering hot instructions and their associated regions from the trace. Qualifying sub-regions H1 and H2 with size sc are identified using the same access frequency policy, where segments with t% of the parent's accesses qualify as a child.
The final regions of interest are A1, A2, C1, H1, and H2. From these regions, multi-resolution analysis is applied to blocks in sub-region A21, and significant locations can include A1, A2, C1, H1, and H2.
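A minimal sketch of hot-contiguous zooming, assuming the trace is a sequence of integer block addresses; the 4x chunk-size reduction per level and the per-chunk (rather than contiguous-run) qualification are simplifying assumptions.

```python
def zoom(trace, lo, hi, sc, sc_min, t=0.10):
    """Recursively find hot sub-regions of [lo, hi): qualify chunks of
    size sc holding at least a fraction t of the parent's accesses, and
    recurse with a reduced chunk size until sc reaches sc_min.
    t = 0.10 follows the paper's typical setting."""
    parent = sum(1 for a in trace if lo <= a < hi)
    if parent == 0:
        return []
    if sc <= sc_min:
        return [(lo, hi)]                      # zooming ends at minimum size
    hot = []
    for c in range(lo, hi, sc):
        child = sum(1 for a in trace if c <= a < c + sc)
        if child >= t * parent:
            hot += zoom(trace, c, c + sc, max(sc // 4, sc_min), sc_min, t)
    return hot
```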

7 Evaluation

We evaluate spatial affinity metrics using a set of representative HPC application benchmarks that vary data structures, storage formats, data layouts and algorithms. We describe the novel insights gained from affinity analysis. As a baseline, we compare against reuse distance (\(\textsf {RD}\)) and show the effectiveness of affinity metrics in distinguishing application performance.
Benchmarks. We study variants (OpenMP implementations) in the benchmarks listed below.
Reordering of sparse tensors using HiParTI-HICOO [24]
Sparse matrix storage formats using SpMM kernel in HiParTI-HICOO [23]
Graph clustering using miniVite [10]
Large-language models (transformers) via Alpaca.cpp [9]
Particle transport algorithm using XSBench [41]
Trace collection. We conduct experiments on an Intel Xeon® Gold 6126 and an Intel Core i9-12900KF (Alder Lake hybrid) with 8 Performance-cores and 8 Efficient-cores. Binaries are instrumented using the open-source MemGaze [20] framework to capture memory access traces based on processor-tracing hardware. A mapping table that translates the instrumented binary to the original is used to attribute regions to data objects.
A trace consists of samples of access sequences (~250) and includes instruction addresses, memory addresses, and sample IDs. The length of sequences in a sample, as well as trace size, depends on the sampling rate. The sampling rate varies between benchmarks: miniVite uses 5M, HiParTI-HICOO uses 4M, and XSBench uses 5M. Trace collection imposes overheads from 2.5× to 6×. Trace files for HiParTI-HICOO matrix variants with 220K to 4.7M access sequences have sizes from 11 MB to 233 MB.
Analysis Overhead. Affinity analysis time depends on the number of sub-regions considered for multi-resolution analysis and on trace size. For the HiParTI-HICOO matrix variants with trace sizes from 11 MB to 233 MB and 10 sub-regions, analysis runtime ranges from 2 seconds to 20 seconds. Signature visualization of heatmaps, histograms, and signals takes ~20 seconds.

7.1 HiParTI-HICOO

To study sparse matrix storage formats and tensor data layouts, we choose the HiParTI suite [22]. In HiParTI, SpMM kernels incorporate matrix storage formats such as compressed sparse row (CSR), coordinate (COO), and HICOO [23]. It also includes multiple tensor reordering variants [24]. HICOO is a compressed block storage format for sparse tensors and matrices; it derives from and improves upon the COO format.

7.1.1 Tensor reordering variants.

For computational efficiency, sparse tensor data is generally reordered (indices relabeled) to improve data access locality. We analyze locality patterns of tensor reordering variants: Default (no reorder), Random (random order), BFS (breadth-first-search-like heuristic), and Lexi (lexicographic order) [24]. All variants are integrated in an MTTKRP kernel implementation with the HICOO storage format. The benchmark is run with nell-2 [35], a third-order tensor with 77M nonzeros and a density of \(2.4 \times 10^{-5}\).
Figure 4: HiParTI-HICOO tensor reordering variants: Affinity heatmaps for factor matrices & output region.
Table 1: HiParTI-HICOO tensor reordering variants: Data locality and affinity vector.

| Region | Variant | Run time | \(\textsf{A}\) | \(\textsf{RD}\) | Realized \(\textsf{SA}_{r}^{*}\) | Realized \(\textsf{SD}_{r}^{*}\) | Potential \(\textsf{SA}_{r}^{*}\) | Potential \(\textsf{SD}_{r}^{*}\) |
|---|---|---|---|---|---|---|---|---|
| factor matrices & output | Default | 9.5 s | 1.83M | 2.53 | 0.3 | 0.15 | 0.4 | 0.2 |
|  | Random | 9.0 s | 1.81M | 2.67 | 0.4 | 0.2 | 0.5 | 0.2 |
|  | BFS | 4.4 s | 1.84M | 5.07 | 2.9 | 0.5 | 3.5 | 0.5 |
|  | Lexi | 3.4 s | 1.81M | 3.24 | 3.3 | 0.7 | 5.3 | 0.8 |
Figure 5: HiParTI-HICOO tensor reordering variants: Affinity signal plots for a section of blocks in factor matrices & output region.
Figure 6: HiParTI-HICOO tensor reordering variants: Affinity score histograms, distribution of block pairs in hot-cluster for factor matrices & output region.
Aff. Heatmap. Figure 4 shows affinity heatmaps for the hottest memory region. The region includes two objects, factor matrices & output. The heatmaps show different affinity patterns across reference locations (horizontal axis) for each variant. We make four important observations from the heatmaps.
First, consider the \(\textsf {SD}\) metric. Figures 4a and 4b show the Default and Random variants. In these signatures, notice that (a) the shaded box in the \(\textsf {SD}\) heatmap shows sparse and irregular affinity, (b) the shaded box in the \(\textsf {SI}\) heatmap for the same locations shows low (good) values, and (c) ① shows a large range of contiguous affinity locations (offsets on the vertical axis, +61 to -7 for Default and +42 to -15 for Random) that extend beyond the shaded box. These observations indicate that distant affinity locations are accessed at close intervals and that there is minimal correlation between accesses to a block and its neighbors. Hence the two variants have no spatial-temporal locality. Figures 4c and 4d show the BFS and Lexi variants. In these signatures, observe that (a) the shaded box in the \(\textsf {SD}\) heatmap shows high affinity to closely located offset locations, (b) the shaded box in the \(\textsf {SI}\) heatmap for the same locations shows increasing \(\textsf {SI}\) values with offset distance, and (c) the affinity neighborhood in ① shows a smaller range (+5 to -4 for BFS, and +7 to -7 for Lexi). This pattern of high affinity to adjacent locations with good intervals in BFS and Lexi exhibits good spatial-temporal locality and complies with guideline 1 in section 2 for good performance.
Second, consider \(\textsf {SD}^{*}\) for self. ② combines values from \(\textsf {SD}\) and \(\textsf {SI}\) heatmaps, and shows that \(\textsf {SD}^{*}\) is low across all variants. Though BFS and Lexi have slightly better \(\textsf {SD}^{*}\), the improvement is small. This suggests that although temporal reuse increases (along with caching opportunities) in BFS and Lexi, the impact will be minor.
Third, consider the \(\textsf {SA}\) metric. Figures 4a and 4b show the Default and Random variants. In these signatures, notice that (a) the shaded box in the \(\textsf {SA}\) heatmap shows sparse \(\textsf {SA}\) values, (b) the \(\textsf {SI}\) heatmap for the same locations shows low (good) values, and (c) ① shows that both \(\textsf {SA}\) and \(\textsf {SI}\) extend to a wider range of affinity locations. \(\textsf {SA}\) values are scattered between affinity locations with minimal access proximity between neighbors, suggesting that the access pattern is irregular and spatially sparse in Default and Random. In contrast, in Figs. 4c and 4d for BFS and Lexi, notice that (a) the shaded box in the \(\textsf {SA}\) heatmap shows better \(\textsf {SA}\) and high access proximity to adjacent locations, and (b) as discussed for \(\textsf {SD}\), both the \(\textsf {SI}\) values and the affinity neighborhood range comply with good performance. For BFS and Lexi, the observations point to irregular access but with good spatial-temporal locality.
Fourth, consider \(\textsf {SA}^{*}\) for the +1 offset location. ③, which combines values from the \(\textsf {SA}\) and \(\textsf {SI}\) heatmaps, shows that \(\textsf {SA}^{*}\) is high for both BFS and Lexi, whereas Default and Random have no measurable value. This highlights that BFS and Lexi have high anticipatory spatial-temporal locality to the +1 offset location and will benefit greatly from hardware prefetching.
We conclude that affinity metrics (a) distinguish the performance of the variants and (b) explain that the tensor reorderings primarily improve \(\textsf {SA}\) rather than \(\textsf {SD}\). The latter means that the reorderings depend primarily upon hardware prefetching for impact. Our experiments with disabled prefetchers confirm that runtime degrades by 8% to 28% for Lexi and BFS over a range of large tensors (nell-2 and nell-1 [35], freebase-music [17]).
Aff. Signal. We study affinity over the memory address space for BFS and Lexi with affinity signal plots (Figs. 5a and 5b) that show \(\textsf {SA}^{*}\) and \(\textsf {SD}^{*}\) for offset locations (−1 to +1). We focus on \(\textsf {SA}^{*}\) values in the plots, as \(\textsf {SD}^{*}\) is low for all blocks in both variants. \(\textsf {SA}^{*}\) for +1 offset locations in BFS is flat at 0.25, whereas Lexi has a significant number of blocks at 0.5. This shows that Lexi has a higher number of blocks with more affinity to adjacent locations than BFS, and these blocks leverage the hardware prefetcher to decrease load latency. The access frequency plots at the bottom show that Lexi has less access frequency variation between blocks than BFS. This indicates that memory locations in Lexi are accessed more frequently with more use of the same location, consistent with more reuse and perhaps more memory-level parallelism. Again, we see that \(\textsf {SA}^{*}\) clearly explains the better performance of Lexi.
Aff. Histogram. Figures 6a and 6b show the distribution of affinity pairs for \(\textsf {SA}^{*}\) and \(\textsf {SD}^{*}\) for all variants. In these plots, a distribution is better when it has more area that is skewed “up and to the right”. Figure 6a shows that Lexi has more affinity pairs (upper section) and affinity pairs at all score levels, including high values (right section). Though Default and Random have a few affinity pairs at high score values, their weighted affinity (section 3.5) is low. The distribution in Fig. 6b shows that pairs in all variants are concentrated towards the lower \(\textsf {SD}^{*}\) values (left section). Though Lexi has a better \(\textsf {SD}^{*}\) distribution, the plot does not identify a clear winner. Thus we see that affinity heatmaps, signals, and histograms all explain the better performance of BFS and Lexi by focusing attention on the impact on \(\textsf {SA}^{*}\).
Aff. Vector. First, consider the baseline locality metric \(\textsf {RD}\): table 1 shows the best (lowest) values for the Default and Random variants and the worst value for BFS. We observe that \(\textsf {RD}\) can be highly misleading as an indicator of application performance.
Second, consider realized scores. Recall that the realized score quantifies affinity to adjacent locations in the current layout. \(\textsf {SA}_{r}^{*}\) in table 1 shows high values for Lexi and BFS, indicating the benefits of reordering. The lower realized \(\textsf {SA}_{r}^{*}\) values for Default and Random reflect the sparse layout and access pattern. Though realized \(\textsf {SD}_{r}^{*}\) is higher for Lexi, the minimal range of values does not distinguish between variants.
Third, consider potential scores. Recall that potential scores represent the affinity achievable under the assumption that the layout can be changed dynamically without cost (in practice, relayout introduces overheads). Potential vectors in table 1 also show high values for Lexi. In Lexi, the increase in the potential \(\textsf {SA}_{r}^{*}\) score is attributed to the −1 offset location. To explain, recall that the −1 offset location has high temporal proximity (lower \(\textsf {SI}\), high \(\textsf {SA}\) in Fig. 4d and high \(\textsf {SA}^{*}\) in Fig. 5b). It is possible that the −1 offset location is present in the cache from a prior access, so the actual realized affinity is higher. Potential \(\textsf {SD}_{r}^{*}\) values are similar to realized ones as \(\textsf {SD}\) values are concentrated towards hot-lines; as with realized \(\textsf {SD}_{r}^{*}\), the values are low and do not distinguish between variants.

7.1.2 Matrix variants.

We analyze memory locality effects of the sparse matrix storage formats CSR, COO and HICOO in the SpMM kernel. The SpMM kernel computes C = A × B, where A is a sparse matrix and B and C are dense matrices. In our configuration, A (blockqp1 [6]) has a density of \(1.77 \times 10^{-4}\) and the block structure shown in table 2; the number of columns in B is set to 4096. We focus on efficient parallel implementations and select the CSR, COO-Reduce, and HICOO-S variants for analysis. CSR parallelizes over rows of A. COO-Reduce parallelizes over the non-zeroes and uses a buffer for C. HICOO-S parallelizes over compressed blocks.
Table 2: HiParTI Matrix variants: Data locality and \(\textsf{SD}^{*}\).
Figure 7: HiParTI matrix variants: Affinity heatmaps for matrix B.
Aff. Heatmap. Figure 7 shows heatmaps for the hottest memory region, matrix B, in all variants. It shows uniform locality patterns where the affinity pattern remains the same across all reference locations (horizontal axis). We make four observations from the heatmaps.
First, consider the \(\textsf {SD}\) metric. In all three variants, (a) the yellow box in the \(\textsf {SD}\) heatmap shows that the variants have high \(\textsf {SD}\) values for the self location only, (b) the yellow box in the \(\textsf {SI}\) heatmap shows low (best) \(\textsf {SI}\) values for the same affinity location (self), and (c) ① shows that the affinity range contains the self location only. These observations indicate that no other affinity locations are accessed within the lifetime of each reference block, and blocks are reused within a short interval.
Second, consider \(\textsf {SD}^{*}\) for self. ② combines \(\textsf {SD}\) and \(\textsf {SI}\) heatmaps and shows high values in all variants, indicating that reuse is high, with the best values in COO-Reduce.
Third, consider the \(\textsf {SA}\) metric. For CSR in Fig. 7a and HICOO-S in Fig. 7c, \(\textsf {SA}\) is high and similar across contiguous locations, and \(\textsf {SI}\) remains low (good) for all reference blocks. For COO-Reduce in Fig. 7b, \(\textsf {SA}\) is not uniform and the values are insignificant, as it incorporates the highest access counts of reference blocks, but note that it is accompanied by good \(\textsf {SI}\) values. Though \(\textsf {SA}\) for COO-Reduce differs from the other two, \(\textsf {SI}\) remains similar. These observations point to regular, forward traversal access patterns in all three variants.
Fourth, consider \(\textsf {SA}^{*}\) for + 1 offset location. ③ combines \(\textsf {SA}\) and \(\textsf {SI}\) heatmaps and shows high values in CSR and HICOO-S variants, indicating that they effectively leverage the prefetcher. Though \(\textsf {SA}^{*}\) for + 1 offset location is low in COO-Reduce, it is accessed in a temporal forward direction.
To summarize, for all variants (a) the range of affinity blocks is self, (b) blocks have high reuse, and (c) \(\textsf {SI}\) heatmaps show effective prefetching. These observations point to regular, strided access with high spatial-temporal locality and suggest that all variants follow guideline 1 in section 2 for good performance, but there are differences in the rate of reuse. We conclude that same-line locality is most important for the matrix variants. Although this type of locality does not strictly correspond to either caching or prefetch schemes, it is important and captured by our affinity metrics.
Aff. Vector. Baseline \(\textsf {RD}\) values for all variants in table 2 are equal (best value of ~1) and indicate similar behavior, whereas distinct \(\textsf {SD}^{*}\) values show the trade-off between storage format and locality. Note that COO-Reduce, with the highest \(\textsf {SD}^{*}\) value (high reuse), is not the best-performing variant because it requires additional buffering, as shown by its very high access counts (\(\textsf {A}\)). Though CSR and HICOO-S have smaller \(\textsf {SD}^{*}\) values, they are efficient because of effective data reuse based on instruction-level parallelism (\(\textsf {A}\) in table 2 is 4× to account for SSE).
It is worth noting that higher reuse, as measured by \(\textsf {SD}^{*}\) values, translates to superior performance within the bounds of access frequency. Thus, spatial affinity metrics combined with other characteristics, such as access frequency, provide complete information about performance. At the same time, spatial affinity metrics provide a more precise spatial-temporal locality measure than reuse distance (even when reuse distance has its best value, ~1).

7.2 miniVite

miniVite [10] is a graph benchmark for community detection that uses Louvain optimization. Variants in our analysis use different hash table implementations for the hottest map object. v1 uses a C++ unordered_map, an open hash table with an array of keys, each containing a linked list of items, and hence irregular accesses. v2 and v3 use TSL hopscotch [15, 38], a closed hash table that stores items in a contiguous array and replaces irregular accesses with strided ones. v2 uses the default table size and does dynamic resizing. v3 avoids resizing by specifying the right size for each instance.
Table 3: miniVite: Data locality and affinity vector for map object.

| Region | Variant | Run time | \(\textsf{A}\) | \(\textsf{RD}\) | Realized \(\textsf{SA}_{r}^{*}\) | Realized \(\textsf{SD}_{r}^{*}\) | Potential \(\textsf{SA}_{r}^{*}\) | Potential \(\textsf{SD}_{r}^{*}\) |
|---|---|---|---|---|---|---|---|---|
| map (hash table) | v1 | 9.1 s | 301.8K | 2.8 | 4.3 | 1.3 | 4.8 | 1.4 |
|  | v2 | 6.7 s | 487.4K | 3.3 | 7.5 | 2.9 | 8.1 | 2.9 |
|  | v3 | 4.9 s | 284.7K | 2.9 | 6.0 | 1.9 | 7.1 | 2.1 |
Figure 8: miniVite variants: Affinity heatmaps for map object.
Figure 9: miniVite variants: Affinity score histogram, distribution of block pairs in hot-cluster of map object’s affinity heatmap.
Aff. Heatmap. Figure 8 shows heatmaps for the map object, with varying affinity patterns among reference blocks (horizontal axis) in each variant. We begin with five observations.
First, consider the \(\textsf {SD}\) metric. In Fig. 8a for v1, observe that (a) the shaded box in the \(\textsf {SD}\) heatmap shows sparse and scattered affinity to contiguous locations, (b) the shaded box in the \(\textsf {SI}\) heatmap shows low (good) \(\textsf {SI}\) values for these locations, and (c) ① shows that the range of affinity locations extends to a wider neighborhood (+14 to -10 offset locations on the vertical axis). Scattered \(\textsf {SD}\) values within the range, along with good \(\textsf {SI}\), indicate that distant locations are accessed in close temporal proximity. This pattern points to negligible spatial and temporal affinity to adjacent locations, reflective of irregular accesses in the linked list. For v2 and v3 in Fig. 8b and Fig. 8c, note that (a) the shaded box in the \(\textsf {SD}\) heatmap shows \(\textsf {SD}\) values that are congregated towards same-line and adjacent locations, (b) the shaded box in the \(\textsf {SI}\) heatmap shows a trend of increasing \(\textsf {SI}\) with offset distance, and (c) ① for v2 and v3 shows affinity to a low range of adjacent locations (v2: +4 to -7, v3: +7 to -8). Concentrated \(\textsf {SD}\) along with a good trend in \(\textsf {SI}\) values indicates better affinity to adjacent locations, evidence of the strided traversal of map.
Second, consider \(\textsf {SD}^{*}\) for self. ② combines \(\textsf {SD}\) and \(\textsf {SI}\) heatmaps and shows low \(\textsf {SD}^{*}\) value in all variants. v1’s \(\textsf {SD}^{*}\) values are very low. Both v3 and v2 have better (slightly higher) \(\textsf {SD}^{*}\) values as the traversal limits the reuse of blocks. We infer that cache-friendly and same-line reuse improves in v2 and v3, but it is still relatively low in all variants.
Third, consider the \(\textsf {SA}\) metric. In Fig. 8a for v1, notice that (a) the shaded box in the \(\textsf {SA}\) heatmap shows non-uniform \(\textsf {SA}\) values that are spread over contiguous locations, (b) as noted in the \(\textsf {SD}\) discussion, \(\textsf {SI}\) values also favor distant locations, and (c) ① shows that v1's anticipation extends to a wide range of contiguous locations. \(\textsf {SA}\) correlation to distant locations shows that there is no anticipatory access to spatially adjacent locations. In Fig. 8b and Fig. 8c for v2 and v3, (a) the shaded box in the \(\textsf {SA}\) heatmap shows high \(\textsf {SA}\) values for closer neighbors, especially in v3, (b) as noted in the \(\textsf {SD}\) discussion, \(\textsf {SI}\) values show a preference for adjacent locations, and (c) ① for v2 and v3 shows anticipation to adjacent locations. High \(\textsf {SA}\) values along with good \(\textsf {SI}\) within a much smaller range of adjacent locations indicate anticipatory access with better spatial locality in v3.
Fourth, consider \(\textsf {SA}^{*}\) to the +1 offset location. For v1 in Fig. 8a, ③ (combining the \(\textsf {SA}\) and \(\textsf {SI}\) heatmaps) shows negligible values, indicating that v1 has no prefetching advantage. From ③ in Fig. 8b and Fig. 8c, we note that v3 has more reference locations with impactful \(\textsf {SA}^{*}\) to the +1 offset location, pointing to more beneficial prefetching than in v2.
Finally, temporal locality (④ upper band in all heatmaps) to hot-lines remains significant in all variants, and it is similar between all variants.
Interestingly, no variant has affinity relations that follow guideline 1 in section 2 for good performance, but the \(\textsf {SA}\) and \(\textsf {SI}\) heatmaps show improving spatial locality from v1 to v3.
In summary, changing map's data structure from an open to a closed hash table (a) improves \(\textsf {SA}\) more than \(\textsf {SD}\) for adjacent locations, and (b) though caching (\(\textsf {SD}\)) for adjacent locations remains low, it is high for hot-lines in all variants. We conclude that the primary explanation of v2's and v3's performance is a data structure that exploits hardware prefetching.
Aff. Histogram. Figure 9 shows histogram plots; recall that a distribution is better when it has more area that is skewed “up and to the right”. Figure 9a for \(\textsf {SA}^{*}\) shows that v3 has a high distribution of affinity pairs (upper section), with a greater number of pairs at higher score values (right section). v3 has far more impactful affinity as the weighted value of the area under the curve increases towards high \(\textsf {SA}^{*}\) values. Figure 9b for \(\textsf {SD}^{*}\) shows that both v3 and v2 have a better distribution of affinity pairs than v1. Though v2 has a distribution almost comparable to v3's, a significant portion of the accesses in v2 are unnecessary (resizing) and hinder performance. Again, \(\textsf {SD}^{*}\) shows an improving trend from v1 to v3, but \(\textsf {SA}^{*}\) shows better variation and explains the better performance of v3.
Aff. Vector. First, similar values for baseline \(\textsf {RD}\) in table 3 for v1 and v3 fail to differentiate the performance of these variants.
Second, consider realized scores for quantifying the current layout. \(\textsf {SA}_{r}^{*}\) shows a higher value for v3 than for v1, reflecting the beneficial prefetching. Realized \(\textsf {SD}_{r}^{*}\) shows a higher value for v3 than for v1, indicating better reuse. In both cases, though v2 has the highest scores, the impact of the affinity score is reduced as the accesses include added copying along with traversal through the array.
Third, consider potential scores for possible layout changes and their impact on affinity. Potential \(\textsf {SA}_{r}^{*}\) also shows a higher value for v3. Possible options for the potential score are spread over a small neighborhood of adjacent locations within the range, indicating that re-layout can lead to better locality. For all variants, \(\textsf {SD}_{r}^{*}\) does not show a difference between realized and potential scores as \(\textsf {SD}\) values are concentrated among the hot-lines.

7.3 Alpaca.cpp

Alpaca.cpp [9] is an LLM inferencing application implemented in C++, parallelized using Pthreads. It combines efforts such as LLaMA [39] and Alpaca [37]. Inferencing in our evaluation uses the LLaMA 7B parameter model quantized with 4-bit integers (ggml-alpaca-7b-q4.bin) with seed value 1685480810, and default values for other parameters.
Among the 12K lines of code (7 MB binary), location analysis highlights two regions, x and y, in the hotspot function ggml_vec_dot_q4_0. Region x corresponds to src0_row, with spatially sparse accesses identified by hot-access analysis; region y corresponds to src1_col, a hot-contiguous region.
Table 4: Alpaca.cpp: Data locality and affinity vector.

| Region | \(\textsf{A}\) | \(\textsf{RD}\) | Realized \(\textsf{SA}_{r}^{*}\) | Realized \(\textsf{SD}_{r}^{*}\) | Potential \(\textsf{SA}_{r}^{*}\) | Potential \(\textsf{SD}_{r}^{*}\) |
|---|---|---|---|---|---|---|
| src0_row | 4.8M | 1.25 | 0.4 | 0.7 | 0.6 | 0.7 |
| src1_col | 28.2M | 2.99 | 1.4 | 0.2 | 2.8 | 0.26 |
Figure 10: Alpaca.cpp: Affinity heatmaps for hot regions src0_row and src1_col.
Aff. Heatmap. Heatmaps in fig. 10 show different locality patterns, with differing affinity patterns across reference locations (horizontal axis) for each region. We use signatures to differentiate the access patterns in the two regions.
src0_row. We start with src0_row object and its signature in Fig. 10a. First, consider the \(\textsf {SD}\) metric. \(\textsf {SD}\) heatmap shows high values for self, and ① shows that affinity is limited to + 1, −1 offset locations and self. This indicates that only adjacent locations are accessed after access to a reference block.
Second, consider \(\textsf {SD}^{*}\) for self. ② combines the \(\textsf {SD}\) and \(\textsf {SI}\) heatmaps and shows high \(\textsf {SD}^{*}\) values (~0.3) for all blocks, indicating that these blocks are highly reused.
Third, consider the \(\textsf {SA}\) metric. The \(\textsf {SA}\) heatmap shows higher values for the +1 offset than the −1 offset. The \(\textsf {SI}\) heatmap also shows low (good) \(\textsf {SI}\) values for the +1 offset and high (bad) values for the −1 offset. This shows that the +1 offset location has high correlation and the access pattern is regular, strided access with mostly forward traversal.
Fourth, consider \(\textsf {SA}^{*}\) for the +1 offset location. ③ shows a combination of low \(\textsf {SI}\) values and higher \(\textsf {SA}\) values, implying that +1 offset locations are accessed in close temporal proximity.
The observed pattern suggests that access behavior for src0_row has good spatial-temporal locality, as in guideline 1 in section 2.
src1_col. Now, we discuss the src1_col object's signature in Fig. 10b. First, consider the \(\textsf {SD}\) metric. In Fig. 10b, observe that (a) the shaded box in the \(\textsf {SD}\) heatmap shows scattered and sparse \(\textsf {SD}\) values, (b) the shaded box in the \(\textsf {SI}\) heatmap points to increasing \(\textsf {SI}\) for positive offset locations, and high values for self and negative offsets, and (c) ① shows a wider affinity range (+6 to -25 offset locations). Scattered affinity to locations across the range, with irregular \(\textsf {SI}\) values, shows that the access pattern in src1_col traverses a wide range of adjacent locations within each reference block's lifetime.
Second, consider \(\textsf {SD}^{*}\) for self. ② shows high \(\textsf {SI}\) and low \(\textsf {SD}\) (~0.1) for self locations, indicating that reuse remains low; this is an example of guideline 5 in section 2.
Third, consider the \(\textsf {SA}\) metric. Notice that (a) the shaded box in the \(\textsf {SA}\) heatmap shows uniform \(\textsf {SA}\) values for all locations, and (b) as noted in \(\textsf {SD}\) discussions, \(\textsf {SI}\) values show a preference for positive offset locations. We observe that the access pattern in src1_col has beneficial anticipation towards positive offset locations.
Fourth, consider \(\textsf {SA}^{*}\) for the +1 offset location. ③, with a high \(\textsf {SA}^{*}\) value, shows high anticipatory spatial-temporal locality to the +1 offset location and an access pattern that leverages the hardware prefetcher.
We verified that the decoder implementation in Alpaca.cpp uses a key-value cache for efficient inferencing. We are exploring other ways to improve locality for src1_col.
Aff. Vector. First, the baseline \(\textsf {RD}\) value in table 4 is low for src0_row and hence indicates better locality, informing about its good temporal locality. Second, realized scores show a high \(\textsf {SA}_{r}^{*}\) value for src1_col and low values for src0_row, highlighting the benefits of spatial locality and prefetching in src1_col. Also, realized \(\textsf {SD}_{r}^{*}\) shows high values for src0_row and low values for src1_col, pointing to low reuse in src1_col. Third, the potential score \(\textsf {SA}_{r}^{*}\) shows that efforts to reorder accesses for improved prefetching should focus on src1_col. Potential \(\textsf {SD}_{r}^{*}\) does not differ from the realized score, suggesting that efforts to improve reuse might need higher levels of refactoring.
Thus, we observe that spatial affinity metrics provide more insight about data layout, access patterns, and their implications for memory performance than reuse distance analysis does.

7.4 XSBench

XSBench [40, 41] is a proxy application that models the Monte Carlo neutron transport algorithm, specifically the calculation of macroscopic neutron cross sections. We evaluate two variants of the event-based transport model: baseline k0 and optimized k1. Baseline k0 parallelizes cross section lookups for all particles with varying materials and follows a random access pattern. Optimized k1 sorts the particles by material and energy, facilitating efficient memory access.
Figure 11: XSBench-event variants: Affinity score histogram; distribution of block pairs in the hot cluster of the material and nuclide region.
Figure 12: XSBench-event variants: Affinity signal plots for a section of blocks in the energy grid and grid cross section regions.
Region | Variant | Run time | A | \(\textsf {RD}\) | Realized \(\textsf {SA}_{r}^{*}\) | Realized \(\textsf {SD}_{r}^{*}\) | Potential \(\textsf {SA}_{r}^{*}\) | Potential \(\textsf {SD}_{r}^{*}\)
material and nuclide | k0 | 52 s | 1.5M | 2.2 | 1.24 | 0.95 | 1.24 | 0.95
material and nuclide | k1 | 14 s | 1.5M | 2.3 | 1.25 | 0.97 | 1.27 | 0.97
Table 5: XSBench-event variants: Data locality and affinity vector for the material and nuclide region.
Locality analysis identifies three regions: a hot-contiguous region, material and nuclide; and two hot regions with sparse accesses, energy grid and grid cross section.
First, we describe signatures for the hot-contiguous region, material and nuclide.
Aff. Histogram. We exclude heatmaps for hot-contiguous regions because both variants use the same data structure and their affinity patterns are similar. The histograms of affinity pairs for material and nuclide in Fig. 11a and Fig. 11b show a larger number of affinity pairs (skewed “up and to the right”) and better affinity in k1.
Aff. Vector. First, the baseline \(\textsf {RD}\) shows better (lower) values for k0 and thus misleads about performance. Second, the realized scores in Table 5 show better \(\textsf {SA}_{r}^{*}\) and \(\textsf {SD}_{r}^{*}\) values for k1, reflecting k1’s optimized accesses. Third, consider the potential scores. The similarity between the realized and potential scores suggests that the material and nuclide region already has the best possible layout, and that optimizations should focus on other regions.
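The decision rule applied here (and for Alpaca.cpp in §7.3) can be stated compactly. The sketch below is a hypothetical helper that encodes it; the function and the tolerance threshold are our assumptions, not part of our tool:

    #include <string>

    // Hypothetical helper encoding the interpretation used in this section:
    // a potential score well above the realized score means reordering
    // accesses (or relayout) can close the gap; matching scores mean the
    // region is already near its best layout and effort belongs elsewhere.
    std::string interpret(double realized, double potential,
                          double tolerance = 0.05 /* assumed */) {
        if (potential - realized > tolerance)
            return "reorder accesses / improve layout in this region";
        return "region near-optimal; focus on other regions or deeper refactoring";
    }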
Region | Variant | A | \(\textsf {RD}\)
energy grid | k0 | 32.7K | 1.0
energy grid | k1 | 253.2K | 7.3
grid cross section | k0 | 9.3M | 0.0
grid cross section | k1 | 9.4M | 0.2
Table 6: XSBench-event variants: Reuse distance and access frequency for the energy grid and grid cross section regions.
Now, we focus on the sparsely accessed regions (energy grid and grid cross section). We use signals to compare against the baseline \(\textsf {RD}\), as the scores are not applicable due to the highly sparse accesses.
Aff. Signal. The baseline \(\textsf {RD}\) in Table 6 for the sparsely accessed regions (energy grid and grid cross section) shows better (lower) values for k0 and misleads about performance.
For the same regions, affinity signals are shown in Fig. 12. k0’s signal plots appear in Fig. 12a. The bottom subplot, for access frequency, shows sparsely accessed blocks. The middle subplot, for self, shows varying \(\textsf {SA}^{*}\) and \(\textsf {SD}^{*}\). The top subplot, for the +1 offset, shows no \(\textsf {SA}^{*}\) to +1 offset locations. These observations indicate poor temporal as well as spatial locality across all blocks.
k1’s signal plots appear in Fig. 12b. The bottom subplot, for access frequency, shows more blocks with higher access frequency. The middle subplot, for self, shows varying \(\textsf {SA}^{*}\) and \(\textsf {SD}^{*}\), but they are effective because the blocks have high access frequency. The top subplot, for the +1 offset, shows considerable \(\textsf {SA}^{*}\) to +1 offset locations. These observations indicate that k1 has better spatial-temporal locality; its signatures with better spatial-temporal affinity patterns reflect k1’s optimized memory access patterns.
We conclude that spatial affinity signatures capture differences in access patterns for regions with vastly different footprints, and that the performance of k1 depends on both caching and prefetching.

8 Related Work

We introduce new spatial-temporal affinity metrics and a scalable affinity analysis that can operate at multiple resolutions.
Affinity and correlation. In the closest work [30], Miucin and Fedorova measure the co-occurrence of pairs of data objects within a window to detect communities among objects and guide data layout. Their approach uses co-occurrence within a fixed window as a measure of affinity to pack objects into a cache line and reduce the cache miss rate. In contrast, we use generalized windows that incorporate temporal locality and lifetime, and we use the strength of pairs of metrics to identify locality signatures. Moreover, our multi-resolution analysis applies to objects as well as to memory locations within large data objects.
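To make the contrast concrete, fixed-window co-occurrence counting in the spirit of [30] can be sketched as follows; this is our simplification, and the object IDs, trace representation, and window size W are illustrative:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <deque>
    #include <map>
    #include <utility>
    #include <vector>

    // Count how often each pair of object IDs appears within a fixed window
    // of W consecutive accesses. [30] uses such counts to group objects into
    // communities; our metrics instead generalize the window using lifetime
    // and forward temporal distance.
    std::map<std::pair<uint64_t, uint64_t>, uint64_t>
    cooccurrence(const std::vector<uint64_t>& trace, std::size_t W) {
        std::map<std::pair<uint64_t, uint64_t>, uint64_t> counts;
        std::deque<uint64_t> window;
        for (uint64_t id : trace) {
            for (uint64_t other : window)
                if (other != id)
                    counts[{std::min(id, other), std::max(id, other)}]++;
            window.push_back(id);
            if (window.size() > W) window.pop_front();  // keep last W accesses
        }
        return counts;
    }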
Another related concept, reference affinity [49], measures affinity between two data objects based on similarities in their single-location reuse. Reference affinity has been extended to add more insights: weakly affined references [48]; hierarchical data locality [47]; and conditional-probability-based co-occurrence between pairs of affine data [26]. Additionally, Ning et al. [31] propose improved field affinity to measure affinity between structure fields and reorganize structure layouts within cache lines to increase cache reuse. Both reference affinity and field affinity follow an actor-centric formulation for analyzing accesses at user-guided access sites. In contrast, our location-centric analysis covers the spatial-temporal locality and reuse of memory locations to describe affinity between locations.
Sobel et al. [36] present an improved algorithm (with linear complexity) for co-occurrence counting [26] over multiple pattern sets. In our effort, we use zooming to restrict the pattern sets for affinity analysis; the reduced pattern set focuses on significant locations and is sufficient to construct affinity signatures. If many more pattern sets are needed to capture an affinity signature, one could use [36] as an alternative.
Spatial-temporal probability. Anghel et al. [1] introduce multi-dimensional spatial-temporal locality for single address locations, measure the probability of memory accesses with specific spatial distances and reference windows, and visualize the probability distribution as a heatmap. Our work focuses on the correlation between multiple locations and generalizes reference windows as lifetime in spatial density and as forward temporal distance in spatial interval.
Spatial locality measurements. Multiple efforts have recognized the limitations (single location, temporal dimension only) of using reuse distance for spatial locality analysis and have explored novel ways to measure spatial locality. Gu et al. [11] measure the change of reuse distance at different block granularities to identify program components whose data layout can be improved. Multi-spectral reuse distance [4] uses reuse distance histograms at different data block sizes, together with footprint metrics, to study access patterns and identify appropriate page sizes. Maeda et al. [27] implement hierarchical reuse distance to preserve locality details at multiple granularities and aid design exploration. Gupta et al. [12] measure aggregated spatial locality with specific neighborhood and window sizes to evaluate the effects of cache specifications and their impact on performance. These methods are constrained to specific neighborhood and window sizes, whereas we use multi-resolution analysis to expand the neighborhood and use the measured temporal distance \(\textsf {SI}\) to quantify locality.
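To illustrate the granularity dimension these efforts vary, the sketch below computes reuse distances after mapping addresses to blocks of a chosen size; it is a naive formulation for clarity (efficient implementations use tree structures), and the trace representation is illustrative:

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    // Reuse distance of an access = number of distinct blocks touched since
    // the previous access to the same block. Coarser blocks (larger blockBits)
    // fold spatial locality into this temporal metric, which is how efforts
    // such as [4, 11, 27] recover spatial information from reuse distance.
    std::vector<std::size_t>
    reuse_distances(const std::vector<uint64_t>& addrs, unsigned blockBits) {
        std::vector<uint64_t> blocks;
        blocks.reserve(addrs.size());
        for (uint64_t a : addrs) blocks.push_back(a >> blockBits);

        std::unordered_map<uint64_t, std::size_t> lastPos;
        std::vector<std::size_t> dists;
        for (std::size_t i = 0; i < blocks.size(); ++i) {
            auto it = lastPos.find(blocks[i]);
            if (it != lastPos.end()) {
                std::unordered_set<uint64_t> distinct(
                    blocks.begin() + it->second + 1, blocks.begin() + i);
                dists.push_back(distinct.size());  // naive O(n) per reuse
            }
            lastPos[blocks[i]] = i;
        }
        return dists;
    }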
Beyond addressing the limitations of reuse distance analysis, our work adds novel spatial-temporal affinity metrics and visualizations that reveal access patterns and highlight affinity between locations, as well as combinations of metrics that identify opportunities for improving data layout.
Pattern analysis. Pattern analysis in various forms has been exploited in a variety of fields. We focus on pattern analysis methods used to improve prefetching in storage systems, as they are closely related to our effort. The occurrence of frequent sequences [7, 25] in I/O access patterns is used to identify block correlations within specified windows and direct the prefetching of blocks. These efforts use temporal [25] or spatial-temporal [7] information to identify correlations between blocks within a specific window and improve I/O response times. They exemplify the impact of correlations among multiple blocks and influence our work in finding highly correlated memory locations.
There has been recent interest in learning access patterns using machine learning. This work ranges from learning optimal replacement policies [3, 16, 34, 43] to learning patterns that assist prefetching [13, 14, 21, 32, 33, 42]. In general, these approaches require substantial training time or data. Moreover, most prefetching approaches depend critically on program control structure, not data locality.

9 Conclusions

We present novel spatial affinity metrics that capture affinity between pairs of memory locations along multiple dimensions and at multiple resolutions. We show how to characterize spatial-temporal affinity with an efficient, location-based, multi-resolution analysis that enables the analysis of full applications with large memory footprints. Our evaluations demonstrate that affinity signatures can predictively represent important and differing forms of spatial-temporal locality, including differences that arise from variant data structures, layouts, and access patterns. We conclude that spatial affinity metrics provide more interpretive insight into the impact of spatial-temporal locality than prior methods and metrics that rely on a single location. Our future work includes applying our methods to application data objects with dynamic lifetimes, to memory allocation and data layouts, and to hardware-software co-design.

Acknowledgments

This research is supported by the U.S. Department of Energy (DOE) through the Office of Advanced Scientific Computing Research’s “Advanced Memory to Support Artificial Intelligence for Science” and “Orchestration for Distributed & Data-Intensive Scientific Exploration”. PNNL is operated by Battelle for the DOE under Contract DE-AC05-76RL01830.

References

[1]
Andreea Anghel, Gero Dittmann, Rik Jongerius, and R Luijten. 2013. Spatio-Temporal Locality Characterization. In 1st Workshop on Near-Data Processing. 1–5.
[2]
Amro Awad and Yan Solihin. 2014. Stm: Cloning the spatial and temporal memory access behavior. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 237–247.
[3]
Daniel S. Berger. 2018. Towards Lightweight and Robust Machine Learning for CDN Caching. In Proceedings of the 17th ACM Workshop on Hot Topics in Networks (Redmond, WA, USA) (HotNets ’18). Association for Computing Machinery, New York, NY, USA, 134–140.
[4]
Anthony M. Cabrera, Roger D. Chamberlain, and Jonathan C. Beard. 2019. Multi-spectral Reuse Distance: Divining Spatial Information from Temporal Data. In 2019 IEEE High Performance Extreme Computing Conference (HPEC). 1–8.
[5]
Trishul M Chilimbi. 2001. Efficient representations and abstractions for quantifying and exploiting data reference locality. ACM SIGPLAN Notices 36, 5 (2001), 191–202.
[6]
Timothy A Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS) 38, 1 (2011), 1–25.
[7]
Xiaoning Ding, Song Jiang, Feng Chen, Kei Davis, and Xiaodong Zhang. 2007. DiskSeen: Exploiting Disk Layout and Access History to Enhance I/O Prefetch. In USENIX Annual Technical Conference, Vol. 7. 261–274.
[8]
H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger. 2011. Dark silicon and the end of multicore scaling. In 2011 38th Annual International Symposium on Computer Architecture (ISCA). 365–376.
[9]
Georgi Gerganov. 2023. Alpaca.cpp: Inference of Meta’s LLaMA model (and others) in pure C/C++. https://github.com/antimatter15/alpaca.cpp.
[10]
Sayan Ghosh, Mahantesh Halappanavar, Antonino Tumeo, Ananth Kalyanaraman, and Assefaw H Gebremedhin. 2018. miniVite: A Graph Analytics Benchmarking Tool for Massively Parallel Systems. In Proc. of the 9th Intl. Workshop on Performance Modeling, Benchmarking, and Simulation of High-Performance Computer Systems.
[11]
Xiaoming Gu, Ian Christopher, Tongxin Bai, Chengliang Zhang, and Chen Ding. 2009. A component model of spatial locality. In Proceedings of the 2009 international symposium on Memory management. 99–108.
[12]
Saurabh Gupta, Ping Xiang, Yi Yang, and Huiyang Zhou. 2013. Locality principle revisited: A probability-based quantitative approach. J. Parallel and Distrib. Comput. 73, 7 (2013), 1011–1027.
[13]
Diana Guttman, Mahmut Taylan Kandemir, Meena Arunachalam, and Rahul Khanna. 2015. Machine learning techniques for improved data prefetching. In 5th International Conference on Energy Aware Computing Systems & Applications. IEEE, 1–4.
[14]
Milad Hashemi, Kevin Swersky, Jamie A Smith, Grant Ayers, Heiner Litz, Jichuan Chang, Christos Kozyrakis, and Parthasarathy Ranganathan. 2018. Learning memory access patterns. arXiv preprint arXiv:1803.02329 (2018).
[15]
Maurice Herlihy, Nir Shavit, and Moran Tzafrir. 2008. Hopscotch Hashing. In Distributed Computing, Gadi Taubenfeld (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 350–364.
[16]
Akanksha Jain and Calvin Lin. 2016. Back to the Future: Leveraging Belady’s Algorithm for Improved Cache Replacement. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 78–89.
[17]
Inah Jeon, Evangelos E. Papalexakis, U Kang, and Christos Faloutsos. 2015. HaTen2: Billion-scale Tensor Decompositions. In IEEE International Conference on Data Engineering (ICDE).
[18]
Ozgur Kilic, Yasodha Suriyakumar, Nathan R. Tallent, and Andrés Marquez. 2024. MemGaze. https://github.com/pnnl/memgaze.
[19]
Ozgur O. Kilic, Nathan R. Tallent, and Ryan D. Friese. 2020. Rapid Memory Footprint Access Diagnostics. In Proc. of the 2020 IEEE Intl. Symp. on Performance Analysis of Systems and Software. IEEE Computer Society.
[20]
Ozgur O. Kilic, Nathan R. Tallent, Yasodha Suriyakumar, Chenhao Xie, Andrés Marquez, and Stephane Eranian. 2022. MemGaze: Rapid and Effective Load-Level Memory Trace Analysis. In 2022 IEEE International Conference on Cluster Computing (CLUSTER). 484–495.
[21]
Arezki Laga, Jalil Boukhobza, Michel Koskas, and Frank Singhoff. 2016. Lynx: A learning linux prefetching mechanism for ssd performance model. In 2016 5th Non-Volatile Memory Systems and Applications Symposium (NVMSA). IEEE, 1–6.
[22]
Jiajia Li and PNNL Developer Central. 2020. HiParTI: A Hierarchical Parallel Tensor Infrastructure. https://github.com/pnnl/HiParTI Last accessed: 2023-03-20.
[23]
Jiajia Li, Jimeng Sun, and Richard Vuduc. 2018. HiCOO: Hierarchical Storage of Sparse Tensors. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. 238–252.
[24]
Jiajia Li, Bora Uçar, Ümit V Çatalyürek, Jimeng Sun, Kevin Barker, and Richard Vuduc. 2019. Efficient and effective sparse tensor reordering. In Proceedings of the ACM International Conference on Supercomputing (Phoenix, Arizona) (ICS ’19). Association for Computing Machinery, New York, NY, USA, 227–237.
[25]
Zhenmin Li, Zhifeng Chen, and Yuanyuan Zhou. 2005. Mining Block Correlations to Improve Storage Performance. ACM Trans. Storage 1, 2 (may 2005), 213–245.
[26]
Yumeng Liu, Daniel Busaba, Chen Ding, and Daniel Gildea. 2018. All timescale window co-occurrence: efficient analysis and a possible use. In Proceedings of the 28th Annual International Conference on Computer Science and Software Engineering. 289–292.
[27]
Rafael K. V. Maeda, Qiong Cai, Jiang Xu, Zhe Wang, and Zhongyuan Tian. 2017. Fast and Accurate Exploration of Multi-level Caches Using Hierarchical Reuse Distance. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). 145–156.
[28]
R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. 1970. Evaluation techniques for storage hierarchies. IBM Systems Journal 9, 2 (1970), 78–117.
[29]
Sally A McKee. 2004. Reflections on the memory wall. In Proceedings of the 1st conference on Computing frontiers. 162.
[30]
Svetozar Miucin and Alexandra Fedorova. 2018. Data-driven spatial locality. In Proceedings of the International Symposium on Memory Systems. 243–253.
[31]
Zhuorui Ning, Naijie Gu, Junjie Su, and Dongsheng Qi. 2022. STAFF: A Model for Structure Layout Optimization. In 2022 7th International Conference on Computer and Communication Systems (ICCCS). IEEE, 115–122.
[32]
Leeor Peled, Shie Mannor, Uri Weiser, and Yoav Etsion. 2015. Semantic locality and context-based prefetching using reinforcement learning. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). IEEE, 285–297.
[33]
Saami Rahman, Martin Burtscher, Ziliang Zong, and Apan Qasem. 2015. Maximizing hardware prefetch effectiveness with machine learning. In 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems. IEEE, 383–389.
[34]
Zhan Shi, Xiangru Huang, Akanksha Jain, and Calvin Lin. 2019. Applying Deep Learning to the Cache Replacement Problem. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (Columbus, OH, USA) (MICRO ’52). Association for Computing Machinery, New York, NY, USA, 413–425.
[35]
Shaden Smith, Jee W. Choi, Jiajia Li, Richard Vuduc, Jongsoo Park, Xing Liu, and George Karypis. 2017. FROSTT: The Formidable Repository of Open Sparse Tensors and Tools. http://frostt.io/ Last accessed: 2023-03-20.
[36]
Joshua Sobel, Noah Bertram, Chen Ding, Fatemeh Nargesian, and Daniel Gildea. 2020. AWLCO: All-Window Length Co-Occurrence. In Annual Symposium on Combinatorial Pattern Matching.
[37]
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
[38]
Thibaut G. Tessil. 2019. Tessil hopscotch-map. github.com/Tessil/hopscotch-map.
[39]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
[40]
John Tramm, Ron Rahaman, Amanda Lund, and other contributors. 2021. XSBench. https://github.com/ANL-CESAR/XSBench Last accessed: 2023-03-20.
[41]
John R Tramm, Andrew R Siegel, Tanzima Islam, and Martin Schulz. 2014. XSBench - The Development and Verification of a Performance Abstraction for Monte Carlo Reactor Analysis. In PHYSOR 2014 - The Role of Reactor Physics toward a Sustainable Future. Kyoto. https://www.mcs.anl.gov/papers/P5064-0114.pdf
[42]
Ahsen J Uppal, Ron C Chiang, and H Howie Huang. 2012. Flashy prefetching for high-performance flash drives. In 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST). IEEE, 1–12.
[43]
Nan Wu and Pengcheng Li. 2020. Phoebe: Reuse-Aware Online Caching with Reinforcement Learning for Emerging Storage Models. arXiv:2011.07160 [cs.PF].
[44]
Wm A Wulf and Sally A McKee. 1995. Hitting the memory wall: Implications of the obvious. ACM SIGARCH computer architecture news 23, 1 (1995), 20–24.
[45]
Xiaoya Xiang, Chen Ding, Hao Luo, and Bin Bao. 2013. HOTL: A Higher Order Theory of Locality. SIGARCH Comput. Archit. News 41, 1 (March 2013), 343–356.
[46]
Liang Yuan, Chen Ding, Wesley Smith, Peter Denning, and Yunquan Zhang. 2019. A Relational Theory of Locality. ACM Trans. Archit. Code Optim. 16, 3, Article 33 (Aug. 2019), 26 pages.
[47]
Chengliang Zhang, Chen Ding, Mitsunori Ogihara, Yutao Zhong, and Youfeng Wu. 2006. A hierarchical model of data locality. ACM SIGPLAN Notices 41, 1 (2006), 16–29.
[48]
Chengliang Zhang, Yutao Zhong, Chen Ding, and Mitsunori Ogihara. 2004. Finding the Reference Affinity Groups in Trace using Sampling Method. http://hdl.handle.net/1802/1028
[49]
Yutao Zhong, Maksim Orlovich, Xipeng Shen, and Chen Ding. 2004. Array regrouping and structure splitting using whole-program reference affinity. ACM SIGPLAN Notices 39, 6 (2004), 255–266.
