AMC: Access to Miss Correlation Prefetcher for Evolving Graph Analytics

Abhishek Singh12, Christian Schulte3, Xiaochen Guo12
1Lehigh University 2Samsung Semiconductor, Inc., USA 3Columbia University
Abstract

Modern memory hierarchies work well with applications that have good spatial locality. Evolving (dynamic) graphs are widely used to model graphs and networks whose edges and vertices change over time. Applications that process them exhibit irregular memory access patterns and suffer from high miss ratios and long miss penalties. Prefetching can be employed to predict and fetch future demand misses. However, current hardware prefetchers cannot predict efficiently for applications with irregular memory accesses.

In evolving graph applications, vertices that are unaffected by graph changes exhibit the same access correlation patterns. Current temporal prefetchers use one-to-one or one-to-many correlation to exploit these patterns. Similar patterns are recorded in the same entry, which causes aliasing and can lead to poor prefetch accuracy and coverage. This work proposes a software-assisted hardware prefetcher for evolving graphs. The key idea is to record the correlations between a sequence of vertex accesses and the following misses and then prefetch when the same vertex access sequence occurs in the future. The proposed Access-to-Miss Correlation (AMC) prefetcher provides a lightweight programming interface to identify the data structures of interest and to set the iteration boundary at which the correlation table is updated. For the evaluated applications, AMC achieves a geomean speedup of 1.5× over the best-performing prefetcher in prior work (VLDP). AMC achieves an average of 62% accuracy and coverage, whereas VLDP has an accuracy of 31% and coverage of 23%.

I INTRODUCTION

Single-thread performance is vital to system performance improvement. Data prefetching is a proven technique to improve single-thread performance by overlapping miss penalties with computation to hide long miss latency. A prefetcher predicts a future demand miss and fetches the cache line from slower memory (LLC, main memory) into faster memory (L1D, L2C) before the miss occurs. This helps to reduce pipeline stalls while waiting for memory. Over the past few decades, numerous prefetching mechanisms have been proposed, targeting different types of memory access patterns. The key differences reside in the types of exploited correlations (e.g., PC-address [6, 29, 30], PC-offset [38, 6], address-address [26, 51], etc.) and the types of targeted patterns (e.g., stride [29, 30], stream [19], irregular [67, 23, 66, 51]).

Evolving graphs (a.k.a. dynamic graphs) [17, 16, 32, 70, 13, 47, 36, 35] are graphs that change over time. The two types of graph dynamics are vertex dynamics, wherein the vertex set changes during computation, and edge dynamics, wherein edges are added and deleted from time to time. Many important applications [22, 37, 12, 16] use dynamic graphs to model complex relationships that change over time, such as recommendation systems [57], internet of things [63], and social networks [49].

Figure 1: Prefetcher coverage and accuracy of PageRankDelta [53] on amazon [33] graph

Existing hardware prefetchers [67, 23, 26, 15, 64, 68, 66, 51] have tried to exploit repeating patterns on different correlations. However, due to the lack of contextual information in building these correlations, existing prefetchers achieve limited performance improvement for dynamic graph applications. On the other side of the spectrum, software prefetchers rely on the programmer's expertise to issue prefetch instructions for future needs. This may help to fetch accurately but can increase the instruction count, which might cancel out the performance gain. Additionally, software prefetchers have little or no knowledge of run-time system dynamics (e.g., control flow, bandwidth utilization, cache conflicts), which makes timely and accurate prefetching very challenging. Recently proposed hardware-software cooperative prefetchers [56, 60, 8] fully utilize the inherent relationship between data structures in static graphs. These prefetchers use sequential dependencies for prefetching different data structures in the graphs, which can lead to late prefetches. For example, DROPLET [8] triggers vertex data prefetching only when DRAM services an edge miss, which is often too late [68].

An ideal irregular prefetcher should be able to (1) prefetch irregular data accurately and place it strategically in caches to strike a balance between cache contention and coverage, (2) prefetch in a timely manner, i.e., adapt to changing application phases, and (3) cover the misses that stall the processor. Vertex property array access in graph analytics is responsible for most misses due to the indirections used in graph data structures [28]. The vertex-neighbor relationship in graph analytics typically remains intact even in dynamic graphs, when vertices/edges are added or deleted at run time. One can exploit this relationship to develop a correlation between vertex accesses and misses on other data structures, thus adding contextual information to the correlation. This paper proposes a novel software-assisted “Access-to-Miss Correlation (AMC) hardware Prefetcher” to issue accurate and timely prefetches at the L2 cache. AMC selects the L2 cache as its prefetch destination. This is based on the observation in DROPLET [8] that using the L2 cache for prefetching leads to negligible cache pollution for graph applications. AMC uses L1 cache word accesses and L2 misses to form fine-grained access-to-miss correlations. Using lower-level cache accesses as a trigger to prefetch higher-level cache misses provides good prefetch timeliness. A lightweight programming model (Section IV) allows the programmer to choose a data structure (e.g., vertex array) as the target data structure. The AMC prefetcher records the cache misses in between target data structure accesses to create access-to-miss correlation entries and updates these entries at run time. In addition, AMC exploits the existing one-to-one correlation between the graph data structures (frontier-vertex array) as well.

Fig 1 shows the prefetching accuracy and L2 miss coverage comparison of the proposed AMC prefetcher with five existing prefetchers when running a PageRankDelta (PGD) application [53]: two spatial prefetchers (Bingo [6], VLDP [51]), and three temporal prefetchers (MISB [67], ISB [23], and RnR [68]). The key innovations of the proposed AMC prefetcher are:

  • AMC uses a lightweight programming model to record access-to-miss correlations for prefetching in dynamic graphs. Previous hardware-software cooperative prefetchers [56, 8, 60, 68] either record the miss sequence directly or rely on programmer/compilation techniques to analyze detailed graph data structure dependencies. The proposed AMC's lightweight interface requires the programmer to identify only two data structures and uses the underlying hardware to develop an access-to-miss correlation for prefetching, which adapts to the changing nature of dynamic graphs.

  • AMC exploits a novel many-to-many correlation between target data structure accesses and other data misses. Existing prefetchers [26, 67, 23, 6, 15, 51, 66] use one-to-one, one-to-many, or many-to-one correlations, which cannot distinguish similar memory access patterns and can lead to inaccurate prefetches [65]. AMC uses a sequence of target accesses as the triggering event to provide contextual information to distinguish similar memory access patterns accurately.

  • AMC uses an on-chip SRAM to cache the miss stream in FIFO order and compresses the miss stream to reduce off-chip traffic and storage. Prior works [43, 67, 23, 66, 51, 26] used either a tabular or an associative cache to store misses, which requires a sizeable on-chip area to store metadata. AMC streams metadata in FIFO order to simplify the on-chip storage (Table VIII) and uses BaseΔ compression [46] to reduce off-chip traffic and storage.
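
The last bullet mentions compressing the miss stream with a base-plus-delta scheme. As an illustration of the general idea only, the following minimal Python sketch stores a full base address followed by small signed deltas. The function names, the 16-bit delta width, and the chunking policy are assumptions made for this example; they are neither the exact scheme of [46] nor AMC's actual metadata format.

```python
from typing import List, Tuple

Chunk = Tuple[int, List[int]]  # (base address, signed deltas relative to the base)


def compress_miss_stream(addrs: List[int], delta_bits: int = 16) -> List[Chunk]:
    """Greedy base+delta encoding (hypothetical policy, for illustration).

    An address joins the current chunk if its distance to the chunk's base
    fits in a signed delta of `delta_bits` bits; otherwise it starts a new chunk.
    """
    lo = -(1 << (delta_bits - 1))
    hi = (1 << (delta_bits - 1)) - 1
    chunks: List[Chunk] = []
    for addr in addrs:
        if chunks:
            base, deltas = chunks[-1]
            delta = addr - base
            if lo <= delta <= hi:
                deltas.append(delta)
                continue
        chunks.append((addr, []))
    return chunks


def decompress_miss_stream(chunks: List[Chunk]) -> List[int]:
    """Rebuild the original address stream, preserving FIFO order."""
    out: List[int] = []
    for base, deltas in chunks:
        out.append(base)
        out.extend(base + d for d in deltas)
    return out
```

Misses that fall close to each other share one full-width base, so only the first address of each chunk pays the full address cost; the stream order is preserved, which matches the FIFO organization described above.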

II BACKGROUND AND RELATED WORKS

Prefetcher Design | Correlation Style | What to Prefetch | Storage Format | When to Prefetch
AMC (Proposed) | many-to-many: target access stream - miss addr stream | misses other than the target data structure | compressed miss stream | target addr access
RnR [68] | one-to-many: window count - offset stream | defined by software | irregular data structure offset | software assist, replay timing control mechanism
ISB [23] | one-to-many: PC - addr stream | no constraint | TLB-dependent compressed format | cache access
MISB [67] | one-to-many: PC - addr stream | no constraint | 8-byte (single mapping) | cache access
Bingo [6] | one-to-many: PC - addr/offset stream, addr - addr stream | no constraint | on-chip TAGE-like history table | cache miss in a new page
VLDP [51] | one/many-to-one: delta/page offset - delta | no constraint | cascaded recent delta table | cache access
TABLE I: Comparison to other prefetchers

This section discusses the background of evolving graph applications, prior work on prefetchers, and recently proposed accelerators for graph applications.

II-A Evolving Graphs

Applications that use graph-based algorithms and data structures in real-world scenarios where the relationships between entities constantly evolve are known as evolving graph applications. This work uses two types of evolving graph applications: iterative graph algorithms with early convergence and graph applications with changing input graphs. Early-convergence iterative graph algorithms, like PGD [53], offer an optimization over PageRank [44]. PGD typically requires fewer iterations than PageRank and has a faster runtime. This is possible because, in each iteration, PGD only updates vertices whose PageRank value has changed by more than some δ-fraction. Therefore, in every iteration, only a set of active vertices is involved in the PageRank calculation, resulting in less computation but non-repetitive irregular memory access patterns. Section III discusses performance challenges with such irregular patterns. For graph applications with changing input, the inputs are generated with the method explained in Section VI, which is similar to prior work [70, 35].

Evolving graph applications typically use a frontier array, which is a bit map, to keep track of the vertices participating in the upcoming iteration or computation. This establishes a one-to-one correlation between the frontier and vertex accesses. Additionally, the inherent vertex-to-neighbor correlation is a one-to-many correlation between the vertex and its neighbor accesses. This can be used to fetch data structures related to the vertices present in the frontier. AMC takes advantage of these two properties of evolving graphs to build the correlations explained in Section III.

II-B Prior Work on Prefetchers

AMC is a hardware-software cooperative prefetcher. The closest related works in the same category are RnR [68], DROPLET [8], and Prodigy [56].

DROPLET [8] uses a specialized malloc function to identify a graph application’s targeted data structures (vertex and vertex property). DROPLET generates the addresses for an indirectly accessed vertex property value by prefetching the edge array. DROPLET [8] triggers vertex data prefetching only when DRAM services an edge miss, which is often too late [68]. Prodigy [56] uses either compiler profiling or program annotation to generate data flow graphs of graph data structures. It uses a demand access to the vertex node to prefetch the next vertex node and waits for the vertex node to be filled at the destination cache to initiate prefetches for its outgoing edges using the prefetched data. This requires a complete software stack change, including rewriting code, compiler, and OS, just to optimize the prefetching of graph data structures. The RnR prefetcher [68] targets long, repetitive, irregular memory access patterns in iterative algorithms. It improves cache miss coverage and accuracy by recording in the initial iteration and replaying the miss patterns for prefetching in the following iterations. In dynamic graphs, wherein the vertices/edges change over time, RnR does not work well. AMC solves this issue by recording the access-to-miss correlations that are preserved in dynamic graphs.

Similar to Prodigy, a case study on a HW-SW cooperative prefetcher presented in MetaSys [60] also relies on sequential dependencies between data structures. For dynamic graphs, another limitation of these prefetchers [56, 60, 3] is their inability to adapt to runtime dynamics (conditional branches). For example, PGD avoids redundant computation by examining only the vertices whose PageRank value changed by a set threshold in the previous iteration. These prefetchers fail to account for control-flow knowledge when prefetching. AMC overcomes the dependency challenge by using a single data structure as the triggering data structure to prefetch all of the other misses. In order to adapt to vertex and edge changes in dynamic graphs, AMC continuously updates the correlations in every iteration and uses the latest ones to prefetch.

Temporal prefetchers [66, 67, 26, 23] record memory accesses and then correlate each access to either its PC or the previous access. They typically have high metadata storage overhead because they store a long sequence of memory addresses and are unable to delete useless metadata. The closest related works to AMC in this category are ISB [23] and MISB [67]. ISB [23] uses the TLB and structural addresses to map physical memory addresses to structural addresses and stores them using PC localization. It suffers from high metadata overhead, does not scale with large page sizes, and does not work with modern hierarchical TLBs. MISB [67] solves this problem by employing a next-line prefetcher for metadata accesses, since the structural address space is spatial, and by removing the TLB dependency when managing metadata caches. Unfortunately, when an application's input size grows, the metadata also grows. The problem is more severe for on-chip only prefetchers like Triage [66] because they do not have off-chip metadata to fall back on to record growing metadata. The AMC prefetcher continuously updates the correlation table with only the latest correlations and compresses the metadata to reduce the storage overhead. Temporal prefetchers [23, 67, 66, 26, 18] also suffer from the aliasing problem [65]: multiple addresses can correlate to the same trigger event. The AMC prefetcher solves this problem by linking multiple target accesses with the miss stream, thus adding contextual information to the correlation (Section III).

DVR [42] is a recently proposed architecture over VR [41] targeting access patterns of the form C[hash(B[hash(A[i])])]. DVR utilizes vector functional units and vector registers to execute indirect memory instructions in advance. It groups together instructions with the same offset into a single vector instruction. However, DVR’s progress in extracting MLP can be impeded if a previous instruction encounters a branch misprediction, and it heavily relies on core structures such as the ROB, VRAT, and stride detectors. An evolving graph uses a conditional branch instead of a hash function and may not utilize vector functional units effectively. According to [42], DVR may also be less effective in scenarios where the number of iterations is low, such as in BFS. In contrast, AMC is decoupled from the core’s microarchitectural components. It records and intelligently replays indirect memory data structure accesses to improve performance. It depends on the recording from the previous iteration to extract MLP and fully utilizes the L2’s MSHRs without competing with demand loads.

Spatial prefetchers [51, 6, 15, 19] exploit the address delta similarity between cache accesses among different memory regions, which arises due to a fixed and regular memory layout of data objects. Such memory address patterns are common in server applications [48] (e.g., OLTP, DSS). The advantage of such spatial prefetchers is that they require less metadata. VLDP [51] targets irregular access patterns within a page. Unlike a regular access pattern prefetcher, VLDP tries to predict a common pattern amongst past deltas. It uses a TAGE-like table [50] to solve the aliasing, leading to better accuracy. TAGE-like history refers to using multiple history lengths stored in various tables. In this approach, the prefetcher looks up multiple history tables to generate predictions rather than depending on a single history table, which offers better prediction accuracy and lowers the aliasing probability. Moreover, it can adapt to changing application phases and capture complex correlations between memory access patterns, making it applicable to dynamic graph applications. Bingo [6] also uses an optimized TAGE-like table wherein the multiple history tables are fused into a single unified table and looked up multiple times with different history lengths. This reduces the overall storage overhead of traditional TAGE-like predictors. AMC also uses multiple trigger accesses. The novelty of AMC is to use accesses of only the targeted data structure as the trigger, which helps to improve the prefetching accuracy for evolving graph applications.

Software-based prefetching [4, 40, 2] for linked data structures requires programmer/compiler analysis to identify the pointer-chasing accesses responsible for cache misses. This often requires significant effort to issue prefetch requests far enough ahead of the demand requests to be timely. Ainsworth and Jones proposed [1] a configurable prefetcher aimed at improving performance for graph workloads. However, it only targets specific traversals for a certain graph format. Event-triggered programmable prefetchers [3] employ an array of mini programmable prefetcher units to target heterogeneous access patterns using compiler profiling and to maximize memory-level parallelism, particularly in A[B[C[i]]], wherein array C can be prefetched before its demand access, which in turn enables the prefetching of arrays B and A. ATP [10] explains the hardware complexity of Ainsworth and Jones’s proposed prefetcher for indirect memory access and has a timeliness problem similar to DROPLET [8]. ATP uses instructions to communicate data structure knowledge and a strategy similar to IMP to calculate linked data structures. In contrast, AMC does not rely on data-based prefetching, but instead relies on its previous recordings for prefetching. These compiler-profiling approaches [4, 40, 2, 1, 3, 24] require software stack changes, including rewriting the code, updating the compiler, and changing the operating system to optimize the prefetching of data structures. Additionally, they do not adapt well to run-time changes due to context switches or speculation misprediction. Furthermore, software prefetching increases the overall instruction code size.

Table I summarizes the key differences between AMC and its closely related prefetchers.

II-C Dynamic graph accelerators

Accelerators for dynamic graphs have been proposed [70, 13, 47, 36, 9, 62, 69] as stand-alone accelerators or near/in-memory processing engines. These accelerators often necessitate custom hardware designs and programming models. AMC leverages the existing software and hardware framework and makes modest modifications to the current system to provide performance improvements comparable to those of the dynamic graph accelerators. These accelerators employ graph prefetchers responsible for prefetching neighbors and their property data, which use a strategy similar to [56, 60, 8] to prefetch graph data structures and suffer from sequential dependencies between graph data structures.

III Motivation and Key Idea

Dynamic graph applications exhibit non-repetitive, irregular memory access patterns. These patterns are difficult to predict using existing prefetchers [66, 26, 43, 23, 67] that use history tables to record and correlate access addresses with either the corresponding PC or previous access addresses. These prefetchers can be categorized as using one-to-many or one-to-one correlations based on the number of accesses linked to a single trigger event. Take PGD [53] as an example. Vertices whose PageRank value has changed by more than a set δ-fraction in the previous iteration are active in the current iteration. Hence, the set of active vertices in the current iteration will differ from those of the previous and subsequent iterations, as shown in Fig 2. In PGD (a push-based algorithm), the vertices send their PageRank value to their neighbors to update their PageRanks in every iteration. Since the active vertices might change in every iteration, the correlations might also change from the previous iteration.

Figure 2: Active vertices in PGD across the iterations.

For this particular example in Fig 2, the active vertex set in iteration 1 consists of all the vertices in the graph. The active vertices change to four (1, 4, 6, 7) in iteration 2. According to the dependency of indirect data structure accesses among the three arrays (V: vertex array, N: neighbor array, P: vertex property array), as shown in Fig 3, the memory access (misses are marked by *) sequence would look like the following:

Figure 3: PGD traversal on a graph.

Iteration 1: (all vertices are active) V[1], N[2]*, P[2]*, N[3], P[3]*, V[2], N[1], P[1]*, N[3], P[3]*, V[3], N[4]*, P[4]*, N[5]*, P[5]*, N[6]*, P[6]*, V[4], N[3], P[3]*, ….

Iteration 2: (vertex 1, 4, 6, 7 are active) V[1], N[2]*, P[2]*, N[3]*, P[3]*, V[4], N[3]*, P[3]*, V[6], N[3]*, P[3]*, V[7], N[5]*, P[5]*

The letter b in N[b], which follows V[a], names a neighbor of vertex a. Address-to-address correlation-based prefetchers [26, 43] record correlations between adjacent addresses during runtime. In iteration 2, the demand access to vertex 2, made to check whether this vertex is active in the current iteration, triggers the prefetcher, leading to useless prefetching of vertex 2’s neighbor (vertex 3) that will not be accessed in iteration 2.
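
To make the notation concrete, the following Python sketch regenerates the access order of the example from a frontier bit map. It is a simplified illustration: the adjacency lists are read off the two traces above, the empty neighbor list of vertex 5 is an assumption (its neighbors do not appear in the text), the function name is hypothetical, and miss markers and frontier-check accesses are not modeled.

```python
from typing import Dict, List

# Adjacency read off the example traces. Vertex 5's neighbors are not shown
# in the text, so the empty list below is an assumption.
NEIGHBORS: Dict[int, List[int]] = {
    1: [2, 3],
    2: [1, 3],
    3: [4, 5, 6],
    4: [3],
    5: [],
    6: [3],
    7: [5],
}


def pgd_access_trace(frontier: List[bool]) -> List[str]:
    """Return the V/N/P access order for one push-style iteration.

    frontier[v] is True when vertex v is active; index 0 is unused so that
    vertex numbers match the example.
    """
    trace: List[str] = []
    for v in range(1, len(frontier)):
        if not frontier[v]:
            continue                    # inactive vertices are skipped
        trace.append(f"V[{v}]")         # vertex array access
        for n in NEIGHBORS[v]:
            trace.append(f"N[{n}]")     # neighbor array access
            trace.append(f"P[{n}]")     # indirect vertex property access
    return trace


if __name__ == "__main__":
    iteration2 = [False, True, False, False, True, False, True, True]
    print(", ".join(pgd_access_trace(iteration2)))
```

With vertices 1, 4, 6, and 7 active, the generated order matches the Iteration 2 sequence above without the miss markers, which shows how the frontier alone decides which vertex-to-neighbor accesses recur.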

PC | Address Stream
A | V[1], V[2], V[3], V[3], V[4], V[5], V[6], V[7]
B | N[1], N[2], N[4], N[5], N[6], N[3], N[7], N[5]
C | P[1], P[2], P[4], P[5], P[6], P[3], P[7], P[5]
TABLE II: MISB Correlation.

Recent prefetchers [66, 67, 23] combine PC localization with address correlation to build correlations as shown in Table II. The accuracy for MISB is 14% whereas the coverage is 7% for covering iteration 2 misses. In this example, it is assumed that each element in the array (V, N, and P) occupies a single cache line for simplicity.

Spatial prefetchers [51, 19, 6, 15] are limited to recording accesses within a physical page and then prefetching them in the next demand page. Assume that the vertex, neighbor, vertex property, and frontier arrays lie in four consecutive pages. VLDP develops correlations between the page offsets in the OPT and the block offsets in the DPTs (with various trigger lengths), as shown in Table III. Consider the Iteration 2 access pattern (shown above), wherein all the accesses to the neighbor and vertex property arrays are L2 misses for the baseline with no prefetcher. The accuracy for VLDP is 43%, whereas the coverage is 21% for covering iteration 2 misses. This shows that MISB suffers from the aliasing problem and that the correlation style of VLDP can overcome this problem to provide comparatively better accuracy.

Delta | Prediction
1 | 1
DPT 1.

Delta | Prediction
1, 1 | 1
1, -2 | 3
DPT 2.

Delta | Prediction
1, 1, 1 | 1
1, -2, 3 | 1
DPT 3.

Offset | Prediction
1 | 1
OPT.

TABLE III: VLDP Correlation.
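
The cascaded delta tables of Table III can be read as a longest-match lookup: the table indexed by the longest matching delta history supplies the prediction. The Python sketch below illustrates only this lookup order using the DPT entries of Table III; the function name is hypothetical, and the OPT, table replacement, and confidence handling of a real VLDP implementation are left out.

```python
from typing import Dict, List, Optional, Tuple

# Entries copied from DPT 1, DPT 2, and DPT 3 of Table III.
# DPTS[k] is indexed by the last k + 1 deltas.
DPTS: List[Dict[Tuple[int, ...], int]] = [
    {(1,): 1},
    {(1, 1): 1, (1, -2): 3},
    {(1, 1, 1): 1, (1, -2, 3): 1},
]


def predict_next_delta(history: List[int]) -> Optional[int]:
    """Return the prediction from the longest matching delta history."""
    for length in range(len(DPTS), 0, -1):      # try 3 deltas, then 2, then 1
        if len(history) < length:
            continue
        key = tuple(history[-length:])
        prediction = DPTS[length - 1].get(key)
        if prediction is not None:
            return prediction
    return None                                  # no table matches


if __name__ == "__main__":
    print(predict_next_delta([1, -2, 3]))   # matched in DPT 3
    print(predict_next_delta([4, 1, -2]))   # falls back to DPT 2
```

Longer histories disambiguate patterns that share a most recent delta, which is the aliasing reduction attributed to the TAGE-like organization.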

With some data structure knowledge, the vertex-neighbor correlation can be identified during graph traversal. This knowledge can be provided either through an application programming interface or by compiler analysis [56]. The key idea of AMC is to use the “access-to-miss correlation” between target data structure accesses and other misses to add contextual information to correlation-based prefetching, so that it can adapt to vertex and edge changes in dynamic graph applications.

These inter-data-structure correlations are relatively easy to extract from the source code of dynamic graph applications. Algorithm 1 shows AMC's lightweight interface in the PGD application. AMC function calls are explained in Table V and Section IV. First, the programmer needs to identify the target regular data structure, which acts as the trigger to record and prefetch the stored miss stream. This data structure is typically a vertex array in graph analytics (Delta in PGD). Second, the programmer needs to identify the data structure that stores the active vertices of the iteration (Frontier in PGD).

Trigger access | Miss stream
V[1] | N[2], P[2], P[3]
V[1], V[2] | P[1], P[3]
V[2], V[3] | N[4], P[4], P[5], N[6]
V[3], V[4] | P[3]
V[5], V[6] | N[3]
V[6], V[7] | N[5]
TABLE IV: AMC Correlation Recording.

The AMC prefetcher builds access-to-miss correlations between L1 target data accesses and L2 misses (excluding L2 misses on the target data). These L2 misses are the misses that happen in the time frame between two L1 target accesses. An L1 target data access is the trigger event to prefetch the correlated miss stream associated with it. The AMC prefetcher observes access patterns in the previous iteration and builds correlation entries as shown in Table IV. In iteration 2, AMC has 60% accuracy and 43% coverage over the baseline with no prefetcher. The AMC prefetcher records virtual addresses of the target data accesses to facilitate faster lookup in the AMC Cache (Section V-C1) and builds fine-grained correlations between vertex accesses and misses. Virtual addressing enables AMC to look up the AMC Cache in parallel with the L1 data cache accesses, before address translation.
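
The recording step described above can be sketched in software as follows. This is a minimal Python model under stated assumptions, not the hardware design: accesses are symbolic strings instead of addresses, each entry is keyed by the (up to) two most recent target accesses, and the function names are hypothetical. The input is the Iteration 1 trace exactly as printed in this section, so the recorded streams follow that trace literally.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

ITERATION_1 = (
    "V[1], N[2]*, P[2]*, N[3], P[3]*, V[2], N[1], P[1]*, N[3], P[3]*, "
    "V[3], N[4]*, P[4]*, N[5]*, P[5]*, N[6]*, P[6]*, V[4], N[3], P[3]*"
)


def parse_trace(text: str) -> List[Tuple[str, bool]]:
    """Turn 'N[2]*' style tokens into (access, is_l2_miss) pairs."""
    return [(tok.rstrip("*"), tok.endswith("*")) for tok in text.split(", ")]


def record_correlations(
    trace: List[Tuple[str, bool]], target_prefix: str = "V"
) -> Dict[Tuple[str, ...], List[str]]:
    """Map each sequence of recent target accesses to the misses that follow it."""
    recent: Deque[str] = deque(maxlen=2)           # two most recent target accesses
    table: Dict[Tuple[str, ...], List[str]] = {}
    for access, is_miss in trace:
        if access.startswith(target_prefix):
            recent.append(access)                  # a target access opens a new entry
            table[tuple(recent)] = []
        elif is_miss and recent:
            table[tuple(recent)].append(access)    # non-target L2 miss joins the entry
    return table


if __name__ == "__main__":
    for trigger, misses in record_correlations(parse_trace(ITERATION_1)).items():
        print(", ".join(trigger), "->", ", ".join(misses))
```

Because the key contains the preceding target access as well as the current one, two occurrences of the same vertex access reached through different predecessors land in different entries, which is the contextual information that a one-to-one or one-to-many table lacks.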

Algorithm 1 PGD using AMC prefetcher
1:procedure Init():
2:    Frontier = {1, …, 1}
3:    Delta = {1/N, …, 1/N}
4:    nghSum = {0, …, 0}
5:    PR = {0, …, 0}
6:    error = ∞
7:    AMC.AddrTBase(Delta, N)
8:    AMC.AddrFBase(Frontier, N)
9:    return 1
10:
11:procedure Update(s, d):
12:    atomic_increment(nghSum[d], Delta[s] / degree(s))
13:    return 1
14:
15:procedure Compute(i):
16:    Delta[i] = α × nghSum[i]
17:    PR[i] = PR[i] + Delta[i]
18:    return (abs(Delta[i]) > δ)
19:
20:procedure PGD(G, α, ϵ):
21:    AMC.init()
22:    INIT()
23:    while (error > ϵ) do
24:         Frontier = EDGEMAP(G, Frontier, UPDATE)
25:         Frontier = VERTEXMAP(Frontier, COMPUTE)
26:         error = sum of nghSum entries
27:         AMC.update()
28:    AMC.end()
29:    return PR

IV Software Support for AMC prefetcher

This section describes AMC functions, architectural state registers, and OS support that are required by the proposed design. An example of using AMC's programming interface for PGD is demonstrated as well.

Function | Definition
AMC.init() | Set ASID for permission check, allocate memory for AMC storage
AMC.AddrFBase(addr, size) | Add base address with its corresponding size for frontier data structure
AMC.AddrTBase(addr, size) | Add base address and corresponding size for target data structure
AMC.update() | Set prefetching phase, metadata storage management, resets target access count register
AMC.end() | Free AMC storage memory space
TABLE V: AMC Function Calls

IV-A AMC Functions and Architectural State Registers

AMC requires the following additional architectural registers: (1) two pairs of address range registers to hold the start address and size of the target and frontier data structures, (2) a prefetch phase register to enable prefetching after an initial iteration and recycle the off-chip AMC space in successive iterations, (3) a target access count register, (4) a miss count register, and (5) four pairs of off-chip AMC storage registers to hold the head and tail pointers for the current and the next AMC miss addresses and AMC index storage.

AMC uses address space identifiers (ASIDs) to distinguish access streams from different processes and to perform permission checks. The target recorder and frontier buffer in Fig. 4 use a pair of address range registers to filter out the target and frontier accesses from L1 data load accesses. The OS allocates off-chip memory space to store both the current and the next AMC miss addresses and AMC index storage on the AMC.init() function call. AMC reserves up to 20% of the input size for off-chip AMC storage (Section VII-D). One can repurpose unused architectural registers as these special registers for the AMC prefetcher. An ASID register stores the ASID of the current process using the prefetcher. The target access count register counts the number of L1 data cache accesses performed to the target data structure (Section V-A). The miss count register counts the number of L2 misses recorded per AMC entry and is used by the compressor unit (Section V-B). The target access count is used to identify a unique target access during an iteration, whereas the miss count counts the number of L2 misses following a target access.

The AMC.AddrTBase(addr, size) and AMC.AddrFBase(addr, size) functions provide the system with information about the target and frontier data structures. The target address range register is set at memory allocation time. Using the target address range, the AMC prefetcher can recognize whether an access is within the target range (Section V-A). AMC.update() controls the prefetching phase register, manages metadata storage, and resets the target access count register (Section V-A). A set prefetching phase register denotes that prefetching is enabled, whereas an unset register denotes that prefetching is disabled. The OS does not disable prefetching except during a context switch (Section IV-B). The AMC.end() function frees the AMC off-chip storage, resets all the AMC architectural registers, and invalidates all the entries in the AMC Cache at the end of the execution.

Furthermore, the OS needs to allocate off-chip memory space to store two copies of the AMC miss addresses and AMC index, one for the recording phase and one for the prefetching phase (shown in Fig. 4). One copy stores the correlations from the previous iteration to perform prefetching, and the other learns the correlations in the current iteration. The OS maintains the off-chip address range registers. At the iteration boundary, AMC invalidates its prefetching phase memory space and reuses it to store the upcoming iteration’s correlations. In short, the recording phase and prefetching phase storage swap roles at the iteration boundary.
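
The register behavior described in this subsection can be summarized with a toy software model. The Python sketch below is an illustration under assumptions, not the hardware or OS implementation: the class name, method names, and the use of Python lists for the two off-chip regions are hypothetical, and sizes are taken to be in bytes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class AMCModel:
    """Toy model of the AMC architectural state (hypothetical names)."""
    asid: Optional[int] = None
    target_range: Optional[Tuple[int, int]] = None    # (base, size in bytes)
    frontier_range: Optional[Tuple[int, int]] = None  # (base, size in bytes)
    prefetch_phase: bool = False
    target_access_count: int = 0
    recording: List[int] = field(default_factory=list)    # learned this iteration
    prefetching: List[int] = field(default_factory=list)  # replayed this iteration

    def init(self, asid: int) -> None:
        """AMC.init(): clear all state and remember the owning process."""
        self.end()
        self.asid = asid

    def addr_t_base(self, addr: int, size: int) -> None:
        """AMC.AddrTBase(): set the target address range register."""
        self.target_range = (addr, size)

    def addr_f_base(self, addr: int, size: int) -> None:
        """AMC.AddrFBase(): set the frontier address range register."""
        self.frontier_range = (addr, size)

    def is_target(self, addr: int) -> bool:
        """Range filter applied to L1 data accesses."""
        if self.target_range is None:
            return False
        base, size = self.target_range
        return base <= addr < base + size

    def update(self) -> None:
        """AMC.update(): iteration boundary, swap the two storage regions."""
        self.prefetching.clear()                   # invalidate old prefetch data
        self.recording, self.prefetching = self.prefetching, self.recording
        self.prefetch_phase = True                 # prefetch from now on
        self.target_access_count = 0

    def end(self) -> None:
        """AMC.end(): release storage and reset every register."""
        self.asid = None
        self.target_range = None
        self.frontier_range = None
        self.prefetch_phase = False
        self.target_access_count = 0
        self.recording, self.prefetching = [], []
```

A typical call order mirrors Algorithm 1: `init`, then `addr_t_base` and `addr_f_base`, one `update` per iteration, and `end` after the loop. Keeping two regions and swapping them means the correlations learned in one iteration are never overwritten while they are being replayed in the next.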

IV-B OS Extension

The OS is responsible for process management, interrupt service, I/O, virtualization, and resource management of the different cores in the system. In case of long-latency events such as page faults or interrupt service routines, the OS needs to switch out the current process, known as a context switch, and handle the event. Conventional prefetchers either flush the metadata entries or save them in memory to retrieve them later. The AMC prefetcher can reuse its old metadata after the process is switched back in only when none of the process's physical pages have been swapped. During a context switch, physical pages are swapped when memory runs out of physical pages to allocate to the newly scheduled process, which is not typical. If pages are swapped, AMC resets its metadata, disables prefetching, and restarts from the recording phase. Dynamic graph applications consist of multiple iterations. Therefore, AMC can quickly perform its recording phase in the current iteration and start prefetching in the next iteration.

IV-C An Example of using AMC's Programming Interface

The PGD algorithm from the Ligra [53] suite is modified to demonstrate how to use AMC's programming interface (Algorithm 1). Lines 1-9 initialize the data structures used in the algorithm. Lines 11-13 calculate the page rank value for each vertex present in the frontier (the active vertices in the current iteration). Lines 15-18 calculate the set of vertices for the next iteration. Purple-colored function calls are AMC functions. Line 21 initializes the AMC prefetcher registers, allocates off-chip memory for AMC metadata storage, and resets all of the AMC Cache entries (Section V) as well as all of the architectural state registers. Lines 7 and 8 define the virtual address ranges of the target data structure and the frontier with their corresponding sizes (N is the number of elements).

Line 27 denotes the boundary of an iteration. It sets the prefetching phase register to enable prefetching after the initial iteration. Additionally, after every iteration, it invalidates the off-chip AMC storage that was used to prefetch the miss stream in the current iteration and reuses it for recording in the next iteration. It does not invalidate the correlations recorded in the current iteration. Finally, line 28 terminates the AMC prefetcher and frees its off-chip memory space.

V AMC architecture

Figure 4: An overview of the proposed AMC architecture. These blocks are private to the core.

The AMC prefetcher adds a few architectural components to a conventional cache hierarchy, as shown in Fig 4. The Binder, Compressor, and off-chip storage space are used during correlation recording, explained in Sections V-A and V-B. The Frontier Buffer, Target Recorder, AMC Index Identifier, AMC Cache Prefetcher, AMC Cache, and Decompressor are used during prefetching, explained in Section V-C. The Architectural Registers and Target Recorder are shared between correlation recording and prefetching.

V-A Correlation Recording

AMC records many-to-many correlations between accesses to a target data structure and L2 misses. Target data accesses identified in the L1 cache act as trigger events for prefetching into the L2 cache. Hence, the AMC needs to record the target data structure's accesses at the L1 data cache and the subsequent misses from the L2 cache to build the correlations. An AMC correlation entry consists of two target accesses (2 × 64 bits) and up to 20 misses (20 × 46 bits). A Target Recorder identifies the target L1 accesses and holds up to the two most recent target accesses. The Target Recorder includes a target access counter that increments on every demand target access to the L1 data cache. This counter is reset when the AMC.update() function is invoked.

L2 misses that do not belong to the target address range are tagged with the latest target access count value at the time of the L2 miss and forwarded to a Binder to build a correlation entry. A Miss Count holds the number of misses belonging to the same target access count. When a miss with a different target access count value arrives at the Binder, or when the Miss Count reaches 20, the current entry is compressed and sent to memory, the Miss Count is reset, and a new correlation entry is started. The access count is used to retrieve the correlated target accesses from the Target Recorder.
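The Binder behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design: the class and field names (`Binder`, `CorrelationEntry`, `flushed`) are hypothetical, and a Python list stands in for "compressed and sent to memory".

```python
from dataclasses import dataclass, field

MAX_MISSES = 20  # misses per correlation entry, as stated in the text


@dataclass
class CorrelationEntry:
    access_count: int   # target access count that tags these misses
    targets: tuple      # up to two most recent target accesses
    misses: list = field(default_factory=list)


class Binder:
    """Toy software model of the Binder (hypothetical names throughout)."""

    def __init__(self):
        self.current = None   # entry currently being built
        self.flushed = []     # stands in for "compressed and sent to memory"

    def on_l2_miss(self, miss_addr, access_count, recent_targets):
        # A miss tagged with a different access count closes the open entry.
        if self.current is not None and self.current.access_count != access_count:
            self._flush()
        if self.current is None:
            self.current = CorrelationEntry(access_count, tuple(recent_targets))
        self.current.misses.append(miss_addr)
        # Reaching the miss limit also closes the entry.
        if len(self.current.misses) == MAX_MISSES:
            self._flush()

    def _flush(self):
        self.flushed.append(self.current)
        self.current = None


binder = Binder()
for i in range(25):                          # 25 misses under access count 1
    binder.on_l2_miss(0x1000 + i, 1, (0x10, 0x18))
binder.on_l2_miss(0x9000, 2, (0x18, 0x20))   # a new access count arrives
entry_sizes = [len(e.misses) for e in binder.flushed]
```

In this example the 25 misses that share one access count are split into a full entry of 20 and a second entry of 5, which is closed when the miss tagged with the next access count arrives.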

V-B Storing Correlations

Figure 5: Compressor Design.

The AMC prefetcher maintains two off-chip metadata storage regions simultaneously: one for the recording phase and the other for prefetching. Each region contains two tables: Miss Addresses and AMC Index. The AMC prefetcher stores the correlations learned during the current iteration and uses the correlations learned from the last iteration for prefetching. At the end of every iteration, when AMC.update() is invoked, the head pointers of these two memory spaces are swapped so that the latest correlations are used for prefetching while the other memory space is recycled for recording new correlations. This continuous learning while prefetching helps to capture the changes in vertices and edges of the dynamic graphs.
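The role reversal at the iteration boundary resembles double buffering. The following is a minimal sketch of that idea, assuming Python lists in place of the off-chip regions; `MetadataRegions` and its members are hypothetical names, and the model ignores the split into Miss Addresses and AMC Index tables.

```python
class MetadataRegions:
    """Toy model of the two off-chip regions; lists stand in for memory."""

    def __init__(self):
        self.recording = []      # correlations learned in the current iteration
        self.prefetching = []    # correlations learned in the previous iteration
        self.prefetch_enabled = False

    def record(self, entry):
        self.recording.append(entry)

    def update(self):
        # Iteration boundary: swap roles, then recycle the old prefetch region.
        self.recording, self.prefetching = self.prefetching, self.recording
        self.recording.clear()
        self.prefetch_enabled = True   # prefetching starts after the first iteration


regions = MetadataRegions()
regions.record("iter0-entry")   # recorded during the initial iteration
regions.update()                # iteration boundary
regions.record("iter1-entry")   # recorded while prefetching from iteration 0
```

After the first `update()`, the entry recorded in the initial iteration is available for prefetching, while the recycled region collects the entries of the next iteration.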

The Miss Addresses table stores compressed miss address entries of different sizes in a compact format. When a new entry is compressed and sent to memory, the tail pointer of the recording phase Miss Addresses table is incremented by this entry's compressed size to mark where the next entry is stored. The AMC Index stores the target addresses and metadata of each correlation entry. Each AMC Index entry consists of two target addresses, the compression mode, a pointer into the Miss Addresses table, and the number of correlated misses.

A lightweight BaseΔ compression variant [46] is designed to save off-chip bandwidth and reduce metadata storage. An AMC entry stores up to 20 physical miss addresses (52-bit physical memory) without the block offset (20 × 46 bits without compression). AMC uses the physical block address of the first miss as the base (46 bits) and three different delta sizes (1, 2, or 4 bytes) to compress the other misses. A 2-byte-Δ example is shown in Fig 6. When all of the addresses can be represented as Base+Δ, the entry can be compressed with the corresponding delta size. Fig 5 illustrates a high-level view of the compressor design. The compressor uses the Miss Count to activate a number of subtraction units equal to the number of misses. The three delta sizes are tested in parallel, and the delta selection logic selects the smallest delta size that can compress the entry.

Figure 6: 46-bit base 2-byte-Δ compression example.

The target access addresses do not undergo compression because the target addresses play a critical role in the AMC Cache, as explained in Section V-C1 and Section V-C2. Compressing the target access addresses adds delay to the critical path and might lead to late prefetches. Instead, only the delta of the target accesses is recorded by the Target Recorder, using the target start address stored in the architectural address range register (Section IV). For the evaluated workloads, the compression ratios for 20 recorded uncompressed misses using different deltas are 4.5 (920/206) for 1-byte-Δ (best case), 2.51 (920/366) for 2-byte-Δ, and 1.34 (920/686) for 4-byte-Δ (worst case).
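The compression scheme and the ratios quoted above can be reproduced with a short sketch. It assumes that deltas are signed offsets from the first miss address (the text does not state the sign convention) and that a compressed entry holds one 46-bit base plus one delta per miss, which is what the 206, 366, and 686-bit figures imply. All function names are hypothetical.

```python
BASE_BITS = 46            # physical block address width from the text
DELTA_BYTES = (1, 2, 4)   # the three delta sizes from the text


def compress(block_addrs):
    """Return (base, delta_bytes, deltas), or None if no delta size fits.

    Assumption: deltas are signed offsets from the first miss address.
    """
    base = block_addrs[0]
    deltas = [addr - base for addr in block_addrs]
    for nbytes in DELTA_BYTES:                      # smallest size first
        lo = -(1 << (8 * nbytes - 1))
        hi = (1 << (8 * nbytes - 1)) - 1
        if all(lo <= d <= hi for d in deltas):
            return base, nbytes, deltas
    return None                                     # would stay uncompressed


def decompress(base, deltas):
    return [base + d for d in deltas]


def compressed_bits(n_misses, nbytes):
    # One 46-bit base plus one delta per miss, matching the quoted sizes.
    return BASE_BITS + n_misses * 8 * nbytes


def compression_ratio(n_misses, nbytes):
    return (n_misses * BASE_BITS) / compressed_bits(n_misses, nbytes)


addrs = [0x1000, 0x1003, 0x0FF0, 0x1100]   # made-up block addresses
base, mode, deltas = compress(addrs)        # largest delta is 256, so 2 bytes
restored = decompress(base, deltas)
```

For 20 misses, `compressed_bits` gives 206, 366, and 686 bits for the 1, 2, and 4-byte modes, against 920 bits uncompressed.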

V-C AMC Cache

Figure 7: An illustration of AMC Cache.

AMC prefetches the access-to-miss address correlations recorded during the last iteration into an on-chip cache. These correlation entries are inserted into the cache in sequence and evicted in FIFO order. Fig 7 illustrates a high-level view of the AMC Cache design. The three main components of the AMC Cache are (1) the AMC Cache tag, which stores target addresses as the tags for cache lookup, (2) the Target Metadata RAM, which stores the location of the corresponding entry in the AMC Cache, and (3) the Compressed Miss RAM, which stores the compressed misses.

V-C1 AMC Cache Lookup

The AMC Cache tag is a content-addressable memory (CAM) [31] that compares a target address against the stored tags. AMC uses the addresses identified by the Target Recorder to look up the AMC Cache tag. A matched tag returns the associated Target Metadata RAM entry. Each entry in the Target Metadata RAM consists of a valid bit (1 bit), the number of misses in the entry (5 bits), the compression mode (2 bits: 1/2/4-byte-Δ), and the head pointer of the miss addresses stored in the Compressed Miss RAM. On an AMC Cache hit, AMC uses the number of misses and the head pointer to extract the corresponding compressed miss addresses from the Compressed Miss RAM. The AMC prefetcher passes these compressed miss addresses to a BaseΔ decompressor to generate L2 prefetch candidates. Multiple hits in the AMC Cache tag are possible because a correlation with more than 20 misses can be split into multiple correlation entries. In case of multiple hits, AMC extracts the corresponding entries one by one to decompress the miss addresses. AMC Cache hit entries are written back to the off-chip recording phase metadata storage for the next iteration.

V-C2 AMC Cache Insertion

AMC keeps track of the processing progress of the frontier array to issue timely prefetches. A Frontier Buffer identifies and records the addresses of the frontier accesses, similar to the Target Recorder. These frontier addresses are then used to determine when to prefetch AMC Index entries into the AMC Index Identifier. The AMC Index Identifier caches a contiguous subset of the AMC Index entries from the off-chip AMC Index. When two frontier accesses have been recorded in a Frontier Buffer entry, AMC uses the frontier deltas to look up the AMC Index Identifier. A frontier delta is obtained by subtracting the frontier start address, stored in the architectural address range register, from the frontier access address. The Address Calculation unit scales the frontier delta to the target delta size to obtain the corresponding target delta using the equation target_delta = frontier_delta × (target_size / frontier_size). The frontier and target sizes are obtained from the architectural registers.
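The address calculation can be sketched as follows. The example values are made up, and it is an assumption here that the two sizes are the byte sizes passed to AMC.AddrTBase and AMC.AddrFBase for arrays with the same number of elements; the function names are hypothetical.

```python
def frontier_delta(frontier_addr, frontier_start):
    # Offset of the frontier access from the start of the frontier array.
    return frontier_addr - frontier_start


def target_delta(f_delta, target_size, frontier_size):
    # target_delta = frontier_delta * (target_size / frontier_size),
    # kept in integer arithmetic because the operands are byte offsets.
    return f_delta * target_size // frontier_size


# Hypothetical layout: N = 1000 elements, 1-byte frontier flags at 0x2000
# and 8-byte target values, so the size ratio is 8.
N = 1000
f_delta = frontier_delta(0x2005, 0x2000)          # sixth frontier element
t_delta = target_delta(f_delta, 8 * N, 1 * N)     # byte offset into the target
```

With this layout, an access to the sixth frontier element maps to a 40-byte offset into the target array, which is the sixth 8-byte target element.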

On an AMC Index Identifier hit, the AMC Cache Prefetcher prefetches the corresponding entry from the off-chip Miss Addresses table into the AMC Cache. AMC uses the pointer to the entry in the Miss Addresses table, the compression mode, and the number of misses to prefetch the miss addresses. The Target Metadata RAM stores the hit entry of the AMC Index Identifier together with a pointer into the Compressed Miss RAM, copied from the tail pointer of the Compressed Miss RAM. The miss addresses are stored at the current tail pointer of the Compressed Miss RAM, which is then advanced by the size of the compressed misses for the next entry. The AMC Index Identifier invalidates all previous entries, including the hit entry, to make room for the next set of AMC Index entries.

The purpose of the AMC Index Identifier is to bring metadata on-chip early to reduce the delay of prefetching AMC entries into the AMC Cache. The AMC Index Identifier stores a range of target entries from the off-chip AMC Index. Once this range of AMC Index entries is loaded into the AMC Index Identifier, the head pointer points just past the last entry fetched, ready for the next refill. On an AMC Index Identifier miss, if the latest frontier delta is greater than the latest delta of the last entry in the AMC Index Identifier, the entries of the AMC Index Identifier are invalidated and replaced by the next batch of target entries from the off-chip AMC Index.

V-C3 AMC Cache Replacement

The AMC Cache uses FIFO replacement to invalidate the earliest entries when there are no invalid entries in the AMC Cache tag or no space in the Compressed Miss RAM. This simplifies the cache design. There is no eviction from the AMC Cache to the off-chip metadata storage.
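A toy model of this replacement policy is sketched below. It assumes that each entry is keyed by its pair of target addresses and that every entry fits in the Compressed Miss RAM on its own; `AMCCacheModel` and its parameters are hypothetical and do not reflect the actual hardware organization.

```python
from collections import deque


class AMCCacheModel:
    """Toy FIFO model; entries are (targets, compressed_misses, size_bytes)."""

    def __init__(self, max_entries, miss_ram_bytes):
        self.max_entries = max_entries        # capacity of the tag array
        self.miss_ram_bytes = miss_ram_bytes  # capacity of the Compressed Miss RAM
        self.used_bytes = 0
        self.fifo = deque()                   # leftmost entry is the oldest

    def insert(self, targets, compressed_misses, size_bytes):
        # Drop the oldest entries until a tag slot and enough RAM space are free.
        while self.fifo and (len(self.fifo) >= self.max_entries
                             or self.used_bytes + size_bytes > self.miss_ram_bytes):
            _, _, old_size = self.fifo.popleft()
            self.used_bytes -= old_size       # dropped, never written back
        self.fifo.append((targets, compressed_misses, size_bytes))
        self.used_bytes += size_bytes

    def lookup(self, targets):
        # A CAM can match several entries; return them oldest first.
        return [misses for tag, misses, _ in self.fifo if tag == targets]


cache = AMCCacheModel(max_entries=2, miss_ram_bytes=100)
cache.insert((1, 2), "A", 40)
cache.insert((3, 4), "B", 40)
cache.insert((5, 6), "C", 40)   # tag array full: evicts "A"
cache.insert((7, 8), "D", 70)   # evicts "B" (tags), then "C" (RAM space)
```

Returning a list from `lookup` mirrors the multiple-hit case, in which a correlation with more than 20 misses was split into several entries with the same target addresses.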

VI Experimental setup

Core Parameters | 4 OoO cores, 4GHz, 4-wide, 256 ROB, 64 LQ, 64 SQ, perceptron branch predictor [25]
L1 D/I Cache | private, 64KB, 8-way, 4 cycles, 64B block, MSHR: 8
L2 Cache | private, 256KB, 8-way, 12 cycles, 64B block, MSHR: 16, next-line prefetcher
LLC Cache | shared, 8MB, 16-way, 42 cycles, 64B block, MSHR: 128
Memory Controller | FCFS, read queue size = 64, write queue size = 32, write queue draining: high/low threshold = 75%/25%
Main Memory | DDR4, 8Gb (x16 I/Os), 2400 MT/s, 1 channel, 1 rank, 16 banks, tRCD = tRP = tCL = 17 cycles
TABLE VI: Processor configuration (baseline)

This work uses ChampSim [11], a trace-based simulation infrastructure, to evaluate the performance of the proposed AMC. ChampSim has been used in prefetching competitions. ChampSim's cache system implements FIFO read and prefetch queues. It accurately models bank and bus contention, the page table, TLB caches, and TLB functions such as page table walks. The core parameters are modeled on the Intel i7-6700 [20] and shown in Table VI. The memory timing constraints come from the Micron MT40A2G4 DDR4-2400-CL17 data sheet [58]. The on-chip area and energy consumption of the BaseΔ Compressor are estimated using a 45nm Synopsys standard cell library [55] (RTL synthesis) and scaled down to 22nm. To quantify the energy benefits of the proposed work, we developed an analytical model based on McPAT [34], CACTI [7], and the Micron DDR4 SDRAM System-Power calculator [39]. CACTI is used to obtain the per-access energy for the different cache levels. McPAT is used to obtain the energy consumed by the core. We modify the Micron DDR4 SDRAM System-Power calculator to model memory energy consumption with current numbers from the Micron MT40A2G4. ChampSim does not model the OS implications of context switches.

This work evaluates common dynamic graph kernels [53] on real-world graphs [33], run until completion. The dynamic graph kernels PGD and Connected Components (CC) are from Ligra. This work modifies the BFS and BellmanFord kernels from Ligra [53] using a strategy similar to [70, 35]; that is, these kernels are simulated twice with two different inputs to create a dynamic graph scenario, as in existing techniques [70, 35, 52, 59, 61]. For the first run, 80% of the vertices are randomly selected; for the second run, 10% of the vertices from the first input graph are randomly deleted and 10% of the vertices from the original input are added.
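One possible reading of this two-run input construction is sketched below. The text leaves several details open, so the following are assumptions: the added vertices are drawn from those not selected in the first run, the 10% additions are counted against the original vertex set, and a fixed seed is used for repeatability. `make_dynamic_inputs` is a hypothetical helper, not part of Ligra or ChampSim.

```python
import random


def make_dynamic_inputs(vertices, seed=0):
    """Return (first_run_vertices, second_run_vertices) as sets."""
    rng = random.Random(seed)
    vertices = list(vertices)

    # First run: randomly select 80% of the original vertices.
    first = set(rng.sample(vertices, len(vertices) * 8 // 10))

    # Second run: delete 10% of the first run's vertices ...
    deleted = set(rng.sample(sorted(first), len(first) // 10))

    # ... and add 10% of the original vertices (assumed: previously unused ones).
    unused = sorted(set(vertices) - first)
    n_add = min(len(unused), len(vertices) // 10)
    added = set(rng.sample(unused, n_add))

    second = (first - deleted) | added
    return first, second


first, second = make_dynamic_inputs(range(1000))
```

For a 1000-vertex input under these assumptions, the first run uses 800 vertices and the second run uses 820, of which 100 are new and 80 of the original selection are gone.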

Dataset | Vertices (Million) | Edges (Million) | Degree | Type
Amazon | 0.4 | 3.39 | 9 | Product network
Stanford | 0.28 | 2.31 | 9 | Web graph Stanford
Youtube | 1.16 | 2.99 | 3 | Online social network
Road-CA | 1.97 | 5.53 | 3 | Road network California
ComDblp | 0.43 | 0.36 | 1 | DBLP collaboration network
Google | 0.88 | 5.11 | 6 | Web graph Google
NotreDame | 0.33 | 1.5 | 5 | Web graph Notre Dame
TABLE VII: Input Datasets [33]

Table VII lists the real-world data sets used for evaluation. The simulation setup does not use the same set of inputs for every kernel because a few inputs, e.g., Road-CA for PGD, require weeks to finish. All the evaluated kernels use the Single Program Multiple Data model [14], similar to RnR [68]. Every task executes the same program, together with graph partitioning [27, 54]. In the evaluated simulation setup, the master process is responsible for initializing all the data structures in the algorithm and partitioning the graph into four partitions using METIS [27]. Each partition is assigned to a worker, which performs the computation on it. Once the workers complete all the computations, the master process collects the updated data structures and finishes the overall graph analysis.

Bingo [6] | 119kB | 16K entry history table, degree: 32
VLDP [51] | 998B | OPT 128B, DHB 222B, DPT 648B, degree: 4
RnR [68] | 1KB | Window size 512, Buffer size = 256, degree: 512
MISB [67] | 49kB | 32kB cache, 17kB bloom filter, degree: 32
AMC24kB | 29kB | 24kB AMC Cache, 5kB BaseΔ compressor, 100-entry target recorder, 100-entry AMC index identifier, 100-entry frontier buffer, degree: correlated stream
TABLE VIII: On-Chip Storage cost of evaluated prefetchers.

VII Evaluation results

Figure 8: Speedup over baseline configuration (Table VI).

The baseline system against which the AMC prefetcher and the other prefetchers are evaluated uses the next-line prefetcher as the L2 data prefetcher. Modern systems [21] employ composite prefetchers [30, 45] to target different access patterns in a kernel. Therefore, it is important to evaluate a prefetcher design in a composite setting. No specific data structure range is assigned to the next-line prefetcher, to keep the environment as close to reality as possible. The AMC prefetcher is compared against five prefetchers: (1) Bingo [6], (2) ISB [23], (3) MISB [67], (4) VLDP [51], and (5) RnR [68]. All these prefetchers are trained on L1 data cache accesses/misses and assigned as L2 prefetchers, except RnR. The RnR prefetcher is trained at L2, as presented in its original proposal, to have a fair comparison with a similar software-assisted hardware prefetcher.

Table VIII describes the configurations of all these prefetchers. As an exception, ISB [23] is configured with an ideal on-chip metadata cache with zero access latency and infinite size, and its prefetch degree is set to the correlated stream length. The “IDEAL” case is analyzed by using an infinite-sized L2 cache. As the AMC prefetcher uses more than one trigger access to initiate prefetching, this work is compared to VLDP and Bingo, which use a similar lookup strategy to solve the aliasing problem of temporal prefetchers. Finally, we compare it to RnR, which lies in the same category and achieves close to 100% accuracy on long repeating irregular kernels.

VII-A Performance

VII-A1 Speedup

The speedup is defined as $\frac{PrefetcherIPC}{BaselineIPC}$ (Fig 8). The proposed work evaluates BFS and BellmanFord on the second run, when the input graph changes, similar to previous dynamic graph accelerators [70, 35, 52, 59, 61]. The total number of iterations for PGD and CC depends on the graph input. The initial iteration traverses all the vertices in the graph for PGD and CC. The vertices become active for the next iteration depending on the vertex property value. AMC's off-chip metadata storage recycles metadata that will not be used in future iterations, which breaks the dependency between application footprint and metadata size. Additionally, AMC uses multiple trigger accesses, similar to Bingo and VLDP, to form accurate correlations and differentiate between similar patterns, which ISB/MISB cannot. For PGD and CC, the AMC prefetcher achieves speedups of 1.71× and 2.04× (geomean) over the baseline, respectively, whereas VLDP achieves 1.17× and 1.05×.

The AMC prefetcher becomes more accurate in correlating miss streams over iterations because the concurrent recording phase creates new correlations in every iteration. AMC does not rely on histories beyond the last iteration. This is because evolving graphs do not typically change rapidly, and two adjacent iterations have enough similarities. AMC prefetches all addresses except those of the target data structures, because these are contiguous vertex arrays and their address range is therefore bounded. These data structures can be prefetched using the next-line prefetcher, which is separate from the AMC prefetcher.

For BFS and BellmanFord, the performance improvement is lower than for PGD and CC. This is because the end-to-end evaluation consists of only two instances; with more instances, the improvement is expected to increase. The AMC prefetcher achieves speedups of about 1.40× and 1.25× (geomean) over the baseline, whereas VLDP achieves 1.14× and 1.10×, for BFS and BellmanFord respectively. As RnR targets long repetitive iterations, it is effective on a dynamic graph only when the graph behaves close to a static one, that is, when the percentage of vertices/edges added or removed is marginally small. RnR performs marginally better than the baseline. The primary reason AMC handles dynamic graphs better than RnR [68] is its adaptability. Unlike RnR, which replays the same irregular memory access pattern recorded in the initial iteration, AMC updates its association table every iteration.

Through a quantitative comparison, the RnR [68] paper analyzes the timeliness of DROPLET's [8] prefetching strategy. DROPLET and PRODIGY [56] require access to a loaded value to calculate the next prefetch candidate's address. This dependency causes prefetching delay. We use the DROPLET model, as in the RnR paper, to model PRODIGY, and AMC performs about 1.56× (geomean) better than PRODIGY. The PRODIGY paper points out that it cannot account for additional control-flow information, which leads to cache thrashing [56]. The performance of Domino [5] (many-to-many correlation) is worse than that of MISB. Quantitatively, AMC performs 1.6× (geomean) better than Domino (degree: 4). Since PRODIGY and Domino perform worse than the baseline, they are excluded from further evaluation.

VII-A2 Miss Coverage

Figure 9: L2 miss coverage.

The coverage is $\frac{UsefulPrefetches}{TotalBaselineMisses}$ (Fig 9). It is the fraction of baseline misses covered by the prefetcher and conveys the prefetcher's effectiveness in predicting upcoming misses. The AMC prefetcher's uncovered misses in an iteration are due to the changing active vertex set and cold misses in the initial iteration. The AMC prefetcher on average covers about 59.43% and 45% of misses in L2. For BFS and BellmanFord, the coverage improvement is much more significant than for PGD and CC.

The active vertex set for graph traversal in BellmanFord changes only marginally (2-7%); therefore, the AMC prefetcher behaves like RnR on a static graph. The AMC prefetcher covers on average about 54.51% and 89.95% of L2 misses. The exception is the Amazon input for BFS, where AMC covers about 10% less than MISB. This is because in BFS, if the parent node changes, the whole graph traversal changes, and thus the miss stream recorded by AMC becomes useless. Overall, AMC still performs better than MISB on this input for BFS because, when the recorded miss stream does not match the demand access stream of the next iteration, AMC does not issue it; AMC therefore retains high accuracy (Section VII-A3) and does not cause cache pollution. Spatial prefetchers (VLDP and Bingo) have low coverage because the graphs do not exhibit spatial locality due to their large size and data-dependent accesses. RnR suffers from low coverage (1.7%) because successive iterations do not have exactly the same memory access pattern, correlated with the irregular access count, as the initial iteration.

VII-A3 Accuracy

Figure 10: Prefetch accuracy.

The accuracy is $\frac{UsefulPrefetches}{TotalPrefetches}$ (Fig 10). It is the ratio of useful prefetches to the total number of prefetches issued. A useful prefetch brings the cache block into the cache level before its demand access arrives. As the AMC prefetcher uses access-to-miss correlation along with the many-to-many correlation style, the probability of issuing inaccurate prefetches is reduced. On average, the AMC prefetcher achieves an accuracy of 55%, 63.7%, 65%, and 66.4% for PGD, CC, BFS, and BellmanFord, respectively.

Similarly, VLDP uses many-to-one correlation to issue accurate prefetches and, at 31%, has better accuracy than the other prior prefetchers. In the case of PGD with the Youtube input, the accuracy of VLDP is much better than that of AMC. This indicates that a finer granularity of correlation (within a page) works better for some graph layouts than vertex-to-vertex dependent correlations.

VII-A4 Timeliness

Figure 11: AMC timeliness.

Timeliness (Fig 11) is how soon a prefetcher can prefetch a cache block relative to its reference time. The overprediction in AMC arises from the changes in vertices/edges of the dynamic graph in every iteration. In BFS, the dynamic graph change only marginally affects the overall graph traversal path for the Road-CA input; therefore, its overprediction is the lowest. In addition, AMC could benefit from a throttling mechanism that delays prefetches to regain the coverage lost to early prefetches.

VII-A5 Additional Off-chip Traffic

Figure 12: Additional off-chip traffic.

The additional off-chip traffic is $\frac{PrefDramAccess - DemandDramAccess}{DemandDramAccess}$ (Fig 12). PrefDramAccess is the number of main memory accesses with a prefetcher enabled. DemandDramAccess is the number of main memory accesses in the baseline. On average, ISB and MISB issue 4× more prefetches than AMC. The high accuracy of the AMC prefetcher and its compressed metadata are the main reasons for the relatively low additional off-chip traffic. We break down the additional off-chip traffic to determine the metadata traffic for prefetchers that use off-chip metadata storage (Fig 13). AMC metadata traffic is on average 25%, compared to 493% and 54% for ISB and MISB respectively. The overall average additional off-chip traffic is 155%, 958%, 151%, 458%, and 56% for Bingo, ISB, VLDP, MISB, and AMC respectively.
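For reference, the metrics used in Section VII-A (speedup, coverage, accuracy, and additional off-chip traffic) and the geometric mean used for averaging speedups can be written as small helper functions. This is a sketch only: the function names are hypothetical, and the counter values in the example are made up for illustration rather than taken from the evaluation.

```python
import math


def speedup(prefetcher_ipc, baseline_ipc):
    return prefetcher_ipc / baseline_ipc


def coverage(useful_prefetches, total_baseline_misses):
    return useful_prefetches / total_baseline_misses


def accuracy(useful_prefetches, total_prefetches):
    return useful_prefetches / total_prefetches


def additional_offchip_traffic(pref_dram_access, demand_dram_access):
    # Extra main memory accesses relative to the baseline without the prefetcher.
    return (pref_dram_access - demand_dram_access) / demand_dram_access


def geomean(values):
    # Geometric mean, the usual way to average ratios such as speedups.
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Made-up counters for illustration only.
example_coverage = coverage(600, 1000)
example_accuracy = accuracy(600, 800)
example_traffic = additional_offchip_traffic(1560, 1000)
example_geomean = geomean([1.0, 4.0])
```

A prefetcher with high accuracy issues few useless prefetches, which keeps `pref_dram_access` close to the baseline count and the additional traffic low.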

Figure 13: Off-chip metadata traffic.

VII-B Energy Overhead

Figure 14: Energy comparison between baseline and AMC prefetcher.

Fig 14 shows the energy breakdown and comparison between the baseline and the AMC prefetcher. With the AMC prefetcher, energy consumption decreases in all categories (core, cache, memory). The AMC prefetcher consumes on average 1.28× less energy than the baseline. This is chiefly because the shorter overall execution time reduces the static energy consumption of the core, caches, and DRAM.

VII-C Hardware Overhead

AMC requires a moderate amount of logic per core and a set of architectural and internal registers (Section IV) for compression and for developing the correlation between L1 data accesses and L2 misses. The AMC Cache requires about 29 kB (78.3E-3 mm²) for each core. The BaseΔ compressor unit occupies about 13.3E-3 mm² per core. The overall on-chip area overhead is 0.2% of the total on-chip area (46.19 mm²).

VII-D Storage Overhead

Figure 15: Off-chip metadata storage overhead.

From Fig 15, the off-chip storage is always below 25% of the input size. If the kernel's access pattern shows poor spatial locality, AMC needs to record a miss stream that can span multiple pages, which reduces the compression ratio of the overall miss stream.

VII-E Miss Size Sensitivity

Figure 16: Miss size sensitivity.

To select an appropriate miss stream size for an AMC entry, this study assumes an infinitely sized AMC Cache and an unbounded miss stream size per AMC entry. Fig 16 shows that the number of AMC entries with a miss size greater than 20 is less than 1% for the evaluated kernels and inputs. Another observation is that 20 misses per AMC entry ensures that every evaluated kernel-input pair covers at least 74% of entries. Increasing the number of misses beyond 20 does not yield enough performance improvement to justify the additional hardware overhead. Consequently, AMC records 20 misses per entry.

VIII Conclusion

This work proposes a novel lightweight software-assisted AMC hardware prefetcher to improve prefetching accuracy and miss coverage for dynamic graph applications. By allowing programmers to identify the target data structure, the proposed AMC prefetcher uses the “many-to-many” correlation style, which adds contextual information to solve the aliasing problem and adapt to graph changes. The AMC prefetcher stores compressed metadata both on-chip and off-chip, thus efficiently utilizing memory bandwidth and space.

References

  • [1] Sam Ainsworth and Timothy M Jones. Graph prefetching using data structure knowledge. In Proceedings of the 2016 International Conference on Supercomputing, pages 1–11, 2016.
  • [2] Sam Ainsworth and Timothy M Jones. Software prefetching for indirect memory accesses. In 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 305–317. IEEE, 2017.
  • [3] Sam Ainsworth and Timothy M Jones. An event-triggered programmable prefetcher for irregular workloads. ACM Sigplan Notices, 53(2):578–592, 2018.
  • [4] Hassan Al-Sukhni, Ian Bratt, and Daniel A Connors. Compiler-directed content-aware prefetching for dynamic data structures. In 2003 12th International Conference on Parallel Architectures and Compilation Techniques, pages 91–100. IEEE, 2003.
  • [5] Mohammad Bakhshalipour, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad. Domino temporal data prefetcher. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 131–142. IEEE, 2018.
  • [6] Mohammad Bakhshalipour, Mehran Shakerinava, Pejman Lotfi-Kamran, and Hamid Sarbazi-Azad. Bingo spatial data prefetcher. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 399–411, 2019.
  • [7] Rajeev Balasubramonian, Andrew B. Kahng, Naveen Muralimanohar, Ali Shafiee, and Vaishnav Srinivas. Cacti 7: New tools for interconnect exploration in innovative off-chip memories. ACM Trans. Archit. Code Optim., 2017.
  • [8] Abanti Basak, Shuangchen Li, Xing Hu, Sang Min Oh, Xinfeng Xie, Li Zhao, Xiaowei Jiang, and Yuan Xie. Analysis and optimization of the memory hierarchy for graph processing workloads. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 373–386. IEEE, 2019.
  • [9] Abanti Basak, Zheng Qu, Jilan Lin, Alaa R Alameldeen, Zeshan Chishti, Yufei Ding, and Yuan Xie. Improving streaming graph processing performance using input knowledge. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pages 1036–1050, 2021.
  • [10] Mustafa Cavus, Resit Sendag, and Joshua J. Yi. Informed prefetching for indirect memory accesses. ACM Trans. Archit. Code Optim., 17(1), mar 2020.
  • [11] ChampSim. Champsim simulator. https://github.com/ChampSim/ChampSim, 2020.
  • [12] Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J Franklin, Joseph M Hellerstein, Wei Hong, Sailesh Krishnamurthy, Samuel R Madden, Fred Reiss, and Mehul A Shah. Telegraphcq: continuous dataflow processing. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 668–668, 2003.
  • [13] Raymond Cheng, Ji Hong, Aapo Kyrola, Youshan Miao, Xuetian Weng, Ming Wu, Fan Yang, Lidong Zhou, Feng Zhao, and Enhong Chen. Kineograph: Taking the pulse of a fast-changing and connected world. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys ’12, page 85–98, New York, NY, USA, 2012. Association for Computing Machinery.
  • [14] Frederica Darema, David A George, V Alan Norton, and Gregory F Pfister. A single-program-multiple-data computational model for epex/fortran. Parallel Computing, 7(1):11–24, 1988.
  • [15] Michael Ferdman, Stephen Somogyi, and Babak Falsafi. Spatial memory streaming with rotated patterns. 1st JILP Data Prefetching Championship, 29, 2009.
  • [16] Kathrin Hanauer, Monika Henzinger, and Christian Schulz. Recent advances in fully dynamic graph algorithms. arXiv preprint arXiv:2102.11169, 2021.
  • [17] Frank Harary and Gopal Gupta. Dynamic graph models. Mathematical and Computer Modelling, 25(7):79–87, 1997.
  • [18] Zhigang Hu, Margaret Martonosi, and Stefanos Kaxiras. Tcp: Tag correlating prefetchers. In The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings., pages 317–326. IEEE, 2003.
  • [19] Sorin Iacobovici, Lawrence Spracklen, Sudarshan Kadambi, Yuan Chou, and Santosh G Abraham. Effective stream-based and execution-based data prefetching. In Proceedings of the 18th annual international conference on Supercomputing, pages 1–11, 2004.
  • [20] Intel. Intel i7-6700 (Skylake), 4.0 GHz (Turbo Boost), 14 nm. https://www.intel.com/content/www/us/en/processors/core/desktop-6th-gen-core-family-datasheet-vol-1.html, 2020.
  • [21] Intel. Intel® 64 and IA-32 architectures optimization reference manual. Order number 248966-046A, 2023.
  • [22] Anand Padmanabha Iyer, Li Erran Li, Tathagata Das, and Ion Stoica. Time-evolving graph processing at scale. In Proceedings of the fourth international workshop on graph data management experiences and systems, pages 1–6, 2016.
  • [23] Akanksha Jain and Calvin Lin. Linearizing irregular memory accesses for improved correlated prefetching. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pages 247–259, 2013.
  • [24] Saba Jamilan, Tanvir Ahmed Khan, Grant Ayers, Baris Kasikci, and Heiner Litz. APT-GET: Profile-guided timely software prefetching. In Proceedings of the Seventeenth European Conference on Computer Systems, EuroSys ’22, pages 747–764, New York, NY, USA, 2022. Association for Computing Machinery.
  • [25] Daniel A Jiménez and Calvin Lin. Dynamic branch prediction with perceptrons. In Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture, pages 197–206. IEEE, 2001.
  • [26] Doug Joseph and Dirk Grunwald. Prefetching using Markov predictors. In Proceedings of the 24th annual international symposium on Computer architecture, pages 252–263, 1997.
  • [27] George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998.
  • [28] Anirudh Mohan Kaushik, Gennady Pekhimenko, and Hiren Patel. GRETCH: a hardware prefetcher for graph analytics. ACM Transactions on Architecture and Code Optimization (TACO), 18(2):1–25, 2021.
  • [29] Sushant Kondguli and Michael Huang. T2: A highly accurate and energy efficient stride prefetcher. In 2017 IEEE International Conference on Computer Design (ICCD), pages 373–376. IEEE, 2017.
  • [30] Sushant Kondguli and Michael Huang. Division of labor: A more effective approach to prefetching. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pages 83–95. IEEE, 2018.
  • [31] Anargyros Krikelis and Charles C Weems. Associative processing and processors. Computer, 27(11):12–17, 1994.
  • [32] Pradeep Kumar and H Howie Huang. GraphOne: A data store for real-time analytics on evolving graphs. ACM Transactions on Storage (TOS), 15(4):1–40, 2020.
  • [33] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
  • [34] Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures. In MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 469–480, 2009.
  • [35] Mugilan Mariappan and Keval Vora. GraphBolt: Dependency-driven synchronous processing of streaming graphs. In Proceedings of the Fourteenth EuroSys Conference 2019, pages 1–16, 2019.
  • [36] Andrew McCrabb, Eric Winsor, and Valeria Bertacco. DREDGE: Dynamic repartitioning during dynamic graph execution. In 2019 56th ACM/IEEE Design Automation Conference (DAC), pages 1–6. IEEE, 2019.
  • [37] Andrew McGregor. Graph stream algorithms: a survey. ACM SIGMOD Record, 43(1):9–20, 2014.
  • [38] Pierre Michaud. Best-offset hardware prefetching. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 469–480. IEEE, 2016.
  • [39] Micron. Micron system power calculators, 2020.
  • [40] Todd Mowry and Anoop Gupta. Tolerating latency through software-controlled prefetching in shared-memory multiprocessors. Journal of Parallel and Distributed Computing, 12(2):87–106, 1991.
  • [41] Ajeya Naithani, Sam Ainsworth, Timothy M Jones, and Lieven Eeckhout. Vector runahead. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pages 195–208. IEEE, 2021.
  • [42] Ajeya Naithani, Jaime Roelandts, Sam Ainsworth, Timothy M Jones, and Lieven Eeckhout. Decoupled vector runahead. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pages 17–31, 2023.
  • [43] Kyle J Nesbit and James E Smith. Data cache prefetching using a global history buffer. In 10th International Symposium on High Performance Computer Architecture (HPCA’04), pages 96–96. IEEE, 2004.
  • [44] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
  • [45] Samuel Pakalapati and Biswabandan Panda. Bouquet of instruction pointers: Instruction pointer classifier-based spatial hardware prefetching. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pages 118–131. IEEE, 2020.
  • [46] Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Phillip B Gibbons, Michael A Kozuch, and Todd C Mowry. Base-delta-immediate compression: Practical data compression for on-chip caches. In Proceedings of the 21st international conference on Parallel architectures and compilation techniques, pages 377–388, 2012.
  • [47] Shafiur Rahman, Mahbod Afarin, Nael Abu-Ghazaleh, and Rajiv Gupta. Jetstream: Graph analytics on streaming data with event-driven hardware accelerator. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pages 1091–1105, 2021.
  • [48] Parthasarathy Ranganathan, Kourosh Gharachorloo, Sarita V Adve, and Luiz André Barroso. Performance of database workloads on shared-memory systems with out-of-order processors. In Proceedings of the eighth international conference on Architectural support for programming languages and operating systems, pages 307–318, 1998.
  • [49] David Sayce. The number of tweets per day in 2020, 2022. https://www.dsayce.com/social-media/tweets-day/.
  • [50] André Seznec. A 256 Kbits L-TAGE branch predictor. Journal of Instruction-Level Parallelism (JILP) Special Issue: The Second Championship Branch Prediction Competition (CBP-2), 9:1–6, 2007.
  • [51] Manjunath Shevgoor, Sahil Koladiya, Rajeev Balasubramonian, Chris Wilkerson, Seth H Pugsley, and Zeshan Chishti. Efficiently prefetching complex address patterns. In Proceedings of the 48th International Symposium on Microarchitecture, pages 141–152, 2015.
  • [52] Xiaogang Shi, Bin Cui, Yingxia Shao, and Yunhai Tong. Tornado: A system for real-time iterative analysis over evolving data. In Proceedings of the 2016 International Conference on Management of Data, pages 417–430, 2016.
  • [53] Julian Shun and Guy E Blelloch. Ligra: a lightweight graph processing framework for shared memory. In Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 135–146, 2013.
  • [54] Julian Shun, Farbod Roosta-Khorasani, Kimon Fountoulakis, and Michael W Mahoney. Parallel local graph clustering. arXiv preprint arXiv:1604.07515, 2016.
  • [55] Synopsys. Synopsys Standard Cell Libraries. https://www.synopsys.com/dw/ipdir.php?ds=dwc_standard_cell, 2019. [Version P-2019.03, March 2019].
  • [56] Nishil Talati, Kyle May, Armand Behroozi, Yichen Yang, Kuba Kaszyk, Christos Vasiladiotis, Tarunesh Verma, Lu Li, Brandon Nguyen, Jiawen Sun, et al. Prodigy: Improving the memory latency of data-indirect irregular workloads using hardware-software co-design. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 654–667. IEEE, 2021.
  • [57] Hao Tang, Guoshuai Zhao, Xuxiao Bu, and Xueming Qian. Dynamic evolution of multi-graph based collaborative filtering for recommendation systems. Knowledge-Based Systems, 228:107251, 2021.
  • [58] Micron Technology. 8Gb: x4, x8, x16 DDR4 SDRAM features. https://www.micron.com/-/media/client/global/documents/products/data-sheet/dram/ddr4/8gb_ddr4_sdram.pdf.
  • [59] Pourya Vaziri and Keval Vora. Controlling memory footprint of stateful streaming graph processing. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 269–283, 2021.
  • [60] Nandita Vijaykumar, Ataberk Olgun, Konstantinos Kanellopoulos, F Nisa Bostanci, Hasan Hassan, Mehrshad Lotfi, Phillip B Gibbons, and Onur Mutlu. MetaSys: A practical open-source metadata management system to implement and evaluate cross-layer optimizations. ACM Transactions on Architecture and Code Optimization (TACO), 19(2):1–29, 2022.
  • [61] Keval Vora, Rajiv Gupta, and Guoqing Xu. KickStarter: Fast and accurate computations on streaming graphs via trimmed approximations. In Proceedings of the twenty-second international conference on architectural support for programming languages and operating systems, pages 237–251, 2017.
  • [62] Qinggang Wang, Long Zheng, Yu Huang, Pengcheng Yao, Chuangyi Gui, Xiaofei Liao, Hai Jin, Wenbin Jiang, and Fubing Mao. GraSU: A fast graph update library for FPGA-based dynamic graph processing. In The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 149–159, 2021.
  • [63] Zhenyu Wen, Renyu Yang, Peter Garraghan, Tao Lin, Jie Xu, and Michael Rovatsos. Fog orchestration for Internet of Things services. IEEE Internet Computing, 21(2):16–24, 2017.
  • [64] Thomas F Wenisch, Michael Ferdman, Anastasia Ailamaki, Babak Falsafi, and Andreas Moshovos. Practical off-chip meta-data for temporal memory streaming. In 2009 IEEE 15th International Symposium on High Performance Computer Architecture, pages 79–90. IEEE, 2009.
  • [65] Hao Wu et al. Practical irregular prefetching. PhD thesis, The University of Texas at Austin, 2020.
  • [66] Hao Wu, Krishnendra Nathella, Joseph Pusdesris, Dam Sunwoo, Akanksha Jain, and Calvin Lin. Temporal prefetching without the off-chip metadata. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 996–1008, 2019.
  • [67] Hao Wu, Krishnendra Nathella, Dam Sunwoo, Akanksha Jain, and Calvin Lin. Efficient metadata management for irregular data prefetching. In 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), pages 1–13. IEEE, 2019.
  • [68] Chao Zhang, Yuan Zeng, John Shalf, and Xiaochen Guo. RnR: A software-assisted record-and-replay hardware prefetcher. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 609–621. IEEE, 2020.
  • [69] Dan Zhang, Xiaoyu Ma, Michael Thomson, and Derek Chiou. Minnow: Lightweight offload engines for worklist management and worklist-directed prefetching. ACM SIGPLAN Notices, 53(2):593–607, 2018.
  • [70] Jin Zhao, Yun Yang, Yu Zhang, Xiaofei Liao, Lin Gu, Ligang He, Bingsheng He, Hai Jin, Haikun Liu, Xinyu Jiang, et al. TDGraph: a topology-driven accelerator for high-performance streaming graph processing. In Proceedings of the 49th Annual International Symposium on Computer Architecture, pages 116–129, 2022.