2.1 Virtual Memory in Modern Systems
Unlike early single-tasking OSs, which could execute only one program at a time, modern multi-tasking OSs running on current CPUs execute multiple independent processes simultaneously. Although this drastically improves the user experience, it complicates how the OS manages processes, as each process must be mapped to a different memory location and therefore uses different memory addresses. To reduce the burden on programmers, the virtual memory abstraction lets every process in the system use the same memory addresses, with the OS responsible for translating the addresses used by each process (VAs) into unique physical addresses (PAs) in memory.
The part of the CPU responsible for this translation is the MMU, whose general structure is illustrated in Figure 1(a). The MMU consists mainly of two types of components: the Page Table Walker (PTW) and the Translation Lookaside Buffers (TLBs). The PTW follows a chain of page table entries, each of which describes a portion of the VA space and its corresponding PA. These entries are stored in a hierarchical structure known as the multi-level page table, which is managed by the OS. In a multi-level page table, the VA space is divided into several levels, each with its own table. When a VA needs to be translated to a PA, the translation starts at the highest-level page table and walks down through the hierarchy until it reaches the lowest-level table. The PTW is also responsible for ensuring that memory accesses are properly authorized and directed to the correct physical memory location: it checks the page table entries to verify that the program has the necessary privileges and to detect any access violations or errors.
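To make the walk concrete, the following C sketch mimics a four-level page table walk. The table layout, bit fields, and the flat phys_mem array are illustrative assumptions rather than a description of any specific MMU.

/*
 * Minimal sketch of a four-level page table walk (x86-64-like layout),
 * assuming 4 KiB pages and 512-entry tables. The flat phys_mem array and
 * the entry format are illustrative assumptions, not a real hardware
 * interface.
 */
#include <stdint.h>
#include <stdbool.h>

#define LEVELS       4                       /* e.g., PML4, PDPT, PD, PT */
#define PAGE_SHIFT   12                      /* 4 KiB pages              */
#define INDEX_BITS   9                       /* 512 entries per table    */
#define PTE_PRESENT  0x1ULL                  /* entry-valid bit          */
#define PTE_ADDR     0x000FFFFFFFFFF000ULL   /* next-level address bits  */

/* Simulated physical memory; a real PTW issues memory requests instead. */
static uint64_t phys_mem[1 << 20];

static uint64_t phys_read64(uint64_t pa)
{
    return phys_mem[pa / sizeof(uint64_t)];
}

/* Walk the page tables rooted at root_pa and translate va into a PA.
 * Returns false (i.e., a page fault) if any level is not present. */
bool page_table_walk(uint64_t root_pa, uint64_t va, uint64_t *pa_out)
{
    uint64_t table_pa = root_pa;

    for (int level = LEVELS - 1; level >= 0; level--) {
        /* Extract the 9-bit index that selects the entry at this level. */
        unsigned shift = PAGE_SHIFT + (unsigned)level * INDEX_BITS;
        uint64_t index = (va >> shift) & ((1ULL << INDEX_BITS) - 1);

        uint64_t entry = phys_read64(table_pa + index * sizeof(uint64_t));
        if (!(entry & PTE_PRESENT))
            return false;                    /* missing mapping           */

        table_pa = entry & PTE_ADDR;         /* next table or final frame */
    }

    /* Combine the physical frame with the page offset from the VA. */
    *pa_out = table_pa | (va & ((1ULL << PAGE_SHIFT) - 1));
    return true;
}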
However, performing the full page walk for every access is not efficient. Therefore, the second component, the TLB, stores recently used VA-to-PA mappings. The TLB serves as a cache that speeds up the VA-to-PA translation: if the same VA is accessed again within a short period, the MMU can quickly retrieve the corresponding PA from the TLB without performing the entire translation. However, due to area and latency constraints, TLBs are usually small compared to caches, and a VA is often not found in the TLB (a TLB miss). In these cases, the MMU performs the translation using the page tables stored in main memory, which is slow due to the long latency of modern main memory. Results from [24] show that a TLB miss incurs an average overhead of 135 cycles per page walk on a modern x86-64 architecture, which particularly bottlenecks memory-intensive workloads. The trend toward ever-larger caches to improve data locality results in lower TLB coverage of the cached data and, therefore, more TLB misses, with the corresponding translation overhead. Furthermore, virtual memory today is not limited to CPUs but is frequently used in accelerators and co-processors, such as APUs, where the same bottlenecks appear in an even more exacerbated form.
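The following C sketch illustrates how a TLB sits in front of the page table walk sketched above; a small direct-mapped TLB is assumed here for simplicity, whereas real TLBs are set-associative and managed entirely in hardware, so the structure and sizes are purely illustrative.

/*
 * Minimal sketch of a TLB lookup in front of the page table walk.
 * Hit: return the cached mapping. Miss: pay the page-walk cost, refill.
 */
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT  12
#define TLB_ENTRIES 64            /* small, as discussed in the text */

typedef struct {
    bool     valid;
    uint64_t vpn;                 /* virtual page number   */
    uint64_t pfn;                 /* physical frame number */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Fallback path: the multi-level walk sketched earlier (assumed). */
extern bool page_table_walk(uint64_t root_pa, uint64_t va, uint64_t *pa_out);

/* Translate va; hit in the TLB if possible, otherwise walk and refill. */
bool translate(uint64_t root_pa, uint64_t va, uint64_t *pa_out)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    uint64_t off = va & ((1ULL << PAGE_SHIFT) - 1);
    tlb_entry_t *e = &tlb[vpn % TLB_ENTRIES];

    if (e->valid && e->vpn == vpn) {          /* TLB hit: fast path */
        *pa_out = (e->pfn << PAGE_SHIFT) | off;
        return true;
    }

    /* TLB miss: perform the slow page walk, then cache the mapping. */
    uint64_t pa;
    if (!page_table_walk(root_pa, va, &pa))
        return false;                         /* page fault */

    e->valid = true;
    e->vpn   = vpn;
    e->pfn   = pa >> PAGE_SHIFT;
    *pa_out  = pa;
    return true;
}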
2.3 Performance Advantages of the APU and Its Large Address Translation Overhead
The escalating demand for server workloads, particularly in social network analytics, web search engines, and biomedical applications, underscores the growing significance of their core component, graph processing [8]. Graph processing typically uses sparse data formats such as Compressed Sparse Row/Column (CSR/CSC) to manage large amounts of data efficiently, and Sparse Matrix-Vector Multiplication (SpMV) to manipulate and process that data. SpMV, and graph processing in general, are well known to be compute- and memory-intensive tasks [8]. Therefore, heterogeneous systems, such as APU systems, have been proposed to improve performance for such workloads by introducing dedicated acceleration units, i.e., an iGPU. Compared with a CPU alone, an APU can exhibit substantial benefits.
To closely examine the advantages of the APU over the CPU and, at the same time, evaluate the room for improvement, we analyzed the performance of the state-of-the-art graph benchmark suite Pannotia [7] running on the gem5 architectural simulator [5]. The gem5 simulator is configured to mimic a real system. Table 1 shows the parameters used in the experiments, most of which come from the default simulator settings, with some scaled down to better match the behavior of a real system.
Figure 2 gives the speedup of the APU system over the CPU-only system for each application in the Pannotia benchmark. The results show that the APU system surpasses the CPU-only system by a considerable margin, with an overall performance improvement of approximately 132 times, even when handling irregular GPU workloads. This observation underscores the value of GPU-accelerated architectures, especially in application domains characterized by irregular workloads, where their performance and efficiency advantages can be fully leveraged.
Despite the tremendous advantages brought by the APU system, the irregular memory access patterns of sparse data cause SpMV and graph processing applications to incur a large memory address translation overhead and the corresponding performance degradation [11, 16, 29].
The fundamental reason behind this phenomenon is that, while cache sizes have increased over the decades to mitigate the impact of data cache misses, TLBs have not followed the same growth: the range of physical memory that the TLB can cover has not kept pace, leading to low TLB coverage of the cached data and a large number of TLB misses and page walks. Given the performance cost of a page walk, this introduces long address translation times and performance overhead. Moreover, a larger cache makes the problem more severe, as the PTW must spend additional cycles traversing the cache hierarchy to map the virtual addresses used by applications to the physical addresses used by the hardware. The work in [11] shows that address translation overhead in existing systems increases with cache size.
This phenomenon can be observed in Figure 3, which shows the number of misses per thousand instructions (MPKI) for both the TLB and the last-level cache (left y-axis) for each application of the Pannotia benchmark. TLB misses induce MMU overhead, which we define as the percentage of total CPU cycles spent on address translation and plot as a red dashed line against the right y-axis.
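For clarity, the two metrics plotted in Figure 3 can be restated as follows (the notation is chosen here only for illustration):

\[
\text{MPKI} = \frac{\text{number of misses}}{\text{number of executed instructions}} \times 1000,
\qquad
\text{MMU overhead} = \frac{\text{cycles spent on address translation}}{\text{total CPU cycles}} \times 100\%.
\]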
On the one hand, benchmarks like BC, Colormax, and Colormin have relatively low MPKI values for both the TLB and the cache, suggesting efficient utilization of memory resources. The low MPKI values may indicate regular or semi-regular memory access patterns and good spatial and temporal data locality, which minimize memory misses and thereby improve overall performance. On the other hand, benchmarks like FW, PagerankSPMV, Pagerank, and SsspELL have higher MPKI values for the TLB, the cache, or both, suggesting more complex or irregular memory access patterns. This could be due to the nature of the problems they solve, which may involve complex data structures or irregular algorithms. High MPKI values indicate poor spatial and temporal data locality, leading to increased cache and TLB misses and, consequently, high pressure on both the cache and the MMU.
Furthermore, to quantify the overhead induced by TLB misses, we compared the experimental results with a system featuring an ideal MMU (i.e., an MMU with zero-cycle address translation overhead). This overhead is depicted as the red dashed line in Figure 3, plotted against the right y-axis. PagerankSPMV and SsspELL, the two benchmarks with the highest TLB miss rates, show the most dramatic overheads of approximately 50% and 37%, respectively, consistent with their high TLB MPKI values. These overheads underscore the importance of effective memory management for irregular workloads and reveal that the APU is significantly affected by address translation, degrading overall performance.
Similar behavior is present in CPU systems. For example, [11, 31] show that traditional CPU-only systems suffer from around 17% address translation overhead, reaching more than 30% as the size of the cache increases.
Recognizing both the promising capability of the APU system for irregular workloads and the similar address translation problems that CPUs and APUs face, we are motivated to implement IAS in the APU system; the anticipated benefits of this integration are elucidated in the subsequent sections. In summary, the results show that address translation is a serious problem in modern systems, both CPU and APU. The next section discusses the state of the art in mitigating this translation overhead.
2.4 State-of-the-art
Researchers have proposed a variety of solutions to the translation overhead problem. First, to extend the address range the TLB can cover, Kwon et al. propose Ingens [19], a framework for automatic huge page support in OSs. Ingens promotes or demotes huge pages according to the number of physically resident pages and their access frequency. The experimental results demonstrate that Ingens can mitigate tail latency and memory bloat, significantly improving performance for essential applications such as web services and Redis [19]. However, huge pages can also cause other problems, for example internal fragmentation and memory waste.
To improve the efficiency of address translation in GPU systems, researchers propose a second method, Mosaic [3], which uses address-translation-aware caches and memory management algorithms to significantly reduce address translation overhead. However, the proposal cannot handle intermediate page sizes larger than 4 KB, and [20] shows that Mosaic does not work well with low-contiguity pages and struggles with workloads of large memory footprints due to its 2 MB page-size limitation.
Several researchers have proposed the idea of virtual caches. In particular, Wood et al. [28] propose the use of a global VA space for addressing caches. However, their approach translates from virtual to global virtual address spaces using fixed-size program segments, which more flexible paging systems have since replaced, and their simulation methodology is trace-driven and cannot estimate overall full-system performance. A similar idea has also been applied to GPU addressing: Yoon et al. [29] propose turning the physical cache hierarchy of a GPU system into a virtual one. Virtual caches take the TLB off the critical path, moving address translation to the memory side. In such systems, data is addressed in the cache using each process's private VA as a namespace. With a virtually indexed cache, the GPU can retrieve data from the cache immediately, so the cache filters most TLB misses, significantly reducing address translation overhead [29]. However, the method requires control logic to resolve synonyms and homonyms across VAs (i.e., multiple VAs mapping to the same PA, or a single VA mapping to multiple PAs). Consequently, virtual cache hierarchies are difficult to incorporate into modern systems [11].
The use of IAs is another promising solution. Zhang et al. [30] propose an IA space that translates VAs at a 256 MB granularity. However, it works well only for GB-scale memory and is ineffective for TB-scale memory. Although they simulate performance using a full-system simulator called Mambo [6], it is not open source, is available only on limited platforms, and supports only the PowerPC architecture. Hajinazar et al. [13] propose an IA space consisting of fixed-size virtual blocks for application use. However, implementing such an address space requires additional tools and software modifications, and their evaluation is based on a modified DRAM simulator, Ramulator [17], which estimates performance from collected traces of representative regions of each benchmark and therefore cannot reflect the overall performance improvement.
Designed for modern data center systems, the authors of [11] propose an IA approach called the Midgard address space. This approach flexibly uses variable-size VMAs, turning the conventional two address spaces into three. The main contribution of this work is that, by using three different address spaces in the system, the address translation overhead can be greatly reduced.
The mapping mechanism from VAs to IAs is depicted in Figure 4(a). The Midgard IA space employs the OS concept of Virtual Memory Areas (VMAs) to produce a single IA space into which the VMAs of all processes can be mapped uniquely. A VMA is a contiguous range of memory used by the OS to manage and keep track of the VA space allocated to a process. When many processes are present in a system, shared libraries are assigned to the same IA range, while private data is assigned to separate IA ranges; each private VA is mapped to a unique IA through the IA mapping process. Unlike a conventional system, which translates VAs to PAs in fixed-size units, the IAS system translates VAs into IAs at the granularity of VMAs. In real-world systems, programs use far fewer VMAs than pages to describe their VA space, so fewer hardware resources are required for the VA-to-IA translation than for conventional virtual-to-physical translation. Similarly, the smaller number of mappings from VMAs to Intermediate Memory Addresses (IMAs) causes fewer TLB misses in the front-end translation, which is therefore lightweight. The back-end translation from the IA space to the PA space, in contrast, operates at page granularity and needs more hardware resources, such as TLB entries and multi-level page tables. Moreover, on a TLB miss at the back-end MMU, the PTW must walk through the page table level by level, first in the cache and then in memory on a cache miss; since the latency of accessing memory is much higher than that of accessing the cache, this back-end translation is heavyweight. However, thanks to the larger cache in front of it, a high number of back-end TLB misses can be filtered. We show the impact of the cache in the next section.
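The following C sketch illustrates the lightweight front-end lookup at VMA granularity described above; the structures and the linear search are illustrative assumptions and do not represent the actual Midgard hardware.

/*
 * Minimal sketch of a front-end VA-to-IA mapping at VMA granularity.
 * Each VMA is a variable-size contiguous range, so only a handful of
 * entries are needed per process, unlike page-granularity translation.
 */
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

typedef struct {
    uint64_t va_start;   /* first VA covered by this VMA            */
    uint64_t length;     /* size of the VMA in bytes (variable)     */
    uint64_t ia_base;    /* base of the VMA's range in the IA space */
} vma_mapping_t;

/* Translate a VA to an IA by finding the VMA that contains it. */
bool va_to_ia(const vma_mapping_t *vmas, size_t n_vmas,
              uint64_t va, uint64_t *ia_out)
{
    for (size_t i = 0; i < n_vmas; i++) {
        if (va >= vmas[i].va_start &&
            va <  vmas[i].va_start + vmas[i].length) {
            /* The offset within the VMA is preserved; only the base changes. */
            *ia_out = vmas[i].ia_base + (va - vmas[i].va_start);
            return true;
        }
    }
    return false;        /* VA not backed by any VMA */
}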
By implementing this mapping mechanism in QEMU [4], the authors of [11] evaluate the full potential of the Midgard address space and show a reduction of more than 30% in address translation overhead compared to a conventional system. However, their work relies on traces collected from emulation and uses average memory access time as its metric, which cannot capture cycle-accurate results or the overall system performance improvement. To evaluate the overall performance improvement, we introduce IAS into the cycle-accurate full-system gem5 simulator. We discuss the implementation of IAS in more depth in the next section.