
FlexPointer: Fast Address Translation Based on Range TLB and Tagged Pointers

Published: 01 March 2023

    Abstract

    Page-based virtual memory relies on TLBs to accelerate address translation. Nowadays, the gap between application working sets and TLB capacity continues to grow, bringing many costly TLB misses and making the TLB a performance bottleneck. Previous studies seek to narrow the gap by exploiting the contiguity of physical pages. One promising solution is to group pages that are both virtually and physically contiguous into a memory range. Recording range translations can greatly increase the TLB reach, but ranges are also hard to index because they have arbitrary bounds. The processor has to compare against all the boundaries to determine which range an address falls in, which restricts the usage of memory ranges.
    In this article, we propose a tagged-pointer-based scheme, FlexPointer, to solve the range indexing problem. The core insight of FlexPointer is that large memory objects are rare, so we can create memory ranges based on such objects and assign each of them a unique ID. With the range ID integrated into pointers, we can index the range TLB with IDs and greatly simplify its structure. Moreover, because the ID is stored in the unused bits of a pointer and is not manipulated by address generation, we can shift the range lookup to an earlier stage, working in parallel with address generation. According to our trace-based simulation results, FlexPointer can eliminate nearly all the L1 TLB misses and page walks for a variety of memory-intensive workloads. Compared with a 4K-page baseline system, FlexPointer shows a 14% performance improvement on average and up to 2.8x speedup in the best case. For other workloads, FlexPointer shows no performance degradation.

    1 Introduction

    Virtual memory is a fundamental component of modern computer systems and has long been an active research area [23, 25, 30, 57, 58]. It offers each process the illusion of an exclusive, large address space, shifting the burden of memory management and process isolation from programmers to the operating system and hardware. Page-based virtual memory is the most widely adopted implementation of a virtual memory system: it divides physical memory into fixed-size pages and uses a per-process page table to map virtual pages to physical ones. Before a memory access, the system must look up the virtual address in the page table and translate it to a physical address. To accelerate translation, modern systems rely on Translation Lookaside Buffers (TLBs) to cache the most recently used page table entries (PTEs).
    Unfortunately, the TLB itself is becoming a new performance bottleneck. For years, the memory capacity of computer systems has been growing continuously; with the help of emerging technologies, vendors now provide multiple terabytes of physical memory [27, 28, 54]. Program memory usage follows the same trend: the working set of a big-memory application can easily occupy tens of gigabytes [6, 24, 33]. However, TLB capacity has stayed largely stagnant. With the common 4 KB page size, a typical 64-entry L1 TLB covers only 256 KB of memory, far less than the working set of a modern big-memory program. When accessing an address not covered by the TLB, the system has to go through a costly page walk to fetch the page table entry from memory, which harms the program's performance. This problem is known as limited TLB reach.
    Because TLB translation lies on the execution critical path, timing constraints make it difficult to increase the L1 TLB capacity. A large L2 TLB with thousands of entries can mitigate some of the page walk impact, but previous studies show that page table walks may still induce execution-time overheads of up to 50% [6, 9, 22, 30, 33]. Another common method to ease the TLB translation pressure is to use huge pages. However, huge pages suffer from internal fragmentation, inducing memory bloat.
    To solve the limited TLB reach problem, many researchers seek to exploit the contiguity of physical pages so that one TLB entry can map multiple pages [6, 33, 48, 49, 56]. TLB coalescing techniques, e.g., CoLT [49] and the clustered TLB [48], are mostly hardware-oriented. During a page walk, the processor fetches a cache line that contains multiple PTEs; a coalescing TLB searches those PTEs for pages located in a contiguous or clustered region and maps them with one TLB entry. Restricted by the capacity of a cache line, such techniques can increase the coverage of a TLB entry only by a small multiple (usually 8–16).
    Another type of method is range translation [6, 33]. Figure 1 illustrates parts of an application's address space mapped with range translations. A range is a memory region that is both virtually and physically contiguous, and it can have an arbitrary size, so a TLB entry can map far more pages than coalescing techniques allow. To translate an address in a range, the processor adds the range's DELTA (the difference between physical and virtual addresses) to the address. However, because a range can start and end anywhere, the translation unit has to compare the address being translated against every existing range boundary. These comparisons greatly complicate the in-memory range table structure and limit the pipeline stages at which it can be used.
    Fig. 1. Range translation.
    In this article, we propose FlexPointer, a novel technique to tackle the challenges in expanding the TLB reach. Our key observation is that a program allocates only a few very large objects. If we create memory ranges from these few large objects, then we can assign every range a unique ID and record them in a small table. Conventional 64-bit systems typically adopt a 48-bit virtual address space, so we leverage the unused bits in a pointer to pass the range ID to hardware, implementing a fast indexing mechanism for ranges. Although processor vendors are beginning to extend the virtual address space to 57 bits with the 5-level paging feature [29, 54], current mainstream applications have no hard requirement for a 57-bit space. So in this work, we focus on systems adopting a 48-bit virtual address space.
    With the help of this unique ID, FlexPointer greatly simplifies the structure of the range TLB (RTLB) by removing the need for multiple expensive comparisons in hardware. Furthermore, the ID allows us to move the range TLB lookup to an earlier pipeline stage and improve latency. FlexPointer eliminates a large proportion of L1 TLB misses, thus avoiding many L2 TLB searches and reducing page walks. Moreover, previous range-based methods had to organize the in-memory range table as a B-tree or a linked list [33] because of the indexing problem; with the range ID, we can organize it as a simple array, simplifying the range table walk when a rare RTLB miss happens.
    We modify kernel memory management functions to provide contiguous physical pages for large objects. According to our trace-based simulation results, FlexPointer can eliminate nearly all the L1 TLB misses and page walks for a variety of memory-intensive programs. Further performance model evaluations against a 4K-page baseline system show that our method offers a 14% performance improvement on average, and up to 2.8x speedup in the best case. FlexPointer requires no changes to application source code and only minor modifications to the hardware.

    2 Background

    2.1 Virtual Memory Infrastructures

    One of the primary goals of the virtual memory system is to provide each process with the illusion of a private address space [18, 19, 30]. In page-based virtual memory, the OS partitions the address space of a process into virtual pages and maps them to physical pages; hardware then performs address translation according to these mappings at runtime. The architecture defines how software and hardware cooperate. In this section, we focus on the virtual memory system of the x86-64 architecture.
    The x86-64 architecture supports three page sizes: 4 KB, 2 MB, and 1 GB. The base page size is 4 KB, while 2 MB and 1 GB pages are called huge pages. The OS maintains a page table for every process, which is itself stored in memory as part of the virtual address space. The page table uses PTEs to record mappings from virtual pages to physical frames, as well as information for security and other purposes. To save space, the page table is usually organized as a radix tree. For every memory access, hardware must first translate the virtual address to a physical address with the page table. This lookup process is called a page table walk, and it requires multiple memory reads; e.g., translating a 4 KB page on a 4-level-paging x86 architecture requires four reads. To accelerate translation, processors use TLBs to cache recently used PTEs. Since page walks are still costly, often requiring hundreds of cycles, processors also adopt MMU caches to hold recently walked upper-level PTEs.
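    To make the walk cost concrete, the following is a minimal C sketch (ours, not taken from any real MMU implementation) of a 4-level x86-64 walk for a 4 KB page; phys_read64() is a hypothetical hook into simulated physical memory, and huge pages, permission bits, and MMU caches are ignored.

        #include <stdint.h>

        #define PAGE_SHIFT  12
        #define INDEX_BITS  9                       /* 9 bits of VA per level */
        #define INDEX_MASK  ((1ULL << INDEX_BITS) - 1)
        #define PTE_PRESENT 0x1ULL
        #define PFN_MASK    0x000FFFFFFFFFF000ULL   /* bits 51..12 of a PTE */

        extern uint64_t phys_read64(uint64_t paddr);  /* hypothetical memory hook */

        /* Walk PML4 -> PDPT -> PD -> PT: four dependent memory reads. */
        uint64_t page_walk(uint64_t pml4_base, uint64_t vaddr)
        {
            uint64_t table = pml4_base;
            for (int level = 3; level >= 0; level--) {
                uint64_t idx = (vaddr >> (PAGE_SHIFT + level * INDEX_BITS)) & INDEX_MASK;
                uint64_t pte = phys_read64(table + idx * sizeof(uint64_t));
                if (!(pte & PTE_PRESENT))
                    return 0;               /* not mapped: page fault */
                table = pte & PFN_MASK;     /* next-level table, or the final frame */
            }
            return table | (vaddr & ((1ULL << PAGE_SHIFT) - 1));
        }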
    The OS kernel has several software components for memory management. A user program gets virtual memory from the OS by calling syscalls like brk() or mmap(), and the OS will either adjust existing virtual memory areas or create a new one. Physical memory, on the other hand, is managed by the buddy allocator, which tracks all of the free physical memory with free lists of power-of-two-sized blocks. In most cases, the program does not get physical memory immediately after calling such syscalls. Actual allocation of physical memory is delayed until the first access to the newly allocated virtual area, which triggers a page fault and invokes the buddy allocator. This strategy is called demand paging.

    2.2 TLB Reach Increasing Techniques

    Huge pages. Since one TLB entry records a recently used PTE of a single page, an intuitive approach to increasing the TLB reach is to use a larger page size to increase the coverage of a TLB entry. Huge pages are common in modern computer systems [44, 50, 59]. The x86-64 architecture supports huge pages of 2 MB and 1 GB, backed by various OS mechanisms, e.g., Transparent Huge Pages (THP) [14] and HugeTLBFS [15] in Linux. Available page sizes in x86-64 are sparse, which can cause internal fragmentation. For example, an allocation request of 1 MB requires 256 PTEs when using 4 KB base pages, but wastes half of the memory if served with a 2 MB huge page. The lack of intermediate page sizes puts some allocations in a dilemma.
    Some architectures provide more page size choices. Intel Itanium [12] separates the address space into eight regions and allows each region to configure its own page size. Itanium organizes huge pages with a hash page table, but without aggressive modification to the conventional 4-level page table, it can only be used to reduce page walk overheads. HP Tunable Base Page Size [20] allows the OS to adjust the base page size; it still faces internal fragmentation, so HP advises keeping the base page size at 16 KB or less. Shadow Superpage [21, 55] adds a new translation level in the memory controller to merge non-contiguous physical pages into a huge page in a shadow memory space. Although it grants the TLB a larger coverage, all memory traffic must be translated again in the memory controller, adding extra latency to every memory access.
    Tailored Page Sizes (TPS) [24] expands the available page sizes to all powers of two. A tailored page consists of \(2^N\) base pages and is aligned on its size. For a properly aligned tailored page, TPS makes the PTE of the first base page the real PTE and the others alias PTEs. Because larger pages need fewer page frame number (PFN) bits, TPS uses the spare bits of the real PTE to identify the tailored page size. To correctly fetch the real PTE during a page walk, all alias PTEs point to the real one, requiring one additional memory access per walk.
    TLB coalescing. TLB coalescing approaches [48, 49, 56] leverage default OS allocator behavior to pack multiple PTEs into one TLB entry. Because of locality, a default on-demand OS allocator tends to map a small set of contiguous virtual pages either to contiguous physical pages (Sub-blocked TLB [56] and CoLT [49]) or to a clustered set of physical pages (Clustered TLB [48]). The main drawback of such techniques is that the contiguity and clustering generated by a default allocator are limited.
    Most of these techniques are pure hardware solutions. They check PTEs in the same cache line and determine whether they can be packed into one TLB entry, so their effectiveness is restricted by this relatively small search window. Anchor TLB [46] instead searches the page table to achieve a more global view: it inserts anchor entries at a fixed interval, searches for physical contiguity starting from each anchor entry, and records the end address of the contiguous region. Upon a TLB miss, Anchor TLB searches the TLB again for the corresponding anchor entry and compares the address being translated with the recorded end address; if the address lies in the contiguous region, it can be translated using the anchor entry. As this comparison increases the lookup cost, Anchor TLB only searches for anchor entries after an L2 TLB miss.
    Segment-like techniques. Some early processors used segments to manage virtual memory. A segment can be viewed as a mapping from contiguous virtual memory to contiguous physical memory. Since a segment can be much larger than a page, some approaches partly bring back segmentation by separating special areas of the virtual address space to increase translation coverage. Direct Segment [6, 22] allows the programmer to explicitly set up one segment for a big-memory application: two registers mark the start and end of the segment, and any virtual address in this area is translated by adding the offset between the virtual and physical start addresses. Do-it-yourself Virtual Memory Translation (DVMT) [2] similarly adds two registers to carve a special area out of the virtual address space, but it allows more complicated translations, launching a dedicated thread for translation when an address in the special area is accessed. Direct Segment and DVMT both require intensive source code modifications, and their separated virtual areas do not appear in the conventional page table.
    Redundant Memory Mappings (RMM) [33] pave another way by adding a separate range table. For a sufficiently large allocation, RMM eagerly allocates contiguous physical pages instead of demand paging. This pre-allocation strategy creates large memory ranges that are both virtually and physically contiguous, and addresses in those ranges can be translated by adding an offset, as in Direct Segment [6]. RMM adds a standalone range table to hold the information of such ranges, while the conventional page table still holds the complete memory mapping. Compared with Direct Segment, RMM supports multiple ranges and is transparent to programmers. However, RMM has to compare an address against all range boundaries to decide which range it belongs to; such comparisons are expensive, so RMM queries the range TLB only after an L1 TLB miss.

    2.3 Tagged Pointers

    Tagged pointers are widely used in the memory safety field, where detecting a violation often requires finding an object's information from only a pointer; this is very similar to the indexing problem of memory ranges. The use of tagged pointers in memory protection techniques falls into two main categories: one uses the tag to search a metadata table [10, 35, 43, 52, 53, 61], and the other uses the tag and the address to calculate object boundaries [11, 37, 38]. The former needs to assign every object an ID, but since some programs allocate a huge number of objects, such techniques must either compress the ID or change the pointer representation to gain more spare bits. Calculating the object boundary, on the other hand, often puts extra alignment and/or size constraints on memory allocations, which can cause memory fragmentation.

    3 FlexPointer

    RMM observes that many applications naturally exhibit an abundance of contiguity in their virtual address space, and that only a few memory ranges are needed to cover over 90% of an application's memory [33]. Other researchers find that across a variety of applications, the vast majority of memory objects are small and generally short-lived [13, 31], while the remaining few large objects have long lifetimes in scientific and some commercial/desktop applications. Moreover, applications typically reach nearly their peak heap footprint at an early execution stage [31].
    We further investigate the proportion of memory taken up by large objects in a variety of benchmarks. Table 1 shows the number of objects above 64 KB and their proportion of the peak memory footprint. In many memory-intensive applications, the footprints of these objects cover over 90% of the application's virtual memory. Furthermore, an application has no more than a few hundred such objects live concurrently.
    Table 1.
    Benchmarks        Total Size (KB)   Peak VmSize (KB)   Memory Coverage   #Object
    602.gcc                78,880          7,845,712            1.005%         273
    605.mcf             5,045,095          5,046,456           99.973%           7
    607.cactuBSSN       6,890,386          6,897,016           99.904%         178
    620.omnetpp            14,142            257,344            5.495%           7
    623.xalancbmk          16,114            502,880            3.204%          12
    631.deepsjeng       7,043,850          7,056,332           99.823%           6
    654.roms           10,621,035         10,657,684           99.656%         111
    657.xz             15,708,818         15,716,644           99.950%           5
    graph500            3,474,268          3,477,504           99.907%          25
    XSBench             5,796,949          5,797,836           99.985%           8
    gups                8,401,210          8,401,880           99.992%           6
    NPB:CG             20,934,758         20,935,076           99.998%          30
    NPB:MG              3,547,354          3,547,712           99.990%          20
    Table 1. Large Object (>64 KB) Number and Coverage
    The above results lead to a key observation: we can deliberately assign contiguous physical pages to large objects and create memory ranges from them. Since such objects cover a great proportion of the whole program address space, creating virtually and physically contiguous ranges from them allows many addresses to be translated with range translation, thus vastly expanding the TLB reach. On the other hand, because there are only a few hundred such objects, we can assign each range a unique ID and pass the ID to hardware through tagged pointers. Conventional 64-bit systems do not use the high 16 bits of a pointer, except for the most significant bit, which distinguishes user and kernel pointers. The 15 unused bits are enough to encode IDs for a few hundred ranges. With the help of unique IDs, we can easily find the range translation for a virtual address.
    Based on this observation, we propose a tagged-pointer-based technique, FlexPointer, to break the TLB reach bottleneck. With the help of the range ID, FlexPointer greatly simplifies the range TLB (RTLB) structure, turning it into a traditional cache. Moreover, because address generation only manipulates the low 48 bits of a virtual address, there is no need to wait for it to finish before starting the range lookup. In this way, FlexPointer can begin the range translation in parallel with address generation, reducing both L1 TLB misses and page walks with simple hardware.

    3.1 Overview

    FlexPointer adopts the range translation concept of RMM [33] and Direct Segment [6]. A range consists of contiguous virtual pages mapped to contiguous physical pages. Pages in a range have uniform protection bits, e.g., read/write/execute. A range is base-page-aligned and has an arbitrary number of pages. We use two addresses, BASE and LIMIT, to define a range. Because of the virtual and physical contiguity, addresses in a range share a common DELTA, defined as \(physical\_address - virtual\_address\); to translate a virtual address with range translation, the processor adds DELTA to it. Figure 1 shows parts of an application's address space mapped with range translations.
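    As a minimal illustration of this translation rule (a sketch under our own naming, not the paper's code), a range and its lookup can be expressed as:

        #include <stdint.h>

        /* A range maps [base, limit) of virtual space onto contiguous
         * physical memory; delta = physical_address - virtual_address. */
        typedef struct {
            uint64_t base;    /* inclusive virtual start (page-aligned) */
            uint64_t limit;   /* exclusive virtual end */
            int64_t  delta;   /* shared by every address in the range */
        } range_t;

        /* Returns the physical address, or 0 on a miss (sketch convention). */
        static inline uint64_t range_translate(const range_t *r, uint64_t va)
        {
            if (va >= r->base && va < r->limit)
                return (uint64_t)((int64_t)va + r->delta);
            return 0;
        }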
    A system using range translations has three main components: (i) the creation of memory ranges, (ii) the management of range information, and (iii) the hardware that efficiently utilizes range translations. As stated before, FlexPointer creates a range and assigns it a unique ID when it receives an allocation request of a very large size. To ensure the physical contiguity of a range, we use an eager paging strategy: instead of allocating physical pages at the first access, eager paging allocates them at allocation request time. We instrument malloc() functions to catch allocation requests and modify kernel memory management functions, e.g., mmap(), to eagerly allocate physical memory. We also add a kernel range table to record range translations. In our current prototype, mappings in the range table are also kept in the page table to ensure the correctness of other memory subsystems.
    Similarly, we use a range TLB to utilize range translations in hardware, as in RMM [33]. But we can pass the range ID to hardware through the pointer tag, allowing the range TLB to work in parallel with address generation. During a memory access, the processor decides whether to search the range TLB or the page TLB according to the pointer tag; the tag also decides which table to walk when a range/page TLB miss happens.

    3.2 Range Creation

    Figure 2 shows how FlexPointer creates ranges. In FlexPointer, we divide allocation requests into two categories according to request size. Small allocation requests are served by the default malloc() procedure, while for large allocation requests, malloc() directly calls a newly added syscall, mmap_range(), which creates a virtual memory area for the request and invokes the buddy allocator to eagerly allocate physical memory.
    Fig. 2. Creation of ranges.
    We implement a wrapper function for malloc() that classifies allocation requests by a pre-defined threshold. According to a previous study, setting the threshold to 48 KB or lower may classify too many objects as large, e.g., more than 40,000 in gcc [13]. The same study also shows that a 64 KB threshold keeps the number of large objects under 1,000 for many applications, which the results in Table 1 confirm. So in our wrapper function, we set the threshold to 64 KB, as in the sketch below.
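    A minimal sketch of the wrapper logic follows (flex_malloc is a hypothetical name; mmap_range() is the paper's new syscall, shown here as a plain prototype):

        #include <stddef.h>
        #include <stdlib.h>

        #define LARGE_OBJ_THRESHOLD (64 * 1024)  /* 64 KB, per Table 1 */

        /* The new syscall: builds the VMA, eagerly allocates contiguous
         * physical pages, and returns a pointer tagged with the range ID. */
        extern void *mmap_range(size_t size);

        /* Wrapper installed in place of malloc(): large requests become
         * memory ranges, small ones take the default allocator path. */
        void *flex_malloc(size_t size)
        {
            if (size >= LARGE_OBJ_THRESHOLD)
                return mmap_range(size);
            return malloc(size);
        }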
    The eager paging strategy ensures the newly allocated virtual pages are backed by physical pages, but we also need to modify the buddy allocator to provide physical contiguity. Various OS proposals offer allocation strategies that maximize memory contiguity [3, 39, 45, 62]. Among them, contiguity-aware (CA) paging [3] introduces an indexing structure, contiguity_map, to record physical contiguity in the system; we adopt this structure in our modified buddy allocator. We put no rounding constraint or upper limit on the range size and only require a range to be aligned on the base page size (4 KB on x86 architectures).
    When mmap_range() is called, it inserts a new entry into the range table and uses the index of the entry as the range ID. We maintain a FIFO free index list to help find the inserting position of a new entry: if the free index list is empty, the new entry is inserted at the end of the range table; otherwise, mmap_range() takes an index from the free list. In the munmap() syscall, the kernel checks the high bits of the incoming address to judge whether it is unmapping a range. For a tagged address, munmap() reclaims the corresponding range table entry and puts its index back on the free list. We discuss the structure of the range table in Section 3.4.
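    The ID recycling can be as simple as a ring buffer; the sketch below uses our own names and assumes IDs 0x0 and 0x1FFF are reserved (Section 3.3):

        #define MAX_RANGE_ID 0x1FFE   /* 0x0 and 0x1FFF are reserved */

        static int free_ids[MAX_RANGE_ID + 1];  /* FIFO of reclaimed IDs */
        static int head, tail, count;           /* ring-buffer state */
        static int next_fresh = 1;              /* next never-used index */

        /* Pick the ID for a new range: reuse a freed slot first, else
         * grow the table; -1 tells mmap_range() to fall back to
         * demand paging when the table is full. */
        int alloc_range_id(void)
        {
            if (count > 0) {
                int id = free_ids[head];
                head = (head + 1) % (MAX_RANGE_ID + 1);
                count--;
                return id;
            }
            if (next_fresh <= MAX_RANGE_ID)
                return next_fresh++;
            return -1;
        }

        /* munmap() puts a reclaimed entry's index back on the list. */
        void free_range_id(int id)
        {
            free_ids[tail] = id;
            tail = (tail + 1) % (MAX_RANGE_ID + 1);
            count++;
        }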

    3.3 Pointer Encoding

    Figure 3 illustrates our pointer format. Conventionally, the most significant bit of a 64-bit pointer distinguishes user and kernel pointers, so we leave this bit alone. Of the 15 remaining unused bits, we use 13 for ID storage; the other 2 bits are left for future use, e.g., as selection bits to cooperate with other tagged-pointer techniques. Normally, the high bits of an untagged pointer are all 0s or all 1s, so we exclude 0x0 and 0x1FFF from the ID space. According to the object counts in Table 1, no benchmark examined produces more than 1,000 ranges with a 64 KB threshold, so the 8,190 available IDs of this encoding are more than enough for current benchmarks. The tag manipulation itself is sketched after the figure.
    Fig. 3. Pointer format.
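    Tagging and untagging come down to a few bit operations. In this sketch, the 13-bit ID is assumed to sit in bits 60..48 with the two spare bits above it; the exact placement follows Figure 3:

        #include <stdbool.h>
        #include <stdint.h>

        #define TAG_SHIFT 48
        #define TAG_MASK  0x1FFFULL   /* 13-bit range ID */

        static inline uint64_t tag_pointer(uint64_t p, uint16_t id)
        {
            p &= ~(TAG_MASK << TAG_SHIFT);
            return p | (((uint64_t)id & TAG_MASK) << TAG_SHIFT);
        }

        static inline uint16_t pointer_tag(uint64_t p)
        {
            return (uint16_t)((p >> TAG_SHIFT) & TAG_MASK);
        }

        /* High bits of an untagged canonical pointer are all 0s or all
         * 1s, which is exactly why IDs 0x0 and 0x1FFF are excluded. */
        static inline bool is_range_tagged(uint64_t p)
        {
            uint16_t t = pointer_tag(p);
            return t != 0x0 && t != 0x1FFF;
        }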

    3.4 Range Table

    FlexPointer adds a per-process range table to record range translations. Figure 4 shows the structure of a range table entry. BASE and LIMIT define the range in the virtual address space, and DELTA defines it in the physical space. Current systems use a 48-bit address space, so the virtual page number (VPN) and physical page number (PPN) are both 36 bits; since DELTA is the difference between a PPN and a VPN, it also takes 36 bits to store. In the conventional page table, a PTE uses its low 12 bits to record attributes of the page, e.g., r/w bits. We adopt the same attributes in our range table (Attr in Figure 4); a possible packing of the entry is sketched after the figure.
    Fig. 4. Structure of a range table entry.
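    The fields fit in 32 bytes; the C bitfield sketch below fixes one possible packing (field widths follow the text and Figure 4, but the exact layout is our assumption):

        #include <stdint.h>

        typedef struct {
            /* word 0 */
            uint64_t base_vpn  : 36;  /* first virtual page of the (sub-)range */
            uint64_t attr      : 12;  /* r/w/x and other PTE-style attributes  */
            uint64_t nrid      : 13;  /* next sub-range ID                     */
            uint64_t last      : 1;   /* L: last sub-range in the chain        */
            uint64_t dummy     : 1;   /* D: dummy entry, translate via paging  */
            uint64_t           : 1;
            /* word 1 */
            uint64_t limit_vpn : 36;  /* one past the last virtual page        */
            uint64_t prid      : 13;  /* previous sub-range ID                 */
            uint64_t           : 15;
            /* word 2 */
            uint64_t delta     : 36;  /* PPN - VPN, shared by the whole range  */
            uint64_t           : 28;
            /* word 3 */
            uint64_t reserved;        /* pads the entry to 32 bytes            */
        } range_entry_t;

        _Static_assert(sizeof(range_entry_t) == 32, "one entry per 32 bytes");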
    The range table is organized as an array. When handling a large allocation request, mmap_range() needs to assign the range an ID. As mentioned in Section 3.2, we use a free index list to guide the insertion of an entry: mmap_range() gets an index from the free list, uses it as the range ID, and records it in the high bits of the returned pointer. If there is no free entry in the range table, mmap_range() falls back to the default demand paging strategy.
    We use the Next Range ID (NRID) and Previous Range ID (PRID) fields to handle mremap(). NRID and PRID are both 13 bits, the same length as the range ID in the pointer tag. When mremap() is called to enlarge a range, there may not be enough physical memory for in-place resizing. One possible solution is to fall back to normal paging, but to further exploit contiguity in such cases, we let mremap() create a sub-range: the extended part gets a contiguous physical region of its own, which may not be contiguous with the original range. A sub-range inherits the original range's ID because they are the same object from the perspective of the user program. However, the sub-range needs its own range table entry because it has a different DELTA.
    When mremap() is called, it first checks the high bits of the input pointer. For a tagged pointer, mremap() fetches the corresponding range table entry (the last sub-range entry if multiple sub-ranges exist) and checks whether there are enough free physical pages to resize in situ. If so, mremap() initializes the pages and updates the range table entry; otherwise, it takes a free entry from the free list, records the translation information there, and links it to the last sub-range entry of the range. Sub-range entries form a doubly linked list, connected through NRID and PRID; the L bit indicates whether a sub-range is the last one in a range. The index taken by a sub-range entry never appears in a pointer tag until the sub-range is freed.
    A possible, though not yet observed, scenario is that a program calls mremap() to expand an under-threshold object such that the expanded object exceeds the threshold. In that case, mremap() should create a sub-range for the increment and tag the returned address. The tricky part is that addresses within the original object are now also tagged: a tagged address should be translated using the range table, but the original part is mapped to non-contiguous physical pages and should be translated with the page table. To solve this problem, we create a dummy sub-range representing the original under-threshold object and mark it with the D bit in its range table entry. In a dummy entry, we store an upper-level PTE address (a PDE or PDPTE in x86-64) in DELTA and Attr to reduce the page table walk overhead. The returned address is tagged with the index of the dummy entry because it is the first sub-range.

    3.5 Range TLB

    Like RMM [33], FlexPointer relies on a range TLB to utilize range translations. We organize the range TLB as a fully-associative cache, as shown in Figure 5; the field Du indicates whether an entry is a dummy one. The range ID greatly simplifies the lookup: where RMM performs two comparisons per entry to complete an address lookup, FlexPointer needs only a single equality test per entry. Moreover, because our range ID does not participate in address generation, the search can start in parallel with address generation instead of after it. As a result, we can shift the range TLB lookup to an earlier stage.
    Fig. 5. A fully-associative range TLB.
    Although indexing the range TLB requires only the range ID, we still record BASE and LIMIT in each range TLB entry. First, we must compare the address with BASE and LIMIT for safety. Since the DELTA for translation is already available after the ID equality test, this comparison can be done in parallel with the fetch from the data cache. On a comparison failure, we raise an RTLB miss signal to cancel the cache access and start a range table walk; if the address is illegal, the range walk cannot find a corresponding entry and will eventually trigger a security fault.
    Second, mremap() may create sub-ranges that share the same range ID. To simplify the range TLB lookup, we record only the most recently used sub-range, keeping IDs unique in the range TLB. As a result, we need a boundary check to make sure the address falls in the right sub-range. On a mismatch, the processor sends an RTLB miss signal, cancels the cache access, and starts a range table walk to fetch the correct sub-range from memory.
    If an application creates too many sub-ranges and frequently switches between them, the range TLB will suffer frequent range table walks. We examined the benchmarks and found that most of them rarely use mremap(); gcc is an exception that frequently grows a range by one page. The number of sub-ranges is closely related to the system fragmentation level and the allocation strategy of the buddy allocator; for example, with a proper page reservation strategy, the buddy allocator may satisfy most of gcc's remapping requests in situ. Another possible solution for the sub-range problem is adding a second level of range TLB, as discussed in Section 4.6. In this article, we assume that physical contiguity is abundant and mremap() does not generate sub-ranges; we leave a detailed analysis of sub-range counts and their impact on the range TLB for future work.
    Figure 6 illustrates the hardware components and workflow of FlexPointer. At the address generation stage, the processor decides whether to query the range TLB or the page TLB according to the high bits of the address. If they are all 0s or all 1s, the address is translated by the original page TLB after address generation; otherwise, the processor immediately uses the range ID in the pointer to search the range TLB. As stated before, IDs are unique in the range TLB, so the processor obtains exactly one DELTA when the ID hits. The processor then adds DELTA to the generated virtual address while comparing that address against BASE and LIMIT; if the check passes, the sum of DELTA and the virtual address is the correct physical address. Moreover, if an address passes the boundary check of a dummy entry, it is sent to the page TLB for translation. A software sketch of this dispatch follows the figure.
    Fig. 6. Hardware overview.
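    The following sketch restates the dispatch in software (the helper functions and the rtlb_entry_t layout are our assumptions; in hardware these steps overlap with address generation and the cache access):

        #include <stdbool.h>
        #include <stdint.h>

        typedef struct {
            uint16_t id;
            uint64_t base, limit;  /* virtual bounds of the cached (sub-)range */
            int64_t  delta;        /* physical - virtual */
            bool     dummy;        /* Du: fall back to the page TLB */
        } rtlb_entry_t;

        /* Assumed helpers: CAM lookup by ID, the conventional page TLB
         * path, and the range table walk of Section 3.5. */
        extern rtlb_entry_t *rtlb_lookup(uint16_t id);
        extern uint64_t page_tlb_translate(uint64_t va);
        extern rtlb_entry_t *range_table_walk(uint16_t id, uint64_t va);

        uint64_t translate(uint64_t ptr)
        {
            uint16_t id = (uint16_t)((ptr >> 48) & 0x1FFF);        /* pointer tag */
            uint64_t va = (uint64_t)((int64_t)(ptr << 16) >> 16);  /* sign-extend bit 47 */

            if (id == 0x0 || id == 0x1FFF)        /* untagged pointer */
                return page_tlb_translate(va);

            rtlb_entry_t *e = rtlb_lookup(id);    /* one equality test */
            if (e == NULL || va < e->base || va >= e->limit)
                e = range_table_walk(id, va);     /* RTLB miss */
            if (e->dummy)                         /* dummy sub-range */
                return page_tlb_translate(va);
            return (uint64_t)((int64_t)va + e->delta);
        }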
    A range TLB miss happens in two cases: (i) there is no matching ID in the range TLB, or (ii) the boundary check fails. The processor handles a miss by starting a range table walk to fetch the corresponding range translation. To keep IDs in the range TLB unique, a miss caused by a sub-range mismatch updates the entry where the previous sub-range lies instead of inserting a new entry.
    During a range table walk, we first compute the translation entry address by adding \((ID \lt \lt 5)\) to the base address of the range table (a range table entry takes 32 bytes, as shown in Figure 4). The range table base address is part of the program context and is stored in a dedicated register, analogous to the page table base in CR3. If the address being translated falls in \([BASE, LIMIT)\), the entry is the one we need and is fetched into the range TLB, whether it is a dummy entry or not; for a dummy entry, we additionally query the page table for the correct translation and insert it into the page TLB. If the address falls outside \([BASE, LIMIT)\), we check the L bit to see whether the current entry is the last sub-range entry. If not, we follow NRID to the next sub-range entry and repeat the above procedure; if it is the last one, there must be a safety violation. The walk is sketched below.
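    A software rendering of this walk, reusing the range_entry_t sketch from Section 3.4 (rt_base stands in for the dedicated base register and is an assumed name):

        #include <stdint.h>

        extern range_entry_t *rt_base;   /* range table base, analogous to CR3 */
        extern void raise_fault(void);

        /* Follow the sub-range chain for (id, vpn); entries are 32 bytes,
         * so entry i lives at rt_base + (i << 5). vpn = va >> 12. */
        range_entry_t *range_walk(uint16_t id, uint64_t vpn)
        {
            range_entry_t *e =
                (range_entry_t *)((uint8_t *)rt_base + ((uint64_t)id << 5));
            for (;;) {
                if (vpn >= e->base_vpn && vpn < e->limit_vpn)
                    return e;            /* hit, dummy or not */
                if (e->last) {           /* end of chain: illegal address */
                    raise_fault();
                    return NULL;
                }
                e = (range_entry_t *)((uint8_t *)rt_base + ((uint64_t)e->nrid << 5));
            }
        }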

    4 Discussions

    In this section, we discuss some of the hardware and software issues that should be considered carefully in a production implementation. We leave the detailed quantitative analysis of these issues for future work.

    4.1 Range ID Shortage

    Although current benchmarks allocate only a few hundred ranges, future programs may require many more range IDs. In general, we want to associate the ID with an entity in high-level program semantics so that it can pass more software information to the hardware, so it is better to avoid sharing IDs; ID uniqueness also simplifies the design of the range TLB and range table. Under an ID shortage, we can either use the remaining two bits to raise the ID limit, or let the OS dynamically adjust the allocation threshold when the number of available IDs falls below a certain watermark. We also provide an alternative pointer format in Section 5.2 that breaks the ID limit by using the position of a range as part of its ID, at the cost of complicating the range table. Moreover, FlexPointer is an enhancement to the traditional page table rather than a replacement, so on an ID shortage, the OS can choose to use only the page table.

    4.2 Accessed and Dirty Bits

    Besides translating virtual addresses, the TLB in x86 processors is also responsible for updating the usage information of pages. An x86 PTE has an accessed bit and a dirty bit: the TLB sets the accessed bit on the first access to a page and the dirty bit on the first write. However, the range TLB groups multiple pages into one range translation entry, so it is difficult to determine which page's accessed/dirty bit to set on a hit. One solution is to maintain only range-level accessed/dirty bits. This is reasonable because in FlexPointer a range is created from a memory object, so from the perspective of program semantics, the pages in a range can be treated as a whole. Although coarse-grained page usage information may add overhead for swapping, studies show that performance-critical big-memory applications tend to do little or no swapping [6, 36]. Also, since we in fact free up many L2 page TLB entries, we can reuse them to record finer-grained information, such as compacting the metadata of several pages into a bit vector as PRISM [4] does.

    4.3 Copy-On-Write

    Copy-on-write is a virtual memory optimization for sharing pages between processes. With this optimization, the OS duplicates a page when one of the processes writes to it, ensuring that modifications are visible only to the process initiating the write. Conventional systems implement this mechanism by setting protection bits in the PTEs of shared pages, so a write to such a page triggers a fault and the OS duplicates it. In FlexPointer, we can use the dummy entry to support this mechanism: upon the first write, we dissolve the part of the range containing the written page and copy that page. We then create a dummy sub-range for the dissolved part and update the PTEs of its pages so that future accesses to it are directed to the page table. In this way, we retain the translation efficiency of the remaining part of the range while keeping the flexibility of copy-on-write.

    4.4 Memory Initialization Cost

    Linux requires physical pages to be initialized when allocated, so eager paging increases the latency of an allocation request: the application must wait for all allocated pages to be initialized. Previous studies [6, 31] show that big-memory applications tend to allocate most of their memory at an early execution stage and that those allocations generally have long lifetimes, so the initialization overhead can be amortized. To further reduce it, we can adopt the reservation-based paging strategy used in FreeBSD [42] and other proposals [24, 44, 56], using the dummy entry to represent reserved pages and promoting them when needed. Another way to ease the impact of initialization is a dedicated background thread that initializes free pages [45].

    4.5 Fragmentation

    In this work, we follow RMM's assumption that physical memory contiguity is abundant. But a long-running server generally hosts multiple processes and a variety of workload mixes, so physical memory may become fragmented over time. The chance of finding a suitable contiguous physical region drops as fragmentation increases, so the efficiency of FlexPointer is limited in a fragmented environment.
    Technically, the OS should fall back to the default paging strategy if it cannot find a sufficiently large region of free physical memory. But with our sub-range mechanism, we can tolerate fragmentation to a certain degree: if there is not enough physical contiguity, we can split a range request into several smaller sub-ranges. Note, however, that too many sub-ranges both lower the range TLB hit rate and slow the range table walk, so we suggest setting a limit on the sub-range count and treating a request that exceeds it as a signal for memory compaction. Moreover, many OS proposals [3, 39, 45, 62] can increase physical contiguity and ease the impact of memory fragmentation, raising the chance of finding a suitable physical region.

    4.6 L2 Range TLB

    As stated before, when an application creates too many sub-ranges, the performance of FlexPointer suffers from frequent range table walks. Although current benchmarks rarely create sub-ranges, future programs may present a different memory usage pattern. On the other hand, when using the sub-range mechanism to tolerate memory fragmentation, a program may split ranges into too many small parts in a heavily fragmented environment. To avoid such performance degradation, we can add a second-level range TLB to hold sub-ranges sharing the same ID. As IDs are no longer unique in the L2 range TLB, we have to do the boundary check first to distinguish which sub-range to use before the translation, which complicates the L2 range TLB structure. In this work, we focus on an environment where physical memory contiguity is abundant. Also, our results in Section 6.2 show that a small L1 range TLB satisfies most benchmarks, so adding an L2 range TLB would provide little benefit in this work.

    5 Limitations and Suggestions

    5.1 Apply with Other Tagged-Pointer Systems

    In this article, we use the unused high bits of a pointer to accelerate address translation, but some systems use these bits for other purposes. For example, the ARM Memory Tagging Extension (MTE) stores an address tag in the top of a pointer to detect memory safety violations [26]. Combining FlexPointer with other tagged-pointer-based techniques is tricky because the different tags must not conflict. Since we currently provide far more range IDs than an application needs, we can compress the range ID to make room for other tags. Moreover, most tagged-pointer systems use only part of the unused bits, so we can add selection bits to the tag to switch between schemes flexibly. In general, though, combining FlexPointer with another tagged-pointer-based technique is a question that must be discussed case by case.

    5.2 5-Level Paging

    Our current pointer format targets a 64-bit system with a 48-bit virtual address space, but future architectures may adopt a larger one [29, 54]. Intel began supporting 5-level paging with Ice Lake, and AMD will also add the feature in future CPUs. With 5-level paging enabled, a virtual address grows to 57 bits; with the most significant bit reserved for distinguishing kernel and user pointers, only 6 unused bits remain. Moreover, a program that requires a 57-bit address space may present a different memory usage pattern, such as allocating more large objects. Considering these problems, we provide an alternative encoding format in Figure 7.
    Fig. 7. The alternative pointer format.
    This alternative format is inspired by FRAMER [43] and Cryptographic Capability Computing (\(C^3\)) [40]. FRAMER defines frames as memory blocks that are \(2^n\)-sized and aligned on their size; the wrapper frame of a memory object x is the smallest frame that completely contains x. FRAMER proves that a given wrapper frame can contain at most one object [43], so we can use it as an alternative range ID. A frame is represented by its start address and the binary logarithm of its size. The Anchor in Figure 8 is a \(2^{N+2}\)-aligned address: the wrapper frame of range A starts at Anchor and ends at Anchor+\(2^{N+1}\), represented as (Anchor, N+1), and the wrapper frame of range B is likewise represented as (Anchor, N+2). We use N+1 and N+2 as the pointer tags for addresses in ranges A and B, respectively. To translate an address tagged with N, the processor first clears the lowest N bits to obtain the wrapper frame start address and then combines it with N to form an ID. As long as the access is in bounds, this can be done in parallel with address generation.
    Fig. 8. Using wrapper frame to index range entries.
    By using the position of a memory range as part of its ID, this format removes the limit on the number of IDs. It requires 6 bits to store the tag and puts no extra constraints on the allocation process; we again exclude 0x0 and 0x3F to distinguish tagged and untagged pointers. With the alternative format, the range table structure must also change: the new ID is much longer than 13 bits, so organizing the range table as a simple array may take up too much space. A hash table may be a more suitable choice, although the specific hash function implementation also affects efficiency. A sketch of the ID computation follows.
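    The ID computation under this format is a short bit manipulation (a sketch; we assume a 57-bit VA with the 6-bit tag in bits 62..57):

        #include <stdint.h>

        /* Recover the wrapper-frame ID (frame start, n) from a tagged
         * pointer. Ranges exceed the 64 KB threshold, so n is well above
         * 6 and the low 6 bits of the frame start can carry n itself. */
        static inline uint64_t wrapper_frame_id(uint64_t ptr)
        {
            uint64_t n     = (ptr >> 57) & 0x3F;         /* 6-bit tag */
            uint64_t va    = ptr & ((1ULL << 57) - 1);   /* 57-bit address */
            uint64_t start = va & ~((1ULL << n) - 1);    /* clear low n bits */
            return start | n;
        }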
    In this article, our work focuses on the format with a 13-bit ID. The main reason is that current benchmarks have no hard requirement for 5-level paging, so we think it improper to speculate about program behavior under 5-level paging, especially memory usage patterns, from current benchmarks. As a result, we leave related research for future work and provide this format only as a possible solution for 5-level paging. The remaining parts of this article are all based on the format in Section 3.3.

    6 Evaluations

    To evaluate the performance of FlexPointer, we use a partial simulation method. First, we build a trace-based functional TLB simulator to evaluate how FlexPointer affects TLB hit/miss rates. Then, we use Mosmodel [1] to build performance models on a real machine. Mosmodel implements a special allocator that adjusts the memory layout using different combinations of 2 MB and 4 KB pages; it records performance counter measurements under multiple memory layouts and generates a performance model for an application from these data.
    Table 2 shows our system configurations. FlexPointer is designed mainly for big-memory applications and can improve TLB performance for a variety of programs. We select benchmarks with poor TLB performance and/or large memory footprints from SPEC 2017, Graph 500, GUPS, XSBench, and the NAS Parallel Benchmarks. Table 3 summarizes the benchmarks we use.
    Table 2.
    CPU           Intel E5-2650 v4 @ 2.20 GHz
    Memory        64 GB, DDR4, 2133 MHz
    L1 Page TLB   64 entries, 4-way associative
    L2 Page TLB   1536 entries, 6-way associative
    Range TLB     16 entries, fully associative
    Table 2. System Configurations
    Table 3.
    Suite                     Input        Memory
    SPEC 2017                 gcc          7.5 GB
                              mcf          3.9 GB
                              cactuBSSN    6.6 GB
                              omnetpp      241.3 MB
                              xalancbmk    479.4 MB
                              deepsjeng    6.7 GB
                              roms         10.1 GB
                              xz           15.0 GB
    NAS Parallel Benchmarks   NPB:CG       19.1 GB
                              NPB:MG       3.4 GB
    TLB intensive             graph500     3.3 GB
                              XSBench      5.5 GB
                              gups         8.0 GB
    Table 3. Benchmark Memory Footprint

    6.1 Internal Fragmentations of Eager Paging

    The eager paging strategy allocates physical memory right after the creation of a virtual memory area. But if a program does not access all of the allocated virtual pages, the untouched physical pages are wasted, causing internal fragmentation. We tracked and examined the target address of every load/store instruction to find the pages actually used within memory ranges. When a page is accessed for the first time, we count it as an allocated page, because from then on it must be backed by a physical page.
    Table 4 shows the total number of pages requested by memory ranges and the number of pages actually allocated. Most of the requested pages are actually accessed, so eager paging should not cause serious internal fragmentation. Here, we take the percentage of pages never allocated as the internal fragmentation; as the table shows, the eager paging strategy leads to less than 10% internal fragmentation for most of the benchmarks.
    Table 4.
                 #page            #page            #page   #page   fragmentation  fragmentation  fragmentation
                 (requested, 4K)  (allocated, 4K)  (2M)    (1G)    (eager)        (2M)           (1G)
    gcc          19,720           16,535           39      1       16.15%         17.19%         93.69%
    xalancbmk    4,029            3,485            8       1       13.49%         14.92%         98.67%
    omnetpp      3,536            3,487            7       1       1.39%          2.71%          98.67%
    roms         2,655,259        2,589,659        5,187   11      2.47%          2.49%          10.19%
    xz           3,927,205        3,918,611        7,671   15      0.22%          0.23%          0.34%
    mcf          1,261,274        1,171,501        2,464   5       7.12%          7.14%          10.62%
    cactuBSSN    1,719,447        1,719,447        3,359   7       0%             0.02%          6.30%
    deepsjeng    1,757,813        1,757,813        3,434   7       0%             0.02%          4.21%
    graph500     936,477          853,514          1,830   4       8.86%          8.91%          18.60%
    XSBench      1,453,819        1,453,819        2,840   6       0%             0.02%          7.57%
    gups         2,097,153        2,097,153        4,097   9       0%             0.02%          11.11%
    NPB:CG       5,226,819        4,999,104        10,209  20      4.36%          4.36%          4.65%
    NPB:MG       879,968          876,814          1,719   4       0.36%          0.38%          16.38%
    Table 4. Numbers of Requested and Allocated Pages of Memory Ranges
    2 MB/1 GB page counts are estimated by rounding up the requested size.
    We also compare with the internal fragmentation of adopting 2 MB/1 GB huge pages. Currently, using 1 GB pages in a program relies on the hugetlbfs [15] mechanism, which requires source code modifications and has the program manage some of the memory on its own. Here, we assume hugetlbfs is used and that the program can perfectly compact all the ranges into 2 MB/1 GB huge pages; we then estimate the number of huge pages by rounding up the sum of range sizes. The fragmentation results in Table 4 show that the eager paging strategy causes far less internal fragmentation than backing memory ranges with 1 GB pages. Adopting 2 MB huge pages uses a similar amount of memory to eager paging, but requires many more TLB entries.

    6.2 Reduction of TLB Misses and Page Walks

    We evaluate the TLB hit/miss rates of FlexPointer with a functional TLB simulator that models the L1/L2 page TLBs and the page table walker, using 4 KB pages. We use Simpoint [47] to locate simulation points and extract load/store instructions with PIN [41]. All simulation points consist of 1 billion instructions. We also validated the MPKI of the trace (baseline configuration enabling only page TLBs) against performance counter measurements of native executions to ensure that the sampled regions are representative; the coverage of simulation points is over 90% on every benchmark. A minimal model of the range TLB bookkeeping in such a simulator is sketched below.
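    For reference, the range TLB side of a trace-driven simulator can be as small as the following sketch (a fully-associative, LRU model with the 16-entry configuration of Table 2; the names and structure are ours, not the paper's code):

        #include <stdbool.h>
        #include <stdint.h>

        #define RTLB_ENTRIES 16

        typedef struct {
            uint16_t id;
            uint64_t base, limit;   /* virtual bounds of the cached sub-range */
            uint64_t delta;
            uint64_t last_use;
            bool     valid;
        } rtlb_line_t;

        static rtlb_line_t rtlb[RTLB_ENTRIES];
        static uint64_t    tick;

        /* Hit iff the ID matches and the boundary check passes; a failed
         * boundary check counts as a range TLB (sub-range) miss. */
        bool rtlb_access(uint16_t id, uint64_t va)
        {
            tick++;
            for (int i = 0; i < RTLB_ENTRIES; i++) {
                if (rtlb[i].valid && rtlb[i].id == id) {
                    rtlb[i].last_use = tick;
                    return va >= rtlb[i].base && va < rtlb[i].limit;
                }
            }
            return false;   /* caller walks the range table and refills */
        }

        /* Refill after a range table walk; a sub-range switch overwrites
         * the entry with the same ID so IDs stay unique, otherwise pick
         * an invalid line or evict by LRU. */
        void rtlb_fill(uint16_t id, uint64_t base, uint64_t limit, uint64_t delta)
        {
            int victim = -1;
            for (int i = 0; i < RTLB_ENTRIES; i++) {
                if (rtlb[i].valid && rtlb[i].id == id) { victim = i; break; }
                if (victim < 0 || !rtlb[i].valid ||
                    (rtlb[victim].valid && rtlb[i].last_use < rtlb[victim].last_use))
                    victim = i;
            }
            rtlb[victim] = (rtlb_line_t){ id, base, limit, delta, ++tick, true };
        }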
    According to the results from previous studies [13, 31] and our evaluations in Table 1, we set the large object threshold to 64 KB to constrain the number of ranges. We implement the malloc() wrapper function discussed in Section 3.2 and replace the original malloc() with it. Then we use PIN to catch all calls to mmap(), mremap(), and munmap() as hints for range creation and reclamation; these calls are recorded both inside and outside simulation points to fully reflect range creation and reclamation.
    We also implemented RMM [33] and CoLT [49] in our simulator. For CoLT, we assume an ideal situation where every TLB entry covers eight contiguous physical pages. We let FlexPointer and RMM create ranges according to the same threshold. For FlexPointer, all tagged addresses go to the range TLB while the others go to the page TLB; for RMM, the range TLB is searched only after an L1 TLB miss. In FlexPointer, a page walk happens after either an L2 page TLB miss or a range TLB miss; in RMM, a page walk happens only after both the L2 page TLB and the range TLB miss. Misses caused by sub-ranges (generated by remapping) are counted as range TLB misses.
    Figures 9 and 10 show the reduction in L1 TLB misses and page table walks, respectively. RMM searches the range TLB only after an L1 TLB miss, so it cannot reduce L1 TLB misses. For most benchmarks, FlexPointer removes 70–99+% of L1 TLB misses and page walks. Table 5 shows how FlexPointer and RMM affect page TLB miss rates. FlexPointer mainly reduces the L1 TLB miss rate; for some benchmarks, the L2 miss rate instead increases, because the decrease in L1 TLB misses shrinks the total number of L2 TLB accesses while some untagged addresses still miss in both the L1 and L2 page TLBs.
    Fig. 9. L1 TLB miss number reduction; RMM is not included because it activates only after an L1 TLB miss.
    Fig. 10. Page table walk number reduction.
    Table 5.
                 L1 miss rate             L2 miss rate
                 baseline    Flex         baseline    Flex        RMM
    gcc          0.117%      0.117%       71.344%     71.335%     71.307%
    xalancbmk    1.246%      1.217%       87.498%     89.276%     86.532%
    omnetpp      4.348%      3.951%       50.632%     54.237%     48.966%
    roms         4.934%      2.066%       85.206%     70.865%     30.539%
    xz           1.654%      0.289%       63.666%     44.877%     19.216%
    mcf          2.486%      <0.001%      77.721%     <0.001%     8.501%
    cactuBSSN    9.371%      <0.001%      76.739%     74.419%     6.584%
    deepsjeng    0.084%      0.004%       72.655%     23.493%     15.161%
    graph500     2.617%      <0.001%      90.645%     99.23%      1.307%
    XSBench      5.569%      <0.001%      97.426%     <0.001%     2.830%
    gups         25.060%     <0.001%      99.993%     <0.001%     0.243%
    NPB-CG       32.999%     <0.001%      97.670%     <0.001%     0.217%
    NPB-MG       0.123%      <0.001%      95.253%     <0.001%     3.305%
    Table 5. Page TLB Miss Rates (RMM only affects the L2 miss rate)
    For gcc, xalancbmk, and omnetpp, neither FlexPointer nor RMM achieves a good reduction in TLB misses. Referring to the coverage results in Table 1, we find that for FlexPointer and RMM, the coverage of large objects is consistent with the TLB miss and page walk reductions. We then examined the allocation patterns of these benchmarks and found that they allocate many more objects than the others, and most of those objects are below the threshold. These three benchmarks therefore benefit little from FlexPointer and RMM because the memory coverage of ranges is low. Note, however, that although the reduction in TLB misses and page walks is low, FlexPointer does not hurt the performance of these benchmarks.
    On the other hand, CoLT can remove many of the L1 TLB misses and page table walks for gcc, xalancbmk, and omnetpp, but not as many on the other benchmarks. Because CoLT increases the coverage of a TLB entry at a relatively small scale, its efficiency on benchmarks using only a few very large objects is limited. TLB coalescing techniques like CoLT modify only the page TLB, so they are orthogonal to FlexPointer and can be combined with it to improve performance for applications with many small objects.
    We also note that FlexPointer reduces more page walks than RMM on all benchmarks except cactuBSSN. Suspecting that cactuBSSN suffers from range TLB thrashing, we varied the range TLB capacity and tested again. As Figure 11 shows, the number of L1 range TLB misses drops drastically when the capacity increases from 50 to 51 entries, so we conclude that cactuBSSN needs a larger range TLB to avoid thrashing. Even with this thrashing, FlexPointer still removes over 70% of its page walks.
    Fig. 11. Range TLB miss number of cactuBSSN under different capacities.
    One possible problem with adding a range TLB is that it may instead introduce many range table walks if its hit rate is poor. We investigated range TLB hit rates with a 16-entry range TLB configuration. As shown in Table 6, FlexPointer induces almost no range TLB misses on any benchmark; even a benchmark experiencing thrashing like cactuBSSN achieves a hit rate over 93%. Even a small range TLB thus provides very high hit rates, so range TLB misses will not become a bottleneck for FlexPointer. Table 6 also shows that a 16-entry range TLB is sufficient for most benchmarks.
    Table 6.
    Benchmark    Hit Rate     Benchmark    Hit Rate
    gcc          99.97%       deepsjeng    100%
    xalancbmk    100%         graph500     100%
    omnetpp      100%         XSBench      100%
    roms         100%         gups         100%
    xz           100%         NPB:CG       100%
    mcf          100%         NPB:MG       100%
    cactuBSSN    93.17%
    Table 6. Range TLB Hit Rates with a 16-Entry Range TLB

    6.3 Performance

    The overall performance is evaluated with Mosmodel [1]. Mosmodel implements a special memory allocator to generate different page size mixes and produces a performance model from performance counter measurements under those mixes. Denoting L2 TLB hits, L2 TLB misses, and page walk cycles as \(H\), \(M\), and \(C\), respectively, Mosmodel is a third-degree polynomial model:
    \(R(H,M,C) = \beta + \alpha_0 C + \alpha_1 M + \alpha_2 H + \alpha_3 C^2 + \alpha_4 CM + \alpha_5 CH + \cdots\)
    We first execute each benchmark to completion and use perf to collect the original performance counter data. Then, we scale the three parameters of each benchmark proportionally according to the functional simulation results and feed them into the Mosmodel performance model to predict execution cycles.
    For FlexPointer, the data for H and M come only from the page TLB because our range TLB works before the L2 page TLB, and we count range TLB misses as page walks. In this article, we assume a system with a low fragmentation level, so a range table walk completes with its first access, which means it cannot be more costly than a default page walk. We denote the L2 page TLB hits and misses in the FlexPointer simulation as \(PHit_f\) and \(PMiss_f\), the range TLB misses as \(RMiss_f\), and the L2 TLB hits and misses of the baseline simulation (page TLB only) as \(PHit\) and \(PMiss\). The adjusted performance model parameters, denoted \(C_p\), \(H_p\), and \(M_p\), are computed as follows:
    \(\begin{aligned} C_p &= \frac{RMiss_f + PMiss_f}{PMiss} \times C \\ H_p &= \frac{PHit_f}{PHit} \times H \\ M_p &= \frac{PMiss_f}{PMiss} \times M \end{aligned}\)
    For RMM, we view the range TLB as a means of reducing page walks. When an address hits in the L2 page TLB, we ignore the range TLB lookup result, so the number of L2 page TLB hits equals the baseline. For an L2 page TLB miss that hits in the range TLB, we treat it as a miss canceled by the range TLB and subtract it from the final miss count. Only an address that misses in both the L2 page TLB and the range TLB is counted as an L2 page TLB miss, denoted \(PMiss_r\). We then adjust the parameters as follows:
    \(\begin{aligned} C_p &= \frac{PMiss_r}{PMiss} \times C \\ H_p &= H \\ M_p &= \frac{PMiss_r}{PMiss} \times M \end{aligned}\)
    As for CoLT, since it only modifies the page TLB, the parameter adjustment is straightforward. Denoting the L2 page TLB hits and misses of CoLT as \(PHit_c\) and \(PMiss_c\), respectively, we adjust its parameters as follows:
    \(\begin{aligned} C_p &= \frac{PMiss_c}{PMiss} \times C \\ H_p &= \frac{PHit_c}{PHit} \times H \\ M_p &= \frac{PMiss_c}{PMiss} \times M \end{aligned}\)
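    The three adjustment rules map directly to code. The sketch below restates them verbatim; the struct and function names are hypothetical helpers for illustration, with counters named after the notation above:

        /* Adjusted Mosmodel parameters (C_p, H_p, M_p), computed from
         * the baseline counters (C, H, M) and the functional simulation
         * results, exactly as in the formulas above. */
        typedef struct { double C, H, M; } Params;

        /* FlexPointer: range TLB misses are charged as page walks. */
        Params adjust_flexpointer(Params base, double phit_f, double pmiss_f,
                                  double rmiss_f, double phit, double pmiss) {
            Params p;
            p.C = (rmiss_f + pmiss_f) / pmiss * base.C;
            p.H = phit_f / phit * base.H;
            p.M = pmiss_f / pmiss * base.M;
            return p;
        }

        /* RMM: range TLB hits cancel L2 page TLB misses; hits are unchanged. */
        Params adjust_rmm(Params base, double pmiss_r, double pmiss) {
            Params p = { pmiss_r / pmiss * base.C, base.H,
                         pmiss_r / pmiss * base.M };
            return p;
        }

        /* CoLT: only the page TLB changes, so both ratios come from it. */
        Params adjust_colt(Params base, double phit_c, double pmiss_c,
                           double phit, double pmiss) {
            Params p = { pmiss_c / pmiss * base.C, phit_c / phit * base.H,
                         pmiss_c / pmiss * base.M };
            return p;
        }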
    Figure 12 presents the speedups of FlexPointer, RMM, and CoLT, all normalized to the baseline system. We also include THP speedups as a reference, because the internal fragmentation of 2 MB huge pages is close to that of the eager paging strategy. THP speedups are obtained by comparing the elapsed time with and without THP enabled on the real machine.
    Fig. 12.
    Fig. 12. Speedup predictions. THP results are from the real machine.
    Among all benchmarks, gups benefits the most from FlexPointer and RMM: FlexPointer delivers a 180% performance improvement and RMM a 150% improvement. Because gups allocates one large area and accesses it randomly, its memory usage pattern is the best fit for range translation. CoLT, by contrast, improves gups little because of the limited coverage of its TLB entries and the benchmark's poor memory locality.
    For most benchmarks, FlexPointer outperforms RMM and CoLT. Some, like xz, graph500, XSBench, and NPB:CG, are accelerated by 10–20%; the improvement for the remaining benchmarks is less pronounced, but most still gain around 5%. For benchmarks with high range coverage, FlexPointer offers a higher speedup than THP, except for XSBench and NPB:MG; one possible reason is that THP maps the stack with huge pages, whereas FlexPointer works only on the heap.
    We also repeated the experiments on the remaining SPEC 2017 benchmarks to check whether FlexPointer causes performance degradation. Among them, exchange2, povray, perlbench, and nab allocate few or no memory ranges and show results similar to gcc, xalancbmk, and omnetpp. Others, like namd, parest, blender, wrf, and leela, achieve higher range coverage (around 40%). The remaining benchmarks (bwaves, lbm, x264, pop2, imagick, and fotonik3d) allocate even more large objects (range coverage over 80%), but they already have high TLB hit rates with 4 KB pages, so they gain only around 1–4%. None of these benchmarks experiences performance degradation.
    Overall, FlexPointer offers an average performance improvement of around 14% (5.9% excluding gups), while RMM and CoLT offer average improvements of 11% (4.7% excluding gups) and 3%, respectively. For comparison, THP offers a 12% (6.0% excluding gups) average improvement.

    7 Related Work

    Virtual memory remains an active area of research. Prior studies have shown that limited TLB reach is becoming a performance bottleneck because it induces excessive page walks [6, 9, 22, 24, 32, 33, 34], and they have proposed various solutions to this problem.
    Among the TLB-reach-increasing techniques discussed in Section 2.2, the ones most closely related to our work are segment-like techniques. Direct Segment [6] allows a program to create only a single segment for allocating large objects, and thus lacks flexibility; it also requires source-code modifications to create the segment manually and to manage its memory. RMM [33] is transparent to programmers and supports multiple memory ranges, but the complexity of range indexing restricts its potential. FlexPointer is also transparent to programmers and tackles the range indexing problem; Section 6 shows that it achieves better performance than RMM.
    TLB coalescing techniques passively exploit the contiguity of physical pages that the OS generates naturally [46, 48, 49, 56]. Restricted by the OS allocation strategy and a limited search window, they pack only a small number of translations (e.g., 8–16) per TLB entry, so their benefit for large working sets is limited. FlexPointer instead adopts the eager paging strategy, and our ranges have no size limit, so we can exploit memory contiguity in large working sets far better.
    Conventional huge pages [14, 15] suffer from internal fragmentation. Intel Itanium [12] supports more page sizes, but they require support from a software-defined TLB. TPS [24] provides an elegant way of adding huge page sizes, but it faces a tradeoff between the number of PTEs and internal fragmentation: to represent an allocation with one entry, TPS must round the request size up to the nearest power of two, inducing internal fragmentation; alternatively, TPS can satisfy an allocation with multiple tailored huge pages to reduce internal fragmentation, but this lowers the average coverage of a single TLB entry. FlexPointer, in contrast, can represent a large object with one range entry in a lightly fragmented system. Moreover, according to the evaluation in Section 6.1, applying eager paging to large objects does not induce serious internal fragmentation.
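    To make the single-entry tradeoff concrete, the snippet below (our illustration, not code from TPS) rounds a request up to the next power of two and reports the resulting internal fragmentation; for example, a 5 GB request becomes an 8 GB mapping, wasting 3 GB:

        #include <stdint.h>

        /* Round a request size up to the next power of two, as in the
         * single-entry TPS case described above. */
        uint64_t round_up_pow2(uint64_t size) {
            uint64_t p = 1;
            while (p < size)
                p <<= 1;
            return p;
        }

        /* Internal fragmentation: the rounded size minus the request. */
        uint64_t internal_frag(uint64_t request) {
            return round_up_pow2(request) - request;
        }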
    Apart from increasing the TLB reach, various proposals seek to reduce the number of TLB misses. For example, prefetching PTEs into the TLB in advance can increase the TLB hit rate [9, 32]. However, the effectiveness of prefetching depends heavily on the memory access pattern: when an application accesses memory randomly, prefetching can hardly help.
    Another common way to mitigate limited TLB reach is to reduce the cost of a TLB miss rather than the number of misses. MMU caches [5, 8, 16, 17] cache higher levels of the page table, skipping one or more memory accesses to shorten the page walk latency. Commodity processors also cache PTEs in the data cache [16]. POM-TLB [51] builds a third-level TLB in memory and caches its entries in the L2 data cache to reduce page walk costs. FlexPointer is orthogonal to these techniques: we focus on reducing, or even eliminating, TLB misses, but can still use them to accelerate page and range walks when needed.
    Prior work also proposes exploiting the relatively large capacity of the data cache hierarchy to reduce the number of address translations. Such techniques adopt virtual caches and translate an address only after a cache miss [7, 23, 25, 60]. But a workload with poor locality can still suffer many translations: a virtual cache only shifts the inevitable translations to a lower level of the cache hierarchy, and it also requires complex system modifications to handle synonym problems.

    8 Summary

    In this article, we propose FlexPointer, a tagged-pointer-based technique for increasing the TLB reach. FlexPointer tackles the range indexing problem by passing range IDs to hardware via unused bits of the pointer. With range IDs, we can organize the range TLB as a traditional cache, removing many comparison units from the critical path. This simplified range TLB works in parallel with address generation, reducing not only page walks but also L1 TLB misses. According to our trace-based simulations, FlexPointer eliminates nearly all L1 TLB misses and page walks for a variety of memory-intensive workloads, providing a 2.8x speedup in the best case.
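    As a recap of the encoding in code form, the sketch below assumes a 48-bit virtual address space with the range ID carried in the 16 unused upper bits; the field width and position are illustrative assumptions, and address generation uses only the low address bits, leaving the ID untouched:

        #include <stdint.h>

        /* A minimal sketch of the tagged-pointer encoding, assuming the
         * range ID occupies the upper 16 bits left unused by 48-bit
         * virtual addressing. The exact field layout is our assumption. */
        #define TAG_SHIFT 48
        #define ADDR_MASK ((1ULL << TAG_SHIFT) - 1)

        uint64_t tag_pointer(uint64_t addr, uint16_t range_id) {
            return (addr & ADDR_MASK) | ((uint64_t)range_id << TAG_SHIFT);
        }

        uint16_t range_id_of(uint64_t ptr) {
            return (uint16_t)(ptr >> TAG_SHIFT); /* indexes the range TLB */
        }

        uint64_t strip_tag(uint64_t ptr) {
            return ptr & ADDR_MASK; /* address bits used by address generation */
        }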

    References

    [1]
    Mohammad Agbarya, Idan Yaniv, Jayneel Gandhi, and Dan Tsafrir. 2020. Predicting execution times with partial simulations in virtual memory research: Why and how. In Proceedings of the MICRO’20. 456–470.
    [2]
    Hanna Alam, Tianhao Zhang, Mattan Erez, and Yoav Etsion. 2017. Do-It-Yourself virtual memory translation. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA’17). Association for Computing Machinery, New York, NY, 457–468.
    [3]
    Chloe Alverti, Stratos Psomadakis, Vasileios Karakostas, Jayneel Gandhi, Konstantinos Nikas, Georgios Goumas, and Nectarios Koziris. 2020. Enhancing and exploiting contiguity for fast memory virtualization. In Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA’20). IEEE Press, 515–528.
    [4]
    Rachata Ausavarungnirun, Timothy Merrifield, Jayneel Gandhi, and Christopher J. Rossbach. 2020. PRISM: Architectural support for variable-granularity memory metadata. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (PACT’20). Association for Computing Machinery, New York, NY, 441–454.
    [5]
    Thomas W. Barr, Alan L. Cox, and Scott Rixner. 2010. Translation caching: Skip, don’t walk (the page table). In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA’10). Association for Computing Machinery, New York, NY, 48–59.
    [6]
    Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Michael M. Swift. 2013. Efficient virtual memory for big memory servers. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA’13). Association for Computing Machinery, New York, NY, 237–248.
    [7]
    Arkaprava Basu, Mark D. Hill, and Michael M. Swift. 2012. Reducing memory reference energy with opportunistic virtual caching. In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA’12). IEEE Computer Society, 297–308.
    [8]
    Abhishek Bhattacharjee. 2013. Large-Reach memory management unit caches. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). Association for Computing Machinery, New York, NY, 383–394.
    [9]
    Abhishek Bhattacharjee and Margaret Martonosi. 2009. Characterizing the TLB behavior of emerging parallel workloads on chip multiprocessors. In Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques. 29–40.
    [10]
    Nathan Burow, Derrick McKee, Scott A. Carr, and Mathias Payer. 2018. CUP: Comprehensive user-space protection for C/C++. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security (ASIACCS’18). Association for Computing Machinery, New York, NY, 381–392.
    [11]
    Nicholas P. Carter, Stephen W. Keckler, and William J. Dally. 1994. Hardware support for fast capability-based addressing. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI). Association for Computing Machinery, New York, NY, 319–327.
    [12]
    Matthew Chapman, Ian Wienand, and Gernot Heiser. 2003. Itanium page tables and TLB. Technical Report, University of New South Wales, School of Computer Science and Engineering.
    [13]
    Dongwei Chen, Dong Tong, Chun Yang, and Xu Cheng. 2021. MetaTableLite: An efficient metadata management scheme for tagged-pointer-based spatial safety. In Proceedings of the 2021 IEEE 39th International Conference on Computer Design (ICCD). 208–211.
    [14]
    Jonathan Corbet. 2009. Transparent Hugepages. Retrieved from https://lwn.net/Articles/359158/.
    [15]
    Jonathan Corbet. 2011. Transparent huge pages in 2.6.38. Retrieved from https://lwn.net/Articles/423584/.
    [16]
    Intel Corporation. 2008. Intel 64 and IA-32 architectures optimization reference manual.
    [17]
    Intel Corporation. 2008. TLBs, paging-structure caches and their invalidation.
    [18]
    Peter J. Denning. 1970. Virtual memory. ACM Computing Surveys 2, 3 (Sep. 1970), 153–189.
    [19]
    Peter J. Denning. 1996. Virtual memory. ACM Computing Surveys 28, 1 (Mar. 1996), 213–216.
    [20]
    Hewlett Packard Enterprise. 2008. Tunable Base Page Size.
    [21]
    Zhen Fang, Lixin Zhang, J. B. Carter, W. C. Hsieh, and S. A. McKee. 2001. Reevaluating online superpage promotion with hardware support. In Proceedings HPCA 7th International Symposium on High-Performance Computer Architecture. 63–72.
    [22]
    Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, and Michael M. Swift. 2014. Efficient memory virtualization: Reducing dimensionality of nested page walks. In Proceedings of the MICRO’14. 178–189.
    [23]
    Siddharth Gupta, Atri Bhattacharyya, Yunho Oh, Abhishek Bhattacharjee, Babak Falsafi, and Mathias Payer. 2021. Rebooting virtual memory with Midgard. In Proceedings of the ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA’21). IEEE Press, 512–525.
    [24]
    Faruk Guvenilir and Yale N. Patt. 2020. Tailored page sizes. In Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA’20). IEEE Press, 900–912.
    [25]
    Nastaran Hajinazar, Pratyush Patel, Minesh Patel, Konstantinos Kanellopoulos, Saugata Ghose, Rachata Ausavarungnirun, Geraldo F. Oliveira, Jonathan Appavoo, Vivek Seshadri, and Onur Mutlu. 2020. The virtual block interface: A flexible alternative to the conventional virtual memory framework. In Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA’20). IEEE Press, 1050–1063.
    [26]
    ARM Holdings. 2022. ARM architecture reference manual for A-profile architecture. White Paper, ARM, Cambridge, UK.
    [27]
    Amazon Inc. 2018. Amazon EC2 High Memory Instances with 6, 9, and 12 TB of Memory, Perfect for SAP HANA. Retrieved from https://aws.amazon.com/cn/blogs/aws/now-available-amazon-ec2-high-memory-instances-with-6-9-and-12-tb-of-memory-perfect-for-sap-hana/.
    [28]
    Google Inc. 2022. Google Compute Engine Pricing - Google Cloud. Retrieved from https://cloud.google.com/compute/all-pricing/.
    [29]
    Intel Corporation. 2016. 5-Level Paging and 5-Level EPT. White Paper, Intel.
    [30]
    B. Jacob and T. Mudge. 1998. Virtual memory in contemporary microprocessors. IEEE Micro 18, 4 (July 1998), 60–75.
    [31]
    Xu Ji, Chao Wang, Nosayba El-Sayed, Xiaosong Ma, Youngjae Kim, Sudharshan S. Vazhkudai, Wei Xue, and Daniel Sanchez. 2017. Understanding object-level memory access patterns across the spectrum. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’17). Association for Computing Machinery, New York, NY, Article 25, 12 pages.
    [32]
    Gokul B. Kandiraju and Anand Sivasubramaniam. 2002. Going the distance for TLB prefetching: An application-driven study. In Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA’02). IEEE Computer Society, 195–206.
    [33]
    Vasileios Karakostas, Jayneel Gandhi, Furkan Ayar, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, and Osman Ünsal. 2015. Redundant memory mappings for fast access to large memories. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA’15). Association for Computing Machinery, New York, NY, 66–78.
    [34]
    Vasileios Karakostas, Osman S. Unsal, Mario Nemirovsky, Adrian Cristal, and Michael Swift. 2014. Performance analysis of the memory management unit under scale-out workloads. In Proceedings of the 2014 IEEE International Symposium on Workload Characterization (IISWC). 1–12.
    [35]
    Yonghae Kim, Jaekyu Lee, and Hyesoon Kim. 2020. Hardware-based Always-On Heap Memory Safety. In Proceedings of the MICRO’20. 1153–1166.
    [36]
    Christos Kozyrakis, Aman Kansal, Sriram Sankar, and Kushagra Vaid. 2010. Server engineering insights for large-scale online services. IEEE Micro 30, 4 (2010), 8–19.
    [37]
    Taddeus Kroes, Koen Koning, Erik van der Kouwe, Herbert Bos, and Cristiano Giuffrida. 2018. Delta pointers: Buffer overflow checks without the checks. In Proceedings of the 13th EuroSys Conference (EuroSys’18). Association for Computing Machinery, New York, NY, Article 22, 14 pages.
    [38]
    Albert Kwon, Udit Dhawan, Jonathan M. Smith, Thomas F. Knight, and Andre DeHon. 2013. Low-Fat pointers: Compact encoding and efficient gate-level implementation of fat pointers for spatial safety and capability-based security. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security (CCS’13). Association for Computing Machinery, New York, NY, 721–732.
    [39]
    Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. 2016. Coordinated and efficient huge page management with ingens. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI’16). USENIX Association, 705–721.
    [40]
    Michael LeMay, Joydeep Rakshit, Sergej Deutsch, David M. Durham, Santosh Ghosh, Anant Nori, Jayesh Gaur, Andrew Weiler, Salmin Sultana, Karanvir Grewal, and Sreenivas Subramoney. 2021. Cryptographic capability computing. In Proceedings of the MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’21). Association for Computing Machinery, New York, NY, 253–267.
    [41]
    Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. 2005. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’05). Association for Computing Machinery, New York, NY, 190–200.
    [42]
    Marshall Kirk McKusick, George V. Neville-Neil, and Robert N. M. Watson. 2014. The design and implementation of the FreeBSD operating system. Pearson Education.
    [43]
    Myoung Jin Nam, Periklis Akritidis, and David J. Greaves. 2019. FRAMER: A tagged-pointer capability system with memory safety applications. In Proceedings of the 35th Annual Computer Security Applications Conference (ACSAC’19). Association for Computing Machinery, New York, NY, 612–626.
    [44]
    Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. 2002. Practical, transparent operating system support for superpages. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI’02). USENIX Association, 89–104.
    [45]
    Ashish Panwar, Sorav Bansal, and K. Gopinath. 2019. HawkEye: Efficient fine-grained os support for huge pages. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’19). Association for Computing Machinery, 347–360.
    [46]
    Chang Hyun Park, Taekyung Heo, Jungi Jeong, and Jaehyuk Huh. 2017. Hybrid TLB coalescing: Improving TLB translation coverage under diverse fragmented memory allocations. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA’17). Association for Computing Machinery, New York, NY, 444–456.
    [47]
    Erez Perelman, Greg Hamerly, Michael Van Biesbrouck, Timothy Sherwood, and Brad Calder. 2003. Using simpoint for accurate and efficient simulation. In Proceedings of the 2003 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’03). Association for Computing Machinery, New York, NY, 318–319.
    [48]
    Binh Pham, Abhishek Bhattacharjee, Yasuko Eckert, and Gabriel H. Loh. 2014. Increasing TLB reach by exploiting clustering in page translations. In Proceedings of the HPCA’14. 558–567.
    [49]
    Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek Bhattacharjee. 2012. CoLT: Coalesced large-reach TLBs. In Proceedings of the MICRO’12. IEEE Computer Society, 258–269.
    [50]
    Venkat Sri Sai Ram, Ashish Panwar, and Arkaprava Basu. 2021. Trident: Harnessing architectural resources for all page sizes in X86 processors. In Proceedings of the MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’21). Association for Computing Machinery, New York, NY, 1106–1120.
    [51]
    Jee Ho Ryoo, Nagendra Gulur, Shuang Song, and Lizy K. John. 2017. Rethinking TLB designs in virtualized environments: A very large part-of-memory TLB. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA’17). Association for Computing Machinery, New York, NY, 469–480.
    [52]
    Gururaj Saileshwar, Rick Boivie, Tong Chen, Benjamin Segal, and Alper Buyuktosunoglu. 2022. HeapCheck: Low-cost hardware support for memory safety. ACM Transactions on Architecture and Code Optimization 19, 1, Article 10 (Jan. 2022), 24 pages.
    [53]
    Rasool Sharifi and Ashish Venkat. 2020. CHEx86: Context-Sensitive enforcement of memory safety via microcode-enabled capabilities. In Proceedings of the ISCA’20. 762–775.
    [54]
    Kirill A. Shutemov. 2016. 5-level paging. Retrieved from https://lwn.net/Articles/708526/.
    [55]
    Mark Swanson, Leigh Stoller, and John Carter. 1998. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA’98). IEEE Computer Society, 204–213.
    [56]
    Madhusudhan Talluri and Mark D. Hill. 1994. Surpassing the TLB performance of superpages with less operating system support. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI). Association for Computing Machinery, New York, NY, 171–182.
    [57]
    Po-An Tsai, Yee Ling Gan, and Daniel Sanchez. 2018. Rethinking the memory hierarchy for modern languages. In Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-51). IEEE Press, 203–216.
    [58]
    Nandita Vijaykumar, Ataberk Olgun, Konstantinos Kanellopoulos, F. Nisa Bostanci, Hasan Hassan, Mehrshad Lotfi, Phillip B. Gibbons, and Onur Mutlu. 2022. MetaSys: A practical open-source metadata management system to implement and evaluate cross-layer optimizations. ACM Transactions on Architecture and Code Optimization 19, 2, Article 26 (Mar. 2022), 29 pages.
    [59]
    Pinchas Weisberg and Yair Wiseman. 2015. Virtual memory systems should use larger pages rather than the traditional 4KB pages. International Journal of Hybrid Information Technology 8, 8 (2015), 57–68.
    [60]
    D. A. Wood, S. J. Eggers, G. Gibson, M. D. Hill, and J. M. Pendleton. 1986. An in-cache address translation mechanism. In Proceedings of the 13th Annual International Symposium on Computer Architecture (ISCA’86). IEEE Computer Society Press, 358–365.
    [61]
    Shengjie Xu, Wei Huang, and David Lie. 2021. In-Fat pointer: Hardware-Assisted tagged-pointer spatial memory safety defense with subobject granularity protection. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2021). Association for Computing Machinery, New York, NY, 224–240.
    [62]
    Zi Yan, Daniel Lustig, David Nellans, and Abhishek Bhattacharjee. 2019. Translation ranger: Operating system support for contiguity-aware TLBs. In Proceedings of the 46th International Symposium on Computer Architecture (ISCA’19). Association for Computing Machinery, 698–710.
