6.1 Internal Fragmentation of Eager Paging
The eager paging strategy allocates physical memory immediately after a virtual memory area is created. However, if a program does not access all of the allocated virtual pages, the untouched physical pages are wasted, causing internal fragmentation. We tracked the target address of every load/store instruction to find the pages actually used within memory ranges. When a page is accessed for the first time, it is counted as an allocated page because, from then on, it must be backed by a physical page.
Table 4 shows the total number of pages requested by memory ranges and the number of allocated pages. Most of the requested pages are actually accessed, so eager paging should not cause serious internal fragmentation. Here, we take the percentage of pages never allocated as the internal fragmentation. As shown in the table, the eager paging strategy leads to less than 10% internal fragmentation for most benchmarks.
We also compared against the internal fragmentation of adopting 2 MB/1 GB huge pages. Currently, using 1 GB pages in a program relies on the hugetlbfs [15] mechanism, which requires modifications to the source code and makes the program manage some of the memory on its own. Here, we assume that hugetlbfs is used and that the program can perfectly compact all the ranges into 2 MB/1 GB huge pages. We then estimate the number of huge pages by rounding up the sum of range sizes. Fragmentation results in Table 4 show that the eager paging strategy leads to far less internal fragmentation than using 1 GB pages to support memory ranges. Adopting 2 MB huge pages uses a similar amount of memory to eager paging but requires many more TLB entries.
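If \(s_r\) denotes the size of range \(r\) (notation ours), the estimate above is simply the ceiling of the compacted total:

\[ N_{2\,\mathrm{MB}} = \left\lceil \frac{\sum_r s_r}{2\,\mathrm{MB}} \right\rceil , \qquad N_{1\,\mathrm{GB}} = \left\lceil \frac{\sum_r s_r}{1\,\mathrm{GB}} \right\rceil , \]

and the internal fragmentation of each huge-page configuration is the rounded-up remainder divided by the total memory the huge pages occupy.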
6.2 Reduction of TLB Misses and Page Walks
We evaluate the TLB hit/miss rates of FlexPointer with a functional TLB simulator, which models the L1/L2 page TLBs and the page table walker. The simulator uses 4 KB pages. We use Simpoint [47] to locate simulation points and extract load/store instructions with PIN [41]. Each simulation point consists of 1 billion instructions. We also validated the MPKI of the traces (baseline configuration enabling only page TLBs) against performance counter measurements of native executions to ensure that the sampled regions are representative. The coverage of simulation points is over 90% on every benchmark.
According to the results from previous studies [13, 31] and our evaluations in Table 1, we set the large-object threshold to 64 KB to constrain the number of ranges. We implement a malloc() wrapper function as discussed in Section 3.2 and replace the original malloc() with it. We then use PIN to catch all calls to mmap(), mremap(), and munmap() as hints for range creation and reclamation. These function calls are recorded both inside and outside simulation points to completely reflect range creations and reclamations.
We also implemented RMM [33] and CoLT [49] in our simulator. For CoLT, we assume an ideal situation where every TLB entry covers eight contiguous physical pages. We let FlexPointer and RMM create ranges according to the same threshold. For FlexPointer, all tagged addresses go to the range TLB while the others go to the page TLB; for RMM, the range TLB is searched only after an L1 TLB miss. In FlexPointer, a page walk happens either after the L2 page TLB misses or after the range TLB misses; in RMM, a page walk happens only after both the L2 page TLB and the range TLB miss. Misses caused by sub-ranges (generated by remapping) are included in the range TLB misses.
Figures 9 and 10 show the reduction in the number of L1 TLB misses and page table walks, respectively. RMM only searches the range TLB after an L1 TLB miss, so it cannot reduce L1 TLB misses. For most benchmarks, FlexPointer reduces 70–99+% of L1 TLB misses and page walks. Table 5 illustrates how FlexPointer and RMM affect the page TLB miss rates. FlexPointer mainly reduces the L1 TLB miss rate. For some benchmarks, the L2 miss rate instead increases: the decrease in L1 TLB misses reduces the total number of L2 TLB accesses, while some untagged addresses still miss in both the L1 and L2 page TLBs.
For gcc, xalancbmk, and omnetpp, neither FlexPointer nor RMM shows a good reduction in TLB misses. Referring to the coverage results in Table 1, we find that for FlexPointer and RMM, the coverage of large objects is consistent with the TLB miss and page walk reduction. We then examined the allocation patterns of these benchmarks and found that they allocate many more objects than the others, and most of those objects are below the threshold. These three benchmarks therefore benefit little from FlexPointer and RMM because the memory coverage of ranges is low. It should also be noted, however, that although the reduction in TLB misses and page walks is low, FlexPointer does not hurt the performance of these benchmarks.
On the other hand, CoLT can reduce many of the L1 TLB misses and page table walks for gcc, xalancbmk, and omnetpp, but cannot reduce as many on the other benchmarks. Because CoLT increases the coverage of a TLB entry at a relatively small scale, its efficiency on benchmarks that use only a few very large objects is limited. TLB coalescing techniques like CoLT only modify the page TLB, so they are orthogonal to FlexPointer and can be combined with it to improve performance for applications with many small objects.
We also note that FlexPointer reduces more page walks than RMM on all benchmarks except cactuBSSN. We suspected that cactuBSSN suffers from range TLB thrashing, so we varied the range TLB capacity and tested again. As Figure 11 shows, the number of L1 range TLB misses drops drastically when the L1 range TLB capacity grows from 50 to 51 entries, confirming that cactuBSSN needs a larger range TLB to avoid thrashing. Even while enduring range TLB thrashing, FlexPointer still reduces over 70% of page walks for cactuBSSN.
One possible problem of adding a range TLB is that it may instead introduce many range table walks if its hit rate is poor. We investigated the range TLB hit rates of the benchmarks using a 16-entry range TLB configuration. As shown in Table 6, FlexPointer induces almost no range TLB misses on any benchmark; even a benchmark experiencing thrashing, such as cactuBSSN, still achieves a hit rate over 93%. These results show that even a small range TLB provides very high hit rates, so range TLB misses will not become a bottleneck for FlexPointer. Table 6 also shows that a 16-entry range TLB is sufficient for most benchmarks.
6.3 Performance
The overall performance is evaluated with Mosmodel [1]. Mosmodel implements a special memory allocator to generate different page-size mixes and produces a performance model from performance counter measurements under those mixes. Denoting the L2 TLB hits, L2 TLB misses, and page walk cycles as \(H\), \(M\), and \(C\), respectively, Mosmodel is a third-degree polynomial model in these three parameters.
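The fitted polynomial itself is benchmark-specific and is not reproduced here; in general form (with coefficients \(a_{ijk}\) fitted from the measurements; notation ours), a third-degree model over the three counters is

\[ T \approx \sum_{\substack{i,j,k \ge 0\\ i+j+k \le 3}} a_{ijk}\, H^{i} M^{j} C^{k} , \]

where \(T\) is the predicted execution time in cycles.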
We first execute each benchmark to completion and use perf to collect the original performance counter data. Then we scale the three parameters of each benchmark proportionally according to the functional simulation results and feed them into the Mosmodel performance model to predict execution cycles.
For FlexPointer, the values of \(H\) and \(M\) come only from the page TLB because our range TLB works before the L2 page TLB, and we count range TLB misses as page walks. In this article, we assume a system with a low fragmentation level, so range table walks can complete on the first access, which means that a range table walk cannot be more costly than a default page walk. We denote the L2 page TLB hits and misses in the FlexPointer simulation as \(PHit_f\) and \(PMiss_f\), and the range TLB misses as \(RMiss_f\). The L2 TLB hits and misses of the baseline simulation (page TLB only) are denoted as \(PHit\) and \(PMiss\). The adjusted performance model parameters are denoted as \(C_p\), \(H_p\), and \(M_p\), and we scale them proportionally based on these counts.
For RMM, we view the range TLB as a means to reduce page walks. When an address hits in the L2 page TLB, we ignore the lookup result of the range TLB, so the number of L2 page TLB hits is the same as in the baseline. For an L2 page TLB miss, if the address hits in the range TLB, we view the miss as canceled by the range TLB and subtract it from the final miss count. Only when an address misses in both the L2 page TLB and the range TLB is it counted as an L2 page TLB miss, denoted as \(PMiss_r\); the model parameters are then scaled accordingly.
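Under the same proportional-scaling assumption (our reconstruction), the RMM parameters become

\[ H_p = H, \qquad M_p = M \cdot \frac{PMiss_r}{PMiss}, \qquad C_p = C \cdot \frac{PMiss_r}{PMiss} , \]

with \(H_p\) unchanged because the L2 page TLB hit count matches the baseline.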
As for CoLT, since it only modifies the page TLB, the parameter adjustment is straightforward. Denoting the L2 page TLB hits and misses of CoLT as \(PHit_c\) and \(PMiss_c\), respectively, we scale its parameters by the corresponding hit and miss ratios.
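Again assuming proportional scaling (our reconstruction), the CoLT parameters are

\[ H_p = H \cdot \frac{PHit_c}{PHit}, \qquad M_p = M \cdot \frac{PMiss_c}{PMiss}, \qquad C_p = C \cdot \frac{PMiss_c}{PMiss} . \]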
Figure 12 presents the speedups of FlexPointer, RMM, and CoLT. All results are normalized to the baseline system. We also compare with the speedups of THP as a reference, because the internal fragmentation of 2 MB huge pages is close to that of the eager paging strategy. Speedups of THP are calculated by comparing the elapsed time of execution with and without THP enabled on the real machine.
Among all benchmarks, gups benefits the most from FlexPointer and RMM: FlexPointer delivers a 180% performance improvement and RMM a 150% improvement. Considering that gups allocates one large area and performs random accesses within it, this benchmark presents the memory usage pattern that best fits range translation. On the other hand, CoLT improves gups little because of the limited coverage of its TLB entries and the poor memory locality.
For most benchmarks, FlexPointer performs better than RMM and CoLT. Some benchmarks, such as xz, graph500, XSBench, and NPB:CG, are accelerated by FlexPointer by 10–20%. The improvement for the remaining benchmarks is less significant, but most of them still gain around 5% with FlexPointer. For benchmarks with a high range coverage, FlexPointer offers a higher speedup than THP, except for XSBench and NPB:MG; for these two, one possible reason is that THP maps the stack with huge pages, while FlexPointer only works on the heap.
We also repeated the experiments on the remaining SPEC 2017 benchmarks to check whether FlexPointer causes performance degradation. Among them, exchange2, povray, perlbench, and nab allocate few or no memory ranges; they show results similar to gcc, xalancbmk, and omnetpp. Other benchmarks, such as namd, parest, blender, wrf, and leela, achieve a higher range coverage (around 40%). The remaining benchmarks (bwaves, lbm, x264, pop2, imagick, and fotonik3d) allocate even more large objects (range coverage over 80%). These benchmarks already have high TLB hit rates when using 4 KB pages, so they see only a small improvement of around 1–4%. None of these benchmarks experiences performance degradation.
Overall, FlexPointer offers an average performance improvement of around 14% (5.9% when excluding gups), while RMM and CoLT offer average improvements of 11% (4.7% excluding gups) and 3%, respectively. For comparison, THP offers a 12% (6.0% excluding gups) average improvement.