
Huge pages part 1 (Introduction)


February 16, 2010

This article was contributed by Mel Gorman

[Editor's note: this article is the first in a five-part series on the use of huge pages with Linux. We are most fortunate to have core VM hacker Mel Gorman as the author of these articles! The remaining installments will appear in future LWN Weekly Editions.]

One of the driving forces behind the development of Virtual Memory (VM) was to reduce the programming burden associated with fitting programs into limited memory. A fundamental property of VM is that the CPU references a virtual address that is translated via a combination of software and hardware to a physical address. This allows information to be paged into memory only on demand (demand paging), improving memory utilisation, allows modules to be arbitrarily placed in memory for linking at run-time and provides a mechanism for the protection and controlled sharing of data between processes. Use of virtual memory is so pervasive that it has been described as “one of the engineering triumphs of the computer age” [denning96], but this indirection is not without cost.

Typically, the total number of translations required by a program during its lifetime will require that the page tables are stored in main memory. Due to translation, a virtual memory reference necessitates multiple accesses to physical memory, multiplying the cost of an ordinary memory reference by a factor depending on the page table format. To cut the costs associated with translation, VM implementations take advantage of the principle of locality [denning71] by storing recent translations in a cache called the Translation Lookaside Buffer (TLB) [casep78,smith82,henessny90]. The amount of memory that can be translated by this cache is referred to as the "TLB reach" and depends on the size of the page and the number of TLB entries. Inevitably, a percentage of a program's execution time is spent accessing the TLB and servicing TLB misses.
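
As a rough illustration of TLB reach (the 64-entry TLB below is an assumption chosen for easy arithmetic, not a measurement of any particular CPU), compare the reach of the same number of entries covering 4KB base pages and 2MB huge pages:

  $ echo $((64 * 4 * 1024))          # 64 entries x 4KB pages = 256KB of reach
  262144
  $ echo $((64 * 2 * 1024 * 1024))   # 64 entries x 2MB pages = 128MB of reach
  134217728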

The amount of time spent translating addresses depends on the workload as the access pattern determines if the TLB reach is sufficient to store all translations needed by the application. On a miss, the exact cost depends on whether the information necessary to translate the address is in the CPU cache or not. To work out the amount of time spent servicing the TLB misses, there are some simple formulas:

Cycles_tlbhit = TLBHitRate * TLBHitPenalty

Cycles_tlbmiss_cache = TLBMissRate_cache * TLBMissPenalty_cache

Cycles_tlbmiss_full = TLBMissRate_full * TLBMissPenalty_full

TLBMissCycles = Cycles_tlbmiss_cache + Cycles_tlbmiss_full

TLBMissTime = TLBMissCycles / ClockRate
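
As a purely illustrative application of these formulas (all of the rates, penalties and the clock speed below are made-up numbers, not measurements), suppose a program incurs 10 million TLB misses satisfied from the CPU cache at 30 cycles each, plus 1 million misses requiring a full table walk at 200 cycles each, on a 2GHz processor:

  $ echo "scale=4; (10000000 * 30 + 1000000 * 200) / 2000000000" | bc
  .2500

In this made-up case, a quarter of a second is spent servicing TLB misses; whether that matters depends on the total runtime of the program.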

If the TLB miss time is a large percentage of overall program execution time, then it is worth investing effort in reducing the miss rate to achieve better performance. One means of achieving this is to translate addresses in larger units than the base page size, as supported by many modern processors.
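
On Linux, the huge page size supported by the running kernel can be checked from userspace. The output below is from an example x86-64 system with 2MB huge pages; other architectures will report different sizes and, on 2.6.27 and later kernels, possibly several of them under /sys:

  $ grep Hugepagesize /proc/meminfo
  Hugepagesize:       2048 kB
  $ ls /sys/kernel/mm/hugepages/
  hugepages-2048kB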

Using more than one page size was identified in the 1990s as one means of reducing the time spent servicing TLB misses by increasing TLB reach. The benefits of huge pages are twofold. The obvious performance gain is from fewer translations requiring fewer cycles. A less obvious benefit is that address translation information is typically stored in the L2 cache. With huge pages, more cache space is available for application data, which means that fewer cycles are spent accessing main memory. Broadly speaking, database workloads will gain about 2-7% performance using huge pages whereas scientific workloads can range between 1% and 45%.

Huge pages are not a universal gain, so transparent support for huge pages is limited in mainstream operating systems. On some TLB implementations, there may be different numbers of entries for small and huge pages. If the CPU supports a smaller number of TLB entries for huge pages, it is possible that huge pages will be slower if the workload's reference pattern is very sparse, making only a small number of references per huge page. There may also be architectural limitations on where in the virtual address space huge pages can be used.

Many modern operating systems, including Linux, support huge pages in a more explicit fashion, although this does not necessarily mandate application change. Linux has had support for huge pages since around 2003, when they were mainly used for large shared memory segments in database servers such as Oracle and DB2. Early support required application modification, which was considered by some to be a major problem. To compound the difficulties, tuning a Linux system to use huge pages was perceived to be a difficult task. There have been significant improvements made over the years to huge page support in Linux and, as these articles will show, using huge pages today can be a relatively painless exercise that involves no source modification.

This first article begins by installing some huge-page-related utilities and support libraries that make tuning and using huge pages a relatively painless exercise. It then covers the basics of how huge pages behave under Linux and some details of concern on NUMA. The second article covers the different interfaces to huge pages that exist in Linux. In the third article, the different considerations to make when tuning the system are examined as well as how to monitor huge-page-related activities in the system. The fourth article shows how easily benchmarks for different types of application can use huge pages without source modification. For the very curious, some in-depth details on TLBs and measuring the cost within an application are discussed before concluding.

1 Huge Page Utilities and Support Libraries

There are a number of support utilities and a library packaged collectively as libhugetlbfs. Distributions may provide packages, but this article assumes that libhugetlbfs 2.7 is installed. The latest version can always be cloned from git using the following instructions:

  $ git clone git://libhugetlbfs.git.sourceforge.net/gitroot/libhugetlbfs/libhugetlbfs
  $ cd libhugetlbfs
  $ git checkout -b next origin/next
  $ make PREFIX=/usr/local

There is an install target that installs the library and all support utilities, but install-bin, install-stat and install-man targets are also available in the event that an existing library should be preserved during installation.
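
For example, using the PREFIX from the build step above, a full installation or a utilities-only installation (leaving any existing library untouched) might look like the following; adjust PREFIX and privileges to suit the system:

  $ sudo make PREFIX=/usr/local install       # library, utilities and manual pages
  $ sudo make PREFIX=/usr/local install-bin   # utilities only; preserves an existing library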

The library provides support for automatically backing text, data, heap and shared memory segments with huge pages. In addition, this package also provides a programming API and manual pages. The behaviour of the library is controlled by environment variables (as described in the libhugetlbfs.7 manual page), with a launcher utility hugectl that knows how to configure almost all of the variables. hugeadm, hugeedit and pagesize provide information about the system and support for system administration. tlbmiss_cost.sh automatically calculates the average cost of a TLB miss. cpupcstat and oprofile_start.sh provide help with monitoring the current behaviour of the system. Manual pages are available describing each utility in further detail.
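
As a taste of the utilities (the exact options vary between libhugetlbfs versions, so treat these invocations as examples and consult the manual pages), hugeadm can report the state of the huge page pools and hugectl can launch a program with parts of it backed by huge pages:

  $ hugeadm --pool-list            # show configured huge page pool sizes and counts
  $ hugectl --heap ./your-app      # ./your-app is a placeholder for any program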

2 Huge Page Fault Behaviour

In the following articles, there will be discussions of how different types of memory regions can be created and backed with huge pages. One important common point between them all is how huge pages are faulted and when the huge pages are allocated. Further, there are important differences between shared and private mappings depending on the exact kernel version used.

In the initial support for huge pages on Linux, huge pages were faulted at the same time as mmap() was called. This guaranteed that all references would succeed for shared mappings once mmap() returned successfully. Private mappings were safe until fork() was called. Once it was called, it was important that the child call exec() as soon as possible, or that the huge page mappings be marked MADV_DONTFORK with madvise() in advance. Otherwise, a Copy-On-Write (COW) fault could result in the failure of either parent or child in the event of allocation failure.

Pre-faulting pages drastically increases the cost of mmap() and can perform sub-optimally on NUMA. Since 2.6.18, huge pages have been faulted in the same way as normal mappings: when the page is first referenced. To guarantee that faults would succeed, huge pages were reserved at the time a shared mapping was created, but private mappings made no reservations. This was unfortunate, as it meant an application could fail even without fork() being called. libhugetlbfs handles the private mapping problem on old kernels by using readv() to make sure the mapping is safe to access, but this approach is less than ideal.

Since 2.6.29, reservations are made for both shared and private mappings. Shared mappings are guaranteed to successfully fault regardless of what process accesses the mapping.

For private mappings, the number of child processes is indeterminable, so only the process that creates the mapping with mmap() is guaranteed to fault successfully. When that process fork()s, two processes are now accessing the same pages. If the child performs COW, an attempt will be made to allocate a new huge page. If that allocation succeeds, the fault completes successfully. If it fails, the child is terminated with a message logged to the kernel log noting that there were insufficient huge pages. If it is the parent process that performs COW, an attempt will also be made to allocate a huge page. In the event that allocation fails, the child's pages are unmapped and the event recorded. The parent successfully completes the fault, but if the child later accesses the now-unmapped page, it will be terminated.
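
The reservation behaviour described above can be observed from the command line via the huge page counters in /proc/meminfo; the values shown below are purely illustrative. HugePages_Rsvd rises when a mapping is created and falls as the pages are actually faulted in:

  $ grep Huge /proc/meminfo
  HugePages_Total:      64
  HugePages_Free:       64
  HugePages_Rsvd:       16
  HugePages_Surp:        0
  Hugepagesize:       2048 kB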

3 Huge Pages and Swap

There is no support for the paging of huge pages to backing storage.

4 Huge Pages and NUMA

On NUMA, memory can be local or remote to the CPU, with a significant penalty incurred for remote access. By default, Linux uses a node-local policy for the allocation of memory at page fault time. This policy applies to both base pages and huge pages. This leads to an important consideration when implementing a parallel workload.
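
The node topology of a machine, and the memory attached to each node, can be examined with the numactl utility (assuming the numactl package is installed):

  $ numactl --hardware     # list nodes, their CPUs, memory sizes and inter-node distances
  $ numactl --show         # show the memory policy and CPU binding of the current shell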

The thread processing some data should be the same thread that caused the original page fault for that data. A general anti-pattern on NUMA is when a parent thread sets up and initialises all of the workload's memory areas and then creates threads to process the data. On a NUMA system this can result in some of the worker threads running on CPUs that are remote with respect to the memory they will access. While this applies to all NUMA systems regardless of page size, the effect can be pronounced on systems where the split between worker threads falls in the middle of a huge page, incurring more remote accesses than might otherwise have occurred.

This scenario may occur, for example, when using huge pages with OpenMP, because OpenMP does not necessarily divide its data on page boundaries. This could lead to problems with base pages too, but the problem is more likely with huge pages because a single huge page covers more data than a base page, making it more likely that any given huge page covers data to be processed by different threads. Consider the following scenario. The first thread to touch a page will fault the full page's data into memory local to the CPU on which the thread is running. When the data is not split on huge-page-aligned boundaries, such a thread will fault its own data and perhaps also some data that is to be processed by another thread, because the two threads' data fall within the range of the same huge page. The second thread will fault the rest of its data into local memory, but some of its accesses will still be remote. This problem manifests as large standard deviations in performance when doing multiple runs of the same workload with the same input data. Profiling in such a case may show that there are more cross-node accesses with huge pages than with base pages. In extreme circumstances, the performance with huge pages may even be slower than with base pages. For this reason it is important to consider on what boundary data is split when using huge pages on NUMA systems.

One workaround for this instance of the general problem is to use MPI in combination with OpenMP. The use of MPI allows division of the workload into one MPI process per NUMA node. Each MPI process is bound to the list of CPUs local to a node. Parallelisation within the node is achieved using OpenMP, thus alleviating the issue of remote access.
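
A minimal sketch of that kind of per-node binding, assuming a two-node machine with numactl available; the MPI launcher itself differs between implementations and is omitted here, and ./solver is a placeholder rather than a real program:

  # One process per NUMA node, with both its CPUs and its memory bound to that node.
  $ numactl --cpunodebind=0 --membind=0 ./solver &
  $ numactl --cpunodebind=1 --membind=1 ./solver &
  $ wait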

5 Summary

In this article, the background to huge pages was introduced, along with the potential performance benefits and some basics of how huge pages behave on Linux. The next article (to appear in the near future) discusses the interfaces used to access huge pages.


Details of publications referenced in these articles can be found in the bibliography at the end of Part 5.

This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the Defense Advanced Research Projects Agency.

Index entries for this article
Kernel: Huge pages
Kernel: hugetlbfs
Kernel: Memory management/Huge pages
GuestArticles: Gorman, Mel



Well-written article

Posted Feb 18, 2010 11:12 UTC (Thu) by sdalley (subscriber, #18550) [Link]

Well done Mel Gorman for writing about a deeply technical topic in such a comprehensible way! I actually understood it 80% on the first pass.

It's a real art to be able to define unfamiliar concepts as you go without over-simplifying on the one hand or being lost in nested interrupts on the other.

The only things I had to look up were MPI and OpenMP but that was easily done.

About NUMA

Posted Feb 18, 2010 14:35 UTC (Thu) by cma (guest, #49905) [Link] (3 responses)

I'm curious... If my app is NUMA aware (let's say it's an old fashioned threaded app), in this case woould it be interesting in enabling in BIOS (in concrete for a Dell R610 server) the memory option NODE INTERLEAVING? Or just let BIOS enable NUMA behavior (NODE INTERLEVING disabled)? Thanks and congrats for this great article! Regards

About NUMA - corrigenda

Posted Feb 18, 2010 14:36 UTC (Thu) by cma (guest, #49905) [Link]

Sorry about the typo: I was meaning: if my app WAS NOT NUMA aware...

About NUMA

Posted Feb 18, 2010 15:00 UTC (Thu) by mel (subscriber, #5484) [Link] (1 responses)

> I'm curious... If my app is NUMA aware (let's say it's an old fashioned threaded app), in this
> case woould it be interesting in enabling in BIOS (in concrete for a Dell R610 server) the
> memory option NODE INTERLEAVING? Or just let BIOS enable NUMA behavior (NODE
> INTERLEVING disabled)? Thanks and congrats for this great article!

s/NUMA aware/not NUMA aware/

It depends on whether your application fits in one node or not. If it fits in one node, then leave NODE_INTERLEAVING off and use taskset to bind the application to one node's worth of CPUs. The memory will be allocated locally and performance will be decent.

If the application needs the whole machine and one thread faults all of the memory before spawning other threads (a common anti-pattern for NUMA), then NODE_INTERLEAVING will give good average performance. Without interleaving, your performance will sometimes be great and other times really poor depending on whether the thread is running on the node that faulted the data or not. You don't need to go to the BIOS to test it out; launch the application with

  numactl --interleave=all your-application

About NUMA

Posted Feb 18, 2010 18:17 UTC (Thu) by cma (guest, #49905) [Link]

Mel, thanks A LOT! It's all very clear now! Best regards!

Huge pages part 1 (Introduction)

Posted Feb 19, 2010 10:57 UTC (Fri) by nix (subscriber, #2304) [Link] (10 responses)

I find myself looking at huge pages and thinking that huge pages are a feature that will be useful only for special-purpose single-use machines (basically just the two Mel mentions: simulation and databases, and perhaps here and there virtualization where the machine has *lots* of memory) until the damn things are swappable. Obviously we can't swap them as a unit (please wait while we write a gigabyte out), so we need to break the things up and swap bits of them, then perhaps reaggregate them into a Gb page once they're swapped back in. Yes, it's likely to be a complete sod to implement, but it would also give that nice TLB-hit speedup feeling without having to worry if you're about to throw the rest of your system into thrash hell as soon as the load on it increases. (Obviously once you *do* start swapping speed goes to hell anyway, so the overhead of taking TLB misses is lost in the noise. In the long run, defragmenting memory on swapin and trying to make bigger pages out of it without app intervention seems like a good idea, especially on platforms like PPC with a range of page sizes more useful than 4Kb/1Gb.)

IIRC something similar to this was discussed in the past, maybe as part of the defragmentation work? ... I also dimly remember Linus being violently against it but I can't remember why.

Huge pages part 1 (Introduction)

Posted Feb 19, 2010 12:03 UTC (Fri) by farnz (subscriber, #17727) [Link] (4 responses)

What you really want (but it's difficult to do sanely on x86) is transparent huge pages. Where possible, the kernel gives you contiguous physical pages for contiguous virtual pages, and it transparently converts suitable sets of contiguous virtual pages to the next size of mapping up when it can do so, and splits large mappings into the next size down when they're not in use, or when there's memory pressure.

The pain on x86 is twofold: first, instead of getting to aggregate (e.g.) 16 4K pages into a 16K page, then 16 16K pages into a 256K page, you get to do things like aggregate 1024 4K pages into a 4M page, and 256 4M pages into a 1GB page. Second, typical x86 TLBs are split by page type; so it's not uncommon to have something like the Core 2 Duo, where you have 128 entries for 4K pages, and just 4 entries for 4M pages (Instruction TLB).

Given that split, most workloads gain more from having the kernel always in the TLB, than from evicting the kernel in favour of your own code (which would have been in the 4K page size TLB otherwise).

Huge pages part 1 (Introduction)

Posted Feb 19, 2010 14:15 UTC (Fri) by paulj (subscriber, #341) [Link] (2 responses)

Doesn't Linux already use huge pages for the kernel? Or am I misremembering
something?

Huge pages part 1 (Introduction)

Posted Feb 19, 2010 14:19 UTC (Fri) by farnz (subscriber, #17727) [Link] (1 responses)

It does. The work being done is for huge pages for userspace, which is a whole different ballgame, and could result in the kernel's hugepage mapping being pushed out of the TLB.

If/when someone does the work, it'll need benchmarking not just on the latest and greatest, but also on real-world older systems with more restrictive TLBs, to see if it's a net loss.

Huge pages part 1 (Introduction)

Posted Feb 20, 2010 16:33 UTC (Sat) by nix (subscriber, #2304) [Link]

This stuff could presumably autotune, kicking in only on CPUs for which it
is a net win.

Huge pages part 1 (Introduction)

Posted Feb 20, 2010 16:31 UTC (Sat) by nix (subscriber, #2304) [Link]

Oh, hell, I forgot about the split TLB. That makes the whole resource
allocation problem drastically harder :/ Still, does the kernel need more
than one or two entries? If you're not using hugepages currently, it seems
to me that some of those hugepage TLB entries are actually wasted...

Alan Cox on FreeBSD

Posted Feb 23, 2010 5:22 UTC (Tue) by man_ls (guest, #15091) [Link] (4 responses)

You are probably thinking of this excellent LWN article from 2007 and the excellent Navarro paper (with a contribution by Alan Cox). The fact that they decided to bring transparent huge pages support to FreeBSD and not to Linux is funny, considering.

Alan Cox on FreeBSD

Posted Feb 23, 2010 6:10 UTC (Tue) by viro (subscriber, #7872) [Link]

ac != alc... IOW, it's not who you probably think it is.

Alan Cox on FreeBSD

Posted Feb 23, 2010 21:53 UTC (Tue) by nix (subscriber, #2304) [Link] (2 responses)

I wish I had been thinking of that paper, but I had no idea it existed.
Unfortunately all the links to it appear dead :( 10.1.1.14.2514 is not a
valid DOI as far as I can tell, and the source link throws an error page
at me...

Alan Cox on FreeBSD

Posted Feb 24, 2010 19:37 UTC (Wed) by biged (guest, #50106) [Link] (1 responses)

Alan Cox on FreeBSD

Posted Feb 24, 2010 22:47 UTC (Wed) by nix (subscriber, #2304) [Link]

Yay, thank you!

Huge pages part 1 (Introduction)

Posted Jul 7, 2010 4:48 UTC (Wed) by glennewton (guest, #64085) [Link]

I've collected a number of good resources for huge pages for Linux, Java, Solaris, MySql, AMD here: Java, MySql increased performance with Huge Pages.

Huge pages part 1 (Introduction)

Posted Dec 26, 2012 8:00 UTC (Wed) by heguanjun (guest, #88525) [Link]

Well, now there is another huge page implementation: transparent huge pages, with many benefits.


Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds