
Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.33-rc8, released on February 12. Linus says:

I think this is going to be the last -rc of the series, so please do test it out. A number of regressions should be fixed, and while the regression list doesn't make me _happy_, we didn't have the kind of nasty things that went on before -rc7 and made me worried.

Full details can be found in the changelog.

According to the latest regression report, the number of unresolved regressions has risen to 31, the highest point yet in this development cycle.

Comments (4 posted)

Quotes of the week

There _are_ things we can do though. Detect a write to the old file and emit a WARN_ON_ONCE("you suck"). Wait a year, turn it into WARN_ON("you really suck"). Wait a year, then remove it.
-- Feature deprecation Andrew Morton style

The post-Google standard company perks - free food, on-site exercise classes, company shuttles - make it trivial to speak only to fellow employees in daily life. If you spend all day with your co-workers, socialize only with your co-workers, and then come home and eat dinner with - you guessed it - your co-worker, you might go several years without hearing the words, "Run Solaris on my desktop? Are you f-ing kidding me?"
-- Valerie Aurora

Everybody takes it for granted to run megabytes of proprietary object code, without any memory protection, attached to an insecure public network (GSM). Who would do that with his PC on the Internet, without a packet filter, application level gateways and a constant flow of security updates of the software? Yet billions of people do that with their phones all the time.
-- Harald Welte

Comments (9 posted)

Compression formats for kernel.org

By Jonathan Corbet
February 17, 2010
The kernel.org repository depends heavily on compression to keep its storage and bandwidth expenses down. An uncompressed tarball for the 2.6.32 release weighs in at 365MB; if downloaders grabbed the data in this format, the resulting bandwidth usage would be huge. So kernel.org does not make uncompressed tarballs available; instead, one can choose between versions compressed with gzip (79MB) or bzip2 (62MB). Bzip2 is the newer choice; it took a while to catch on because the needed tools were not widely shipped. Now, though, the folks at kernel.org are considering making a change in the compression formats used there.

What's driving this discussion is the availability of the XZ tool, which is based on the LZMA compression algorithm. XZ offers better compression performance - 53MB on that 2.6.32 tarball - but it suffers from a familiar problem: the tools are not yet widely available in distributions, especially those of the "enterprise" variety. This has led to pushback against the idea of standardizing on XZ in the near future, as can be seen in this comment from Ted Ts'o:

Keep in mind that there are people out there who are still using RHEL 3, and some of them might want to download from ftp.kernel.org. So those people who are suggesting that we replace .gz files with .xz on kernel.org are *really* smoking something good.

In fact, there is little pressure to replace the gzip format anytime in the near future. Its compression performance may not be the best, but it does have the advantage of being far faster than any of the alternatives. From the discussion, it is fairly clear that some users care about decompression time. What is more likely is that XZ might eventually displace files in the bzip2 format. Then there would be a clear choice: speed and widespread availability or the best available compression. Even that change, though, is likely to be at least a year away; in the meantime, kernel.org will probably carry files in all three formats.

(This discussion also included a side thread on changing the 2.6.xx numbering scheme. Once again, though, the expected flame wars failed to materialize. There just does not seem to be much interest in or energy for this particular change.)

Comments (19 posted)

Extended error reporting

By Jonathan Corbet
February 17, 2010
Linux contains a number of system calls which do complex things; they take large structures as input, operate on significant internal state, and, perhaps, return some sort of complicated output data. The normal status returned from these system calls, however, is compressed down into a single integer called errno. Application programmers dealing with certain subsystems (Video4Linux2 being your editor's favorite in this regard) will all be well familiar with the process of trying to figure out what the problem is when the kernel says only "it failed."
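As a minimal sketch of the situation, using the Video4Linux2 subsystem mentioned above (the device path and the particular ioctl chosen here are illustrative assumptions, not taken from the discussion), every one of the driver's many possible complaints collapses into a single errno value:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/videodev2.h>

    int main(void)
    {
        struct v4l2_format fmt;
        int fd = open("/dev/video0", O_RDWR);    /* hypothetical device node */

        if (fd < 0) {
            perror("open");
            return 1;
        }

        memset(&fmt, 0, sizeof(fmt));
        fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        /* ... fill in a format the driver may reject for any of many reasons ... */

        if (ioctl(fd, VIDIOC_S_FMT, &fmt) < 0) {
            /* All the kernel reports is one errno value; which of the
               driver's many checks actually failed is anybody's guess. */
            fprintf(stderr, "VIDIOC_S_FMT: %s\n", strerror(errno));
            close(fd);
            return 1;
        }

        close(fd);
        return 0;
    }

Faced with nothing more than EINVAL, the developer is left to bisect the structure's fields by hand.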

Andi Kleen describes the problem this way:

I always describe that as the "ed approach to error handling". Instead of giving an error message you just give ?. Just ? happens to be EINVAL in Linux.

My favourite example of this is the configuration of the networking queueing disciplines, which configure complicated data structures and algorithms and in many cases have tens of different error conditions based on the input parameters -- and they all just report EINVAL.

It would be nice to provide application developers with better information than this. A brief discussion covered some of the options:

  • Use printk() to put information into the system logfile. This approach is widely used, but it bloats the kernel with string data, risks flooding the logs, and the resulting information may not be easily accessible to an unprivileged programmer.

  • Extend specific system calls to enable them to provide richer status information. Just adding a new version of ioctl() would address many of the worst problems.

  • Create an errno-like mechanism by which any system call could return extended information. That information could be an error string, some sort of special code, or, as Alan Cox suggested, a pointer to the structure field which caused the problem.

One could certainly argue that the narrow errno mechanism is showing its age and could use an upgrade. Any enhancements, though, would be Linux-specific and non-POSIX, which always tends to limit their uptake. They would also have to be lived with forever, and, thus, would require careful design. So we're unlikely to see a solution in the mainline anytime soon, even if somebody does take up the challenge.

Comments (9 posted)

Kernel development news

Merging kdb and kgdb

By Jake Edge
February 17, 2010

It was something of a surprise when Linus Torvalds merged kgdb—a stub to talk to the gdb debugger—back in the 2.6.26 merge window, because of his well-known disdain for kernel debuggers. But there is another kernel debugging solution that has long been out of the mainline: kdb. Jason Wessel has proposed merging the two solutions by reworking kgdb to use the "kdb shell" underneath, which would lead to both solutions being available for kernel hackers.

The two debuggers serve different purposes, with kdb having much less functionality, but they both have uses. Kgdb allows source-level debugging using gdb over a serial line, but that requires a separate system. For systems where it is painful or impractical to set up a serial connection, kdb may provide enough capability to debug a problem. In addition, things like kernel modesetting (KMS) allow for additional features that kdb has lacked. Wessel described one possibility:

A 2010 example of where kdb can be useful over kgdb is where you have a small netbook, no serial ports etc... and you are running X and your file system driver crashes the kernel. With kdb plus kms you can get an opportunity to see the crash which would have otherwise been lost from /var/log/messages because the crash was in the file system driver.

While kgdb allows access to all of the standard debugging commands that gdb provides, kdb has a much more limited command set. One can examine and change memory locations or registers, set breakpoints, and get a backtrace of the stack, but those commands typically require using addresses, rather than symbolic names. Currently, the best reference for kdb commands comes from a developerWorks article, though Wessel plans to change that. There is some documentation that comes with the patches, but a command reference will depend on exactly which pieces, if any, actually land in the mainline.

It should be noted that one of the capabilities that was removed from kdb as part of the merger is the disassembler. It was x86 specific, and the new code is "99% platform independent", according to the FAQ about the merged code. Because kgdb is implemented for many architectures, rewriting it atop kdb led to support for many more architectures for kdb. Instead of just the x86 family, kdb now supports arm, blackfin, mips, sh, powerpc, and sparc.

In addition, kgdb and kdb can work together. From a running kgdb session, one can use the gdb monitor command to access kdb commands. There are several that might be helpful like ps for a process list or dmesg to see log output.

The FAQ lists a number of other advantages that would come from the merge, beyond just getting kdb into the mainline so that its users no longer have to patch their kernels. The basic idea behind the advantages listed is to unite the users and developers of kgdb and kdb so that they are all pulling in the same direction, because "both kdb and kgdb have similar needs in terms of how they integrate into the kernel". There have been arguments in the past about which of the two solutions is best, but, since they serve different use cases, having both available would have another benefit: "No longer will people have to debate which is better, kdb or kgdb, why do we have only one... Just go use the best tool for the job."

Wessel notes that Ubuntu has enabled kgdb in recent kernels, which is something he would like to see done by other distributions. If kdb is available, that too could be enabled, which would make it easier for users to access the functionality:

My other hope is that the new kdb is much easier to use in the sense that the barrier of entry is much lower. For example, someone with a laptop running a kernel with a kdb enabled kernel can use it as easily as:
    echo kms,kbd > /sys/module/kgdboc/parameters/kgdboc
    echo g > /proc/sysrq-trigger
    dmesg
    bt
    go
And voila you just ran the kernel debugger.

In the example above, Wessel shows how to enable kdb (for keyboard (kbd) and KMS operation), then trap into it using sysrq-g (once enabled, kdb will also be invoked if there is a panic or oops). The following three commands are kdb commands for looking at log output, getting a stack backtrace, and continuing execution.

The patches themselves are broken up into three separate patchsets: the first and largest adds the kdb infrastructure into kernel/debug/ and moves kgdb.c into that directory, the second adds KMS support for kdb along with an experimental patch to do atomic modesetting for the i915 graphics driver, and the third allows kernel debugging via kdb or kgdb early in the boot process, starting from the point where earlyprintk() is available. Wessel is targeting 2.6.34 and, at least so far, the patches have been well received. The most recent posting is version 3 of the patchset, with a long list of changes made in response to earlier comments. Furthermore, an RFC about the idea last May gained a fair number of comments that clearly indicated there was interest in kdb and merging it with the kgdb code.

Sharp-eyed readers will note some similarities between this proposal and the recent utrace push. In both cases, an existing debugging facility was rewritten using a new core, but there are differences as well. Unlike utrace, the kdb/kgdb patches directly provide some lacking user-space functionality. Whether that is enough to overcome Torvalds's semi-hostile attitude towards kernel debuggers—though the inclusion of kgdb would seem to indicate some amount of softening—remains to be seen.

Comments (7 posted)

How old is our kernel?

By Jonathan Corbet
February 17, 2010
April 2005 was a bit of a tense time in the kernel development community. The BitKeeper tool which had done so much to improve the development process had suddenly become unavailable, and it wasn't clear what would replace it. Then Linus appeared with a new system called git; the current epoch of kernel development can arguably be dated from then. The opening event of that epoch was commit 1da177e4, the changelog of which reads:

Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it.

Let it rip!

The community did, indeed, let it rip; some 180,000 changesets have been added to the repository since then. Typically hundreds of thousands of lines of code are changed with each three-month development cycle. A while back, your editor began to wonder how much of the kernel had actually been changed, and how much of our 2.6.33-to-be kernel dates back to 2.6.12-rc2, which was tagged at the opening of the git era? Was there anything left of the kernel we were building in early 2005?

Answering this question is a simple matter of bashing out some ugly scripts and dedicating many hours of processing time. In essence, the "git blame" command can be used to generate an annotated version of a file which lists the last commit to change each line of code. Those commit IDs can be tallied, then associated with major version releases. At the end of the process, one has a simple table showing the percentage of the current kernel code base which was created for each major release since 2.6.12. Here's what it looks like:

[Pretty bar chart]

In summary: just over 31% of the kernel tree dates back to 2.6.12, and has not been modified since then. Our kernel may be changing quickly, but parts of it have not changed at all for nearly five years. Since then, we see a steady stream of changes, with more recent kernels being more strongly represented than the older ones. That curve will partly be a result of the general increase in the rate of change over time; 2.6.13 had fewer than 4,000 commits, while 2.6.33 will have almost 11,000. Still, one has to wonder what happened with 2.6.20 (5,000 commits) to cause that release to represent less than 2% of the total code base.

Much of the really old material is interspersed with newer lines in many files; comments and copyright notices, in particular, can go unchanged for a very long time. The 2.6.12 top-level makefile set VERSION=2 and PATCHLEVEL=6, and those lines have not changed since; the next line (SUBLEVEL=33) was changed in December.

There are interesting conclusions to be found at the upper end of the graph as well. Using this yardstick, 2.6.33 is the smallest development cycle we have seen in the last year, even though this cycle will have replaced some code added during the previous cycles. 4.2% of the code in 2.6.33 was last touched in the 2.6.33 cycle, while each of the previous four kernels (2.6.29 through 2.6.32) still represents more than 5.5% of the code to be shipped in 2.6.33.

Another interesting exercise is to look for entire files which have not been touched in five years. Given the amount of general churn and API change which has happened over that time, files which have not changed at all have a good chance of being entirely unused. Here is a full list of files which are untouched since 2.6.12 - all 1062 of them. Some conclusions:

  • Every kernel tarball carries around drivers/char/ChangeLog, which is mostly dedicated to documenting the mid-90's TTY exploits of Ted Ts'o. There is only one change since 1998, and that was in 2001. Files like this may be interesting from a historical point of view, but they have little relevance to current kernels.

  • Unsurprisingly, the documentation directory contains a great deal of material which has not been updated in a long time. Much of it need not change; the means by which one configures an ISA Sound Blaster card is pretty much as it always was - assuming one can find such a card and an ISA bus to plug it into. Similarly, Klingon language support (Documentation/unicode.txt), Netwinder support, and such have not seen much development activity recently, so the documentation can be deemed to be current, if not particularly useful. All told, 41% of the documentation directory dates back to 2.6.12. There was a big surge of documentation work in 2.6.32; without that, a larger percentage of this subtree would look quite old.

  • Some old interfaces haven't changed in a long time, resulting in a lot of static files in include/. <linux/sort.h> declares sort(), which is used in a number of places. <linux/fcdevice.h> declares alloc_fcdev(), and includes a warning that "This file will get merged with others RSN." Much of the sunrpc interface has remained static for a long time as well.

  • Ancient code abounds in the driver tree, though, perhaps unsurprisingly, old header files are much more common than old C files. The ISDN driver tree has been quite static.

  • Much of sound/oss has not been touched for a long time and must be nicely filled with cobwebs by now; there hasn't been much of a reason to touch the OSS code for some time.

  • net/decnet/TODO contains a "quick list of things that need finishing off"; it, too, hasn't been changed in the git era. One wonders how the DECnet hackers are doing on that list...

So which subsystem is the oldest? This plot looks at the kernel subsystems (as defined by top-level directories) and gives the percentage of 2.6.12 code in each:

[Oldest subsystems]

The youngest subsystem, unsurprisingly, is tools/, which did not exist prior to 2.6.29. Among subsystems which did exist in 2.6.12, the core kernel, with about 13% code dating from that release, is the newest. At the other end, the sound subsystem is more than 45% 2.6.12 code - the highest in the kernel. For those who are curious about the age distribution in specific subsystems, this page contains a chart for each.

In summary: even in a code base which is evolving as rapidly as the kernel, there is a lot of code which has not been touched - even by coding style or white space fixes - in the last five years. Code stays around for a long time.

(For those who would like to play with this kind of data, the scripts used have been folded into the gitdm repository at git://git.lwn.net/gitdm.git).

Note: this article has been edited to fix an error which overstated the amount of 2.6.12 code remaining in the full kernel.

Comments (55 posted)

Huge pages part 1 (Introduction)

February 16, 2010

This article was contributed by Mel Gorman

[Editor's note: this article is the first in a five-part series on the use of huge pages with Linux. We are most fortunate to have core VM hacker Mel Gorman as the author of these articles! The remaining installments will appear in future LWN Weekly Editions.]

One of the driving forces behind the development of Virtual Memory (VM) was to reduce the programming burden associated with fitting programs into limited memory. A fundamental property of VM is that the CPU references a virtual address that is translated via a combination of software and hardware to a physical address. This allows information to be paged into memory only on demand (demand paging), improving memory utilisation; it allows modules to be placed arbitrarily in memory for linking at run-time; and it provides a mechanism for the protection and controlled sharing of data between processes. Use of virtual memory is so pervasive that it has been described as “one of the engineering triumphs of the computer age” [denning96], but this indirection is not without cost.

Typically, the total number of translations required by a program during its lifetime requires that the page tables be stored in main memory. Due to translation, a virtual memory reference necessitates multiple accesses to physical memory, multiplying the cost of an ordinary memory reference by a factor that depends on the page table format. To cut the costs associated with translation, VM implementations take advantage of the principle of locality [denning71] by storing recent translations in a cache called the Translation Lookaside Buffer (TLB) [casep78,smith82,henessny90]. The amount of memory that can be translated by this cache is referred to as the "TLB reach" and depends on the size of the page and the number of TLB entries. Inevitably, a percentage of a program's execution time is spent accessing the TLB and servicing TLB misses.
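To make the idea of TLB reach concrete, consider a purely illustrative TLB with 64 data entries: with 4KB base pages those entries can translate 64 * 4KB = 256KB of memory before misses become unavoidable, while with 2MB huge pages the same 64 entries cover 64 * 2MB = 128MB.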

The amount of time spent translating addresses depends on the workload as the access pattern determines if the TLB reach is sufficient to store all translations needed by the application. On a miss, the exact cost depends on whether the information necessary to translate the address is in the CPU cache or not. To work out the amount of time spent servicing the TLB misses, there are some simple formulas:

    Cycles_tlbhit        = TLBHitRate * TLBHitPenalty
    Cycles_tlbmiss_cache = TLBMissRate_cache * TLBMissPenalty_cache
    Cycles_tlbmiss_full  = TLBMissRate_full * TLBMissPenalty_full
    TLBMissCycles        = Cycles_tlbmiss_cache + Cycles_tlbmiss_full
    TLBMissTime          = TLBMissCycles / ClockRate
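As a purely illustrative worked example (the figures below are invented, not measurements): suppose a program running on a 2GHz processor incurs 10^9 TLB misses over its lifetime, 90% of them satisfied from the CPU cache at 30 cycles each and the remaining 10% requiring a full page table walk at 300 cycles each. Then:

    Cycles_tlbmiss_cache = 0.9 * 10^9 * 30  = 2.7 * 10^10
    Cycles_tlbmiss_full  = 0.1 * 10^9 * 300 = 3.0 * 10^10
    TLBMissCycles        = 5.7 * 10^10
    TLBMissTime          = 5.7 * 10^10 / (2 * 10^9) = 28.5 seconds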

If the TLB miss time is a large percentage of overall program execution, then the time should be invested to reduce the miss rate and achieve better performance. One means of achieving this is to translate addresses in larger units than the base page size, as supported by many modern processors.

Using more than one page size was identified in the 1990s as one means of reducing the time spent servicing TLB misses by increasing TLB reach. The benefits of huge pages are twofold. The obvious performance gain is from fewer translations requiring fewer cycles. A less obvious benefit is that address translation information is typically stored in the L2 cache. With huge pages, more cache space is available for application data, which means that fewer cycles are spent accessing main memory. Broadly speaking, database workloads will gain about 2-7% performance using huge pages whereas scientific workloads can range between 1% and 45%.

Huge pages are not a universal gain, so transparent support for huge pages is limited in mainstream operating systems. On some TLB implementations, there may be different numbers of entries for small and huge pages. If the CPU supports a smaller number of TLB entries for huge pages, it is possible that huge pages will be slower if the workload's reference pattern is very sparse and makes only a small number of references per huge page. There may also be architectural limitations on where in the virtual address space huge pages can be used.

Many modern operating systems, including Linux, support huge pages in a more explicit fashion, although this does not necessarily mandate application change. Linux has had support for huge pages since around 2003, when they were mainly used for large shared memory segments in database servers such as Oracle and DB2. Early support required application modification, which was considered by some to be a major problem. To compound the difficulties, tuning a Linux system to use huge pages was perceived to be a difficult task. There have been significant improvements made over the years to huge page support in Linux and, as this article will show, using huge pages today can be a relatively painless exercise that involves no source modification.

This first article begins by installing some huge-page-related utilities and support libraries that make tuning and using huge pages a relatively painless exercise. It then covers the basics of how huge pages behave under Linux and some details of concern on NUMA. The second article covers the different interfaces to huge pages that exist in Linux. In the third article, the different considerations to make when tuning the system are examined as well as how to monitor huge-page-related activities in the system. The fourth article shows how easily benchmarks for different types of application can use huge pages without source modification. For the very curious, some in-depth details on TLBs and measuring the cost within an application are discussed before concluding.

1 Huge Page Utilities and Support Libraries

There are a number of support utilities and a library packaged collectively as libhugetlbfs. Distributions may have packages, but this article assumes that libhugetlbfs 2.7 is installed. The latest version can always be cloned from git using the following instructions:

  $ git clone git://libhugetlbfs.git.sourceforge.net/gitroot/libhugetlbfs/libhugetlbfs
  $ cd libhugetlbfs
  $ git checkout -b next origin/next
  $ make PREFIX=/usr/local

There is an install target that installs the library and all support utilities but there are install-bin, install-stat and install-man targets available in the event the existing library should be preserved during installation.

The library provides support for automatically backing text, data, heap and shared memory segments with huge pages. In addition, this package also provides a programming API and manual pages. The behaviour of the library is controlled by environment variables (as described in the libhugetlbfs.7 manual page) with a launcher utility hugectl that knows how to configure almost all of the variables. hugeadm, hugeedit and pagesize provide information about the system and provide support to system administration. tlbmiss_cost.sh automatically calculates the average cost of a TLB miss. cpupcstat and oprofile_start.sh provide help with monitoring the current behaviour of the system. Manual pages are available describing in further detail each utility.

2 Huge Page Fault Behaviour

In the following articles, there will be discussions on how different types of memory region can be created and backed with huge pages. One important common point between them all is how huge pages are faulted and when the huge pages are allocated. Further, there are important differences between shared and private mappings, depending on the exact kernel version used.

In the initial support for huge pages on Linux, huge pages were faulted at the same time as mmap() was called. This guaranteed that all references would succeed for shared mappings once mmap() returned successfully. Private mappings were safe only until fork() was called; once it was, it was important that the child call exec() as soon as possible, or that the huge page mappings be marked MADV_DONTFORK with madvise() in advance. Otherwise, a Copy-On-Write (COW) fault could result in application failure by either parent or child in the event of allocation failure.
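A minimal sketch of that MADV_DONTFORK precaution follows; the hugetlbfs mount point, file name and mapping size are illustrative assumptions, and error handling is kept to a bare minimum:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define LENGTH (16UL * 1024 * 1024)   /* assumed multiple of the huge page size */

    int main(void)
    {
        /* Assumed hugetlbfs mount point; real systems may mount it elsewhere. */
        int fd = open("/mnt/huge/example", O_CREAT | O_RDWR, 0600);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE, fd, 0);
        if (addr == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Keep these huge pages out of any child created by fork(), so that a
           failed copy-on-write allocation cannot take down parent or child. */
        if (madvise(addr, LENGTH, MADV_DONTFORK) != 0)
            perror("madvise(MADV_DONTFORK)");

        /* ... use the mapping; fork()/exec() helpers without worry ... */

        munmap(addr, LENGTH);
        close(fd);
        return 0;
    }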

Pre-faulting pages drastically increases the cost of mmap() and can perform sub-optimally on NUMA. Since 2.6.18, huge pages have been faulted in the same way as normal mappings: when the page is first referenced. To guarantee that faults would succeed, huge pages were reserved at the time a shared mapping was created, but private mappings made no reservations. This is unfortunate as it means an application can fail without fork() being called. libhugetlbfs handles the private mapping problem on old kernels by using readv() to make sure the mapping is safe to access, but this approach is less than ideal.

Since 2.6.29, reservations are made for both shared and private mappings. Shared mappings are guaranteed to successfully fault regardless of what process accesses the mapping.

For private mappings, the number of child processes is indeterminable, so only the process that creates the mapping with mmap() is guaranteed to successfully fault. When that process fork()s, two processes are now accessing the same pages. If the child performs COW, an attempt will be made to allocate a new huge page. If it succeeds, the fault successfully completes; if the allocation fails, the child is terminated with a message logged to the kernel log noting that there were insufficient huge pages. If it is the parent process that performs COW, an attempt will also be made to allocate a huge page. In the event that allocation fails, the child's pages are unmapped and the event recorded; the parent successfully completes the fault, but if the child then accesses the unmapped page, it will be terminated.

3 Huge Pages and Swap

There is no support for the paging of huge pages to backing storage.

4 Huge Pages and NUMA

On NUMA, memory can be local or remote to the CPU, with significant penalty incurred for remote access. By default, Linux uses a node-local policy for the allocation of memory at page fault time. This policy applies to both base pages and huge pages. This leads to an important consideration while implementing a parallel workload.

The thread processing some data should be the same thread that caused the original page fault for that data. A general anti-pattern on NUMA is for a parent thread to set up and initialise all of the workload's memory areas and then create threads to process the data. On a NUMA system this can result in some of the worker threads being on CPUs remote with respect to the memory they will access. While this applies to all NUMA systems regardless of page size, the effect can be pronounced on systems where the split between worker threads falls in the middle of a huge page, incurring more remote accesses than might otherwise have occurred.

This scenario may occur for example when using huge pages with OpenMP, because OpenMP does not necessarily divide its data on page boundaries. This could lead to problems when using base pages, but the problem is more likely with huge pages because a single huge page will cover more data than a base page, thus making it more likely any given huge page covers data to be processed by different threads. Consider the following scenario. A first thread to touch a page will fault the full page's data into memory local to the CPU on which the thread is running. When the data is not split on huge-page-aligned boundaries, such a thread will fault its data and perhaps also some data that is to be processed by another thread, because the two threads' data are within the range of the same huge page. The second thread will fault the rest of its data into local memory, but will still have part of its data accesses be remote. This problem manifests as large standard deviations in performance when doing multiple runs of the same workload with the same input data. Profiling in such a case may show there are more cross-node accesses with huge pages than with base pages. In extreme circumstances, the performance with huge pages may even be slower than with base pages. For this reason it is important to consider on what boundary data is split when using huge pages on NUMA systems.
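A rough sketch of the first-touch approach that avoids this anti-pattern is shown below; the buffer size and the trivial processing loop are arbitrary choices for illustration, and whether the memory is actually backed by huge pages depends on how it is allocated (not shown here). The key point is that the loop which first touches the data uses the same static schedule as the loop which later processes it, so each thread faults in, and therefore places locally, the pages it will work on:

    #include <stdlib.h>
    #include <omp.h>

    #define N (32UL * 1024 * 1024)   /* arbitrary illustrative element count */

    int main(void)
    {
        /* However the buffer is obtained (plain malloc() here; a huge-page-backed
           allocation in the real case), its placement is decided at first touch. */
        double *data = malloc(N * sizeof(double));
        if (data == NULL)
            return 1;

        /* Initialisation: each thread touches, and therefore faults in on its
           own node, the chunk it will later process. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < (long)N; i++)
            data[i] = 0.0;

        /* Processing: the same static schedule hands each thread the chunk it
           initialised, keeping accesses node-local. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < (long)N; i++)
            data[i] = data[i] * 2.0 + 1.0;

        free(data);
        return 0;
    }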

One workaround for this instance of the general problem is to use MPI in combination with OpenMP. The use of MPI allows division of the workload, with one MPI process per NUMA node. Each MPI process is bound to the list of CPUs local to a node. Parallelisation within the node is achieved using OpenMP, thus alleviating the issue of remote access.

5 Summary

In this article, the background to huge pages was introduced, along with what the performance benefits can be and some basics of how huge pages behave on Linux. The next article (to appear in the near future) discusses the interfaces used to access huge pages.


Details of publications referenced in these articles can be found in the bibliography at the end of Part 5.

This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the Defense Advanced Research Projects Agency.

Comments (18 posted)

Patches and updates

Kernel trees

Linus Torvalds: Linux 2.6.33-rc8

Architecture-specific

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Virtualization and containers

Benchmarks and bugs

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds