Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.39-rc3, released on April 11. Linus said:

It's been another almost spookily calm week. Usually this kind of calmness happens much later in the -rc series (during -rc7 or -rc8, say), but I'm not going to complain. I'm just still waiting for the other shoe to drop.

And it is possible that this really ended up being a very calm release cycle. We certainly didn't have any big revolutionary changes like the name lookup stuff we had last cycle. So I'm quietly optimistic that no shoe-drop will happen.

The short-form changelog is in the announcement, or see the full changelog for all the details.

Stable updates: no stable updates have been released in the last week. The 2.6.38.3 update is in the review process as of this writing; it can be expected sometime on or after April 14. The 2.6.33.10 and 2.6.32.37 updates are also in the review process, with an expected release on or after April 15.

Comments (none posted)

Quotes of the week

If we were going to use the SELinux reference policy, I would completely agree. However, looking at something that would focus more on the SELinux privacy controls would limit the complexity. You're correct that it might be easier to create an LSM that does exactly what we want. Because of MeeGo policies though, that would mean I would have to get it upstreamed first before we could use it and that would be problematic.
-- Ryan Ware

That field will contain internal state information which is not going to be exposed to anything outside the core code - except via accessor functions. I'm tired of everyone fiddling in irq_desc.status.

core_internal_state__do_not_mess_with_it is clear enough, annoying to type and easy to grep for. Offenders will be tracked down and slapped with stinking trouts.

-- Thomas Gleixner

Comments (none posted)

Collecting condolences for the family of David Brownell

As many LWN readers will have heard, long-time kernel developer David Brownell recently passed away. His contributions to the code are many, but it is clear that they were outweighed by his contributions to the community. He will be much missed.

A collection point has been set up for condolences to be passed on to David's family. People outside of our community are often not fully aware of the role a loved one plays in the community; this is a chance to let David's family know more about how many lives he touched and how valuable his work was. If you would like to share your memories of David, they may be sent to dbrownell-condolences@kernel.org; from there, they will be passed on to his family.

Comments (4 posted)

The native KVM tool

By Jonathan Corbet
April 12, 2011
The KVM subsystem provides native virtualization support in the Linux kernel. To that end, it provides a virtualized CPU and access to memory, but not a whole lot more; some other software component is needed to provide virtual versions of all the hardware (console, disk drives, network adapters, etc) that a kernel normally expects to find when it boots. With KVM, a version of the QEMU emulator is normally used to provide that hardware. While QEMU is stable and capable, it is not universally loved; a competitor has just come along that may not displace QEMU, but it may claim some of its limelight.

Just over one year ago, LWN covered an extended discussion about KVM, and, in particular, about the version of QEMU used by KVM. At that time, there were some suggestions that QEMU should be forked and brought into the kernel source tree; the idea was that faster and more responsive development would result. That fork never happened, and the idea seemed to fade away.

That idea is now back, in a rather different form, with Pekka Enberg's announcement of the "native KVM tool." In short, this tool provides a command (called kvm) which can substitute for QEMU - as long as nobody cares about most of the features provided by QEMU. The native tool is able to boot a kernel which can talk over a serial console. It lacks graphics support, networking, SMP support, and much more, but it can get to a login prompt when run inside a terminal emulator.

Why is such a tool interesting? There seem to be a few, not entirely compatible reasons. Replacing QEMU is a nice idea because, as Avi Kivity noted, "It's an ugly gooball". The kvm code - being new and with few features - is compact, clean, and easy to work with. Some developers have said that kvm makes debugging (especially for early-boot problems) easier, but others doubt that it can ever replace QEMU, with its extensive hardware emulation, in that role. There's also talk of moving kvm toward the paravirtualization model in the interest of getting top performance, but there is also resistance to doing anything which would make it unable to run native kernels.

Developers seem to like the idea of this project, and chances are that it will go somewhere even if it never threatens to push QEMU aside. There are a few complaints about the kvm name - QEMU already has a kvm command and the name is hard to search for anyway - but no alternative names seem to be in the running as of this writing. Regardless of its name, this project may be worth watching; it's clearly the sort of tool that people want to hack on.

Comments (13 posted)

Kernel development news

A JIT for packet filters

By Jonathan Corbet
April 12, 2011
The Berkeley Packet Filter (BPF) is a mechanism for the fast filtering of network packets on their way to an application. It has its roots in BSD in the very early 1990s, a history that was not enough to prevent the SCO Group from claiming ownership of it. Happily, that claim proved to be as valid as the rest of SCO's assertions, so BPF remains a part of the Linux networking stack. A recent patch from Eric Dumazet may make BPF faster, at least on 64-bit x86 systems.

The purpose behind BPF is to let an application specify a filtering function to select only the network packets that it wants to see. An early BPF user was tcpdump, which used BPF to implement the filtering behind its complex command-line syntax. Other packet capture programs also make use of it. On Linux, there is another interesting application of BPF: the "socket filter" mechanism allows an application to filter incoming packets on any type of socket with BPF. In this mode, it can function as a sort of per-application firewall, eliminating packets before the application ever sees them.
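As a rough illustration of how an application uses the socket filter mechanism (a minimal sketch, not code from the article), the following attaches a one-instruction "accept everything" BPF program to a UDP socket with SO_ATTACH_FILTER; a real filter would, of course, be more selective:

	#include <stdio.h>
	#include <sys/socket.h>
	#include <linux/filter.h>

	int main(void)
	{
		/* One-instruction program: return a large byte count,
		   i.e. accept the whole packet. */
		struct sock_filter code[] = {
			BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
		};
		struct sock_fprog prog = {
			.len = sizeof(code) / sizeof(code[0]),
			.filter = code,
		};
		int fd = socket(AF_INET, SOCK_DGRAM, 0);

		if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER,
			       &prog, sizeof(prog)) < 0)
			perror("SO_ATTACH_FILTER");

		/* From here on, only packets passed by the filter
		   are delivered to this socket. */
		return 0;
	}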

The original BPF distribution came in the form of a user-space library, but the BPF interface quickly found its way into the kernel. When network traffic is high, there is a lot of value in filtering unwanted packets before they are copied into user space. Obviously, it is also important that BPF filters run quickly; every bit of per-packet overhead is going to hurt in a high-traffic situation. BPF was designed to allow a wide variety of filters while keeping speed in mind, but that does not mean that it cannot be made faster.

BPF defines a virtual machine which is almost Turing-machine-like in its simplicity. There are two registers: an accumulator and an index register. The machine also has a small scratch memory area, an implicit array containing the packet in question, and a small set of arithmetic, logical, and jump instructions. The accumulator is used for arithmetic operations, while the index register provides offsets into the packet or into the scratch memory areas. A very simple BPF program (taken from the 1993 USENIX paper [PDF]) might be:

	ldh	[12]
	jeq	#ETHERTYPE_IP, l1, l2
    l1:	ret	#TRUE
    l2:	ret	#0

The first instruction loads a 16-bit quantity from offset 12 in the packet to the accumulator; that value is the Ethernet protocol type field. It then compares the value to see if the packet is an IP packet or not; IP packets are accepted, while anything else is rejected. Naturally, filter programs get more complicated quickly. Header length can vary, so the program will have to calculate the offsets of (for example) TCP header values; that is where the index register comes into play. Scratch memory (which is the only place a BPF program can store to) is used when intermediate results must be kept.
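For reference, here is the same filter hand-translated (not from the article) into the struct sock_filter array form that the kernel actually consumes, using the BPF_STMT() and BPF_JUMP() macros from <linux/filter.h>; jump offsets are relative to the next instruction:

	#include <linux/filter.h>
	#include <linux/if_ether.h>

	static struct sock_filter ip_only[] = {
		/* ldh [12]: load the 16-bit Ethernet type field. */
		BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 12),
		/* jeq #ETHERTYPE_IP: fall through if IP, else skip one. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ETH_P_IP, 0, 1),
		/* ret #TRUE: accept the whole packet. */
		BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
		/* ret #0: reject the packet. */
		BPF_STMT(BPF_RET | BPF_K, 0),
	};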

The Linux BPF implementation can be found in net/core/filter.c; it provides "standard" BPF along with a number of Linux-specific ancillary instructions which can test whether a packet is marked, which CPU the filter is running on, which interface the packet arrived on, and more. It is, at its core, a long switch statement designed to run the BPF instructions quickly. This code has seen a number of enhancements and speed improvements over the years, but there has not been any fundamental change for a long time.
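The core of that switch statement looks roughly like the following (a heavily simplified sketch, not the actual code; the real sk_run_filter() handles the full instruction set, bounds-checks packet accesses, and implements the ancillary instructions):

	static unsigned int run_filter(const struct sk_buff *skb,
				       const struct sock_filter *insns,
				       int len)
	{
		unsigned int A = 0;	/* accumulator; the index register
					   is omitted in this sketch */
		int pc;

		for (pc = 0; pc < len; pc++) {
			const struct sock_filter *f = &insns[pc];

			switch (f->code) {
			case BPF_LD | BPF_H | BPF_ABS:
				A = get_unaligned_be16(skb->data + f->k);
				break;
			case BPF_JMP | BPF_JEQ | BPF_K:
				pc += (A == f->k) ? f->jt : f->jf;
				break;
			case BPF_RET | BPF_K:
				return f->k;
			/* ... the many remaining opcodes ... */
			}
		}
		return 0;
	}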

Eric Dumazet's patch is a fundamental change: it puts a just-in-time compiler into the kernel to translate BPF code directly into the host system's assembly code. The simplicity of the BPF machine makes the JIT translation relatively simple; every BPF instruction maps to a straightforward x86 instruction sequence. There are a few assembly-language helpers used to implement the virtual machine's semantics; the accumulator and index register are simply kept in processor registers. The resulting program is placed in a bit of vmalloc() space and run directly when a packet is to be tested. A simple benchmark shows a 50ns savings for each invocation of a simple filter - that may seem small, but at one million packets per second it works out to 50ms of CPU time saved every second, so the difference can add up quickly.

The current implementation is limited to the x86-64 architecture; indeed, that architecture is wired deeply into the code, which is littered with hard-coded x86 instruction opcodes. Should anybody want to add a second architecture, they will be faced with the choice of simply replicating the whole thing (it is not huge) or trying to add a generalized opcode generator to the existing JIT code.

An obvious question is: can this same approach be applied to iptables, which is more heavily used than BPF? The answer may be "yes," but it might also make more sense to bring back the nftables idea, which is built on a BPF-like virtual machine of its own. Given that there has been some talk of using nftables in other contexts (internal packet classification for packet scheduling, for example), the value of a JIT-translated nftables could be even higher. Nftables is a job for another day, though; meanwhile, we have a proof of concept for BPF that appears to get the job done nicely.

Comments (19 posted)

Explicit block device plugging

April 13, 2011

This article was contributed by Jens Axboe

Since the dawn of time, or for at least as long as I have been involved, the Linux kernel has deployed a concept called "plugging" on block devices. When I/O is queued to an empty device, that device enters a plugged state. This means that I/O isn't immediately dispatched to the low-level device driver; instead, it is held back by this plug. When a process is going to wait on the I/O to finish, the device is unplugged and request dispatching to the device driver is started. The idea behind plugging is to allow a buildup of requests to better utilize the hardware and to allow merging of sequential requests into one single larger request. The latter is an especially big win on most hardware; writing or reading bigger chunks of data at a time usually yields good improvements in bandwidth. With the release of the 2.6.39-rc1 kernel, block device plugging was drastically changed. Before we go into that, let's take a historical look at how plugging has evolved.

Back in the early days, plugging a device involved global state. This was before SMP scalability was an issue, and having global state made it easier to handle the unplugging. If a process was about to block for I/O, any plugged device was simply unplugged. This scheme persisted in pretty much the same form until the early versions of the 2.6 kernel, where it began to severely impact SMP scalability on I/O-heavy workloads.

In response to this problem, the plug state was turned into a per-device entity in 2004. This scaled well, but now you suddenly had no way to unplug all devices when going to sleep waiting for page I/O. This meant that the virtual memory subsystem had to be able to unplug the specific device that would be servicing page I/O. A special hack was added for this: sync_page() in struct address_space_operations; this hook would unplug the device of interest.

In a more complicated I/O setup with device mapper or RAID components, those layers would in turn unplug any lower-level devices; the unplug event would thus percolate down the stack. Some heuristics were also added to auto-unplug the device if a certain depth of requests had been added, or if some period of time had passed before the unplug event was seen. With the asymmetric nature of plugging, where the device was automatically plugged but had to be explicitly unplugged, we've had our fair share of I/O stall bugs in the kernel. While crude, the auto-unplug would at least ensure that we would chug along if someone missed an unplug call after I/O submission.

With really fast devices hitting the market, once again plugging had become a scalability problem and hacks were again added to avoid this. Essentially we disabled plugging on solid-state devices that were able to do queueing. While plugging originally was a good win, it was time to reevaluate things. The asymmetric nature of the API was always ugly and a source of bugs, and the sync_page() hook was always hated by the memory management people. The time had come to rewrite the whole thing.

The primary use of plugging was to allow an I/O submitter to send down multiple pieces of I/O before handing it to the device. Instead of maintaining these I/O fragments as shared state in the device, a new on-stack structure was created to contain this I/O for a short period, allowing the submitter to build up a small queue of related requests. The state is now tracked in struct blk_plug, which is little more than a linked list and a should_sort flag informing blk_finish_plug() whether or not to sort this list before flushing the I/O. We'll come back to that later.

	struct blk_plug {
		unsigned long magic;
		struct list_head list;
		unsigned int should_sort;
	};

The magic member is a temporary addition to detect uninitialized use cases; it will eventually be removed. The new API is straightforward and simple to use:

	struct blk_plug plug;

	blk_start_plug(&plug);
	submit_batch_of_io();
	blk_finish_plug(&plug);

blk_start_plug() takes care of initializing the structure and tracking it inside the task structure of the current process. The latter is important to be able to automatically flush the queued I/O should the task end up blocking between the call to blk_start_plug() and blk_finish_plug(). If that happens, we want to ensure that pending I/O is sent off to the devices immediately. This is important from a performance perspective, but also to ensure that we don't deadlock. If the task is blocking for a memory allocation, memory management reclaim could end up wanting to free a page belonging to a request that is currently residing on our private plug. Similarly, the caller may itself end up waiting for some of the plugged I/O to finish. By flushing this list when the process goes to sleep, we avoid these types of deadlocks.

If blk_start_plug() is called and the task already has a plug structure registered, it is simply ignored. This can happen in cases where the upper layers plug for submitting a series of I/O, and further down in the call chain someone else does the same. I/O submitted without the knowledge of the original plugger will thus end up on the originally assigned plug, and be flushed whenever the original caller ends the plug by calling blk_finish_plug(), or if some part of the call path goes to sleep or is scheduled out.
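In code, the nested case described above looks roughly like this (a sketch; submit_some_io() and submit_more_io() are placeholders for real submission paths):

	struct blk_plug outer;

	blk_start_plug(&outer);		/* current task now has a plug */
	submit_some_io();

	{
		struct blk_plug inner;

		/* The task already has a plug, so this one is ignored;
		   I/O submitted here still collects on "outer". */
		blk_start_plug(&inner);
		submit_more_io();
		blk_finish_plug(&inner);	/* nothing flushed here */
	}

	blk_finish_plug(&outer);	/* all of the above I/O goes out now */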

Since the plug state is now device agnostic, we may end up in a situation where multiple devices have pending I/O on this plug list. These may end up on the plug list in an interleaved fashion, potentially causing blk_finish_plug() to grab and release the related queue locks multiple times. To avoid this problem, a should_sort flag in the blk_plug structure is used to keep track of whether we have I/O belonging to more than one distinct queue pending. If we do, the list is sorted to group requests for the same queue together. This scales better than grabbing and releasing the same locks multiple times.

With this new scheme in place, the device need no longer be notified of unplug events. The queue unplug_fn() hook existed for this purpose alone; it has now been removed. For most drivers it is safe to just remove this hook and the related code. However, some drivers used plugging to delay I/O operations in response to resource shortages. One example of that was the SCSI midlayer; if we failed to map a new SCSI request due to a memory shortage, the queue was plugged to ensure that we would call back into the dispatch functions later on. Since this mechanism no longer exists, a similar API has been provided for such use cases. Drivers may now use blk_delay_queue() for this:

	blk_delay_queue(queue, delay_in_msecs);

The block layer will re-invoke request queueing after the specified number of milliseconds have passed. It will be invoked from process context, just as it would have been with the unplug event. blk_delay_queue() honors the queue stopped state, so if blk_stop_queue() was called before blk_delay_queue(), or if it is called after the fact but before the delay has passed, the request handler will not be invoked. blk_delay_queue() should only be used for conditions where the caller doesn't necessarily know when the condition will change. If resources internal to the driver cause it to halt operations for a while, it is more efficient to use blk_stop_queue() and blk_start_queue() to manage those transitions directly.
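As an illustration, a driver's request function might use it as follows (a hypothetical driver; my_cmd, my_alloc_cmd() and my_dispatch() are placeholders, not real kernel interfaces):

	static void my_request_fn(struct request_queue *q)
	{
		struct request *rq;

		while ((rq = blk_fetch_request(q)) != NULL) {
			struct my_cmd *cmd = my_alloc_cmd(rq);

			if (!cmd) {
				/* Out of internal resources, with no way to
				   know when they will return: put the request
				   back and ask to be called again in 3 ms. */
				blk_requeue_request(q, rq);
				blk_delay_queue(q, 3);
				return;
			}
			my_dispatch(cmd);
		}
	}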

These changes have been merged for the 2.6.39 kernel. While a few problems have been found (and fixed), it would appear that the plugging changes have been integrated without greatly disturbing Linus's calm development cycle.

Comments (10 posted)

LFCS: ARM, control groups, and the next 20 years

By Jake Edge
April 13, 2011

The recently held Linux Foundation Collaboration Summit (LFCS) had its traditional kernel panel on April 6 at which Andrew Morton, Arnd Bergmann, James Bottomley, and Thomas Gleixner sat down to discuss the kernel with moderator Jonathan Corbet. Several topics were covered, but the current struggles in the ARM community were clearly at the forefront of the minds of participants and audience members alike.

[Kernel panel]

Each of the kernel hackers introduced themselves, some with tongue planted firmly in cheek, such as Bottomley with a declaration that he was on the panel "to meet famous kernel developers", and Morton who said he spent most of his time trying to figure out what the other kernel hackers are doing to the memory management subsystem. Bergmann was a bit modest about his contributions, so Gleixner pointed out that Bergmann had done the last chunk of work required to remove the big kernel lock, which was greeted with a big round of applause. For his part, Gleixner was a bit surprised to find out that he manages bug reports for NANA flash (based on a typo on the giant slides on either side of the stage), but noted that he specialized in "impossible tasks" like getting the realtime preemption patches into the mainline piecewise.

There is a "high-level architectural issue" that Corbet wanted the panel to tackle first, and that was the current problems in the ARM world. It is "one of our more important architectures", he said, without which we wouldn't have all these different Android phones to play with. So, it is "discouraging to see that there is a mess" in the ARM kernel community right now. What's the situation, he asked, and how can we improve things?

For a long time, the problem in the ARM community was convincing system-on-chip (SoC) and board vendors to get their code upstream, Bergmann said, but now there is a new problem in that they all "have their own subtrees that don't work very well together". Each of those trees is going its own way, which means that core and driver code gets copied "five times or twenty times" into different SoC trees.

Corbet asked how the kernel community can do better with respect to ARM. Gleixner noted that ARM maintainer Russell King tries to push back on bad code coming in, "but he simply doesn't scale". There are 70 different sub-architectures and 500 different SoCs in the ARM tree, he said. In addition, "people have been pushing sub-arch trees directly to Linus", Bergmann said, so King does not have any control over those. It is a consequence of the "tension between cleanliness and time-to-market", Bottomley said.

Gleixner thinks that the larger kernel community should be providing the ARM vendors with "proper abstractions" and that because of a lack of a big picture view, those vendors cannot be expected to come up with those themselves. By and large the ARM vendor community has a different mindset that comes from other operating systems where changes to the core code were impossible, so 500 line workarounds in drivers were the norm. Bergmann suggested that the vendors get the code reviewed and upstream before shipping products with that code. Morton said that as the "price of admission" vendors need to be asked to maintain various pieces horizontally across the ARM trees. Actually motivating them to do that is difficult, he said.

From the audience, Wolfram Sang asked whether more code review for the ARM patches would help. All agreed that more code review is good, but Bottomley expressed some reservations because there are generally only a few reviewers that a subsystem maintainer can trust to spot important issues, so all code review is not created equal. Morton suggested a "review economy" where one patch submitter needs to review the code of another and vice versa. That would allow developers to justify the time spent reviewing code to their managers. But, Bottomley said, "collaborating with competitors" is a hard concept for organizations that are new to open source development.

If a driver looks like one that is already in the tree, it should not be merged; instead, someone needs to get the developers to work with the existing driver, Bergmann said. There is a lot of reuse of IP blocks in SoCs, but the developers aren't aware of it because different teams work on the different SoCs, Gleixner said. The kernel community needs people who can figure that out, he said. Bottomley observed that "the first question should be: did anyone do it before and can I 'steal' it?".

In response to an audience question about the panel's thoughts on Linaro, Bergmann, who works with Linaro, said "I think it's great" with a smile. He went on to say that Linaro is doing work that is closely related to the ARM problems that had been discussed. Getting different SoC vendors to work together is a big part of what Linaro is doing, and that "everyone suffers" if that collaboration doesn't happen. "ARM is one of the places where it [collaboration] is needed most", he said.

Control groups

The discussion soon shifted to control groups, with Corbet noting that they are becoming more pervasive in the kernel, but that lots of kernel hackers hate them. It will soon be difficult to get a distribution to boot and run without control groups, he said, and wondered if adding them to the kernel was the right move: "did we make a mistake?" Gleixner said that there is nothing wrong with control groups conceptually, "just that the code is a horror". Bottomley lamented the code that is getting "grafted onto the side of control groups" as each resource in the kernel that is getting controlled requires reaching into multiple subsystems in rather intrusive ways.

As with "everything that sucks" in the kernel, control groups needs to be cleaned up by someone who looks at it from a global perspective; that person will have to "reimplement it and radically modify it", Gleixner said. That is difficult to do because it is both a technical and a political problem, Bottomley said. The technical part is to get the interaction right, while the political part is that it is difficult to make changes across subsystem boundaries in the kernel.

But Morton said that he hadn't seen much in the way of specific complaints about control groups cross his desk. Conceptually, it extends what an operating system should do in terms of limiting resources. "If it's messy, it's because of how it was developed" on top of a production kernel that gets updated every three months. Bottomley said that the problem with doing cross-subsystem work is often just a matter of communication, but it also requires someone to take ownership and talk to all of the affected subsystems rather than just picking the "weakest subsystem" and getting changes in through there.

Corbet wondered if the independence of subsystems in the kernel, something that was very helpful in allowing its development to scale, was changing. The panel seemed to think there wasn't much of an issue there; while control groups cross a lot of subsystem boundaries, it would be hard to name five other things like that in the kernel, as Bottomley pointed out.

Twenty years ahead

With the 20th anniversary of Linux being celebrated this year, Jon Masters asked from the audience what things would be like 20 years from now. Bottomley promptly replied that four-fifths of the panel would be retired, but Gleixner expected that the 2038 bug would have brought them all back out of retirement. Morton said that unless some kind of quantum computer came along to make Linux obsolete, it would still be there in 20 years. He also expected that the first thing to be done with any new quantum computer would be to add an x86 emulation layer.

When Corbet posited that perhaps the realtime preemption code would be merged by then, Gleixner made one of his strongest predictions yet for merging that code: "I am planning to be done with it before I retire". More seriously, he said that it is on a good track, that he has talked to the relevant subsystem maintainers, and that he is optimistic about getting it all merged—eventually.

In 20 years, the kernel will still be supporting the existing user-space interfaces, Corbet said. He quoted Morton from a recent kernel mailing list post: "Our hammer is kernel patches and all problems look like nails", and wondered whether there was a problem with how the kernel hackers developed user-space interfaces. Morton noted that the quote was about doing more pretty-printing inside the kernel, something he is generally opposed to. It was done in the past because it was difficult for the kernel hackers to ship user-space code that would stay in sync with kernel changes. But perf has demonstrated that the kernel can ship user-space code, which could be a way forward.

Gleixner noted that there was quite a bit of resistance to shipping perf, but that it worked out pretty well as a way to "keep the strict connection between the kernel and user space". Perf is meant to be a simple tool to allow users to try out perf events gathering, he said, and that people are building more full-blown tools on top of perf. Having tools shipped with the kernel allows more freedom to experiment with the ABI, Bottomley said. Morton said that there needs to be a middle ground, noting that Google had a patch that exported a procfs file that contained a shell script inside.

Ingo Molnar recently pointed out that FreeBSD is getting Linux-like quality with a much smaller development community and suggested that it was because the user space and kernel are developed together. Corbet asked whether Linux was holding itself back by not taking that route. Bottomley thought that Molnar was "both right and wrong", and that FreeBSD has an entire distribution in its kernel tree. "I hope Linux never gets to that", he said.

From perf to control groups, FreeBSD to ARM, as usual, the panel ranged over a number of topics in the hour allotted. The format and participants vary from year to year, but it is always interesting to hear what kernel developers are thinking about issues that Linux is facing.

Comments (22 posted)

Patches and updates

Kernel trees

Linus Torvalds	Linux 2.6.39-rc3
Thomas Gleixner	2.6.33.9-rt31

Architecture-specific

Build system

Core kernel code

Device drivers

Filesystems and block I/O

Memory management

Networking

Page editor: Jonathan Corbet


Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds