
Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.39-rc4, released on April 18. According to Linus:

So things have sadly not continued to calm down even further. We had more commits in -rc4 than we had in -rc3, and I sincerely hope that upward trend doesn't continue.

That said, so far the only thing that has really caused problems this release cycle has been the block layer plugging changes, and as of -rc4 the issues we had with MD should hopefully now be behind us. So we're making progress on that front too.

The short-form changelog is in the announcement, or see the full changelog for all the details.

Stable updates: 2.6.34.9 was released on April 17, 2.6.32.37 and 2.6.33.10 were released on April 15 (and quickly followed by 2.6.32.38 and 2.6.33.11 to fix a problem with the RDS network protocol), and 2.6.38.3 was released on April 14.

The 2.6.32.39, 2.6.33.12, and 2.6.38.4 updates are in the review process as of this writing; they can be expected on or after April 21.

Comments (none posted)

Quote of the week

There really are only two acceptable models of development: "think and analyze" or "years and years of testing on thousands of machines". Those two really do work.
-- Linus Torvalds

Comments (none posted)

TI introduces OpenLink

Texas Instruments has announced the delivery of a mobile-grade, battery-optimized Wi-Fi solution to the open source Linux community as part of the OpenLink project. "OpenLink wireless connectivity drivers attach to open source development platforms such as BeagleBoard, PandaBoard and other boards. Whether working with Android, MeeGo or other Linux-based distributions, developers can now access code natively as part of their kernel builds to introduce the latest low-power wireless connectivity solution into their products. Additionally, community support and resources are available 24/7 via the active OpenLink community."

Comments (6 posted)

MIPS Technologies Launches New Developer Community

MIPS Technologies has announced the launch of its new developer community developer.mips.com. "The new site, which is live now, is specifically tailored to the needs of software developers working with the Android(TM) platform, Linux operating system and other applications for MIPS-Based(TM) hardware. All information and resources on the site are openly accessible."

Full Story (comments: 17)

DISCONTIGMEM, !NUMA, and SLUB

By Jonathan Corbet
April 20, 2011
The kernel has two different ways of dealing with systems where there are large gaps in the physical memory address space - DISCONTIGMEM and SPARSEMEM. Of those two, DISCONTIGMEM is the older; it has been semi-deprecated for some time and appears to be on its (slow) way out. But some architectures still use it. Recent changes (and the resulting crashes) have shown that there are some interesting misunderstandings about how DISCONTIGMEM is handled in the kernel.

The problem comes down to this: DISCONTIGMEM tracks separate ranges of memory by putting each into its own virtual NUMA node. The result is that a system running in this mode can appear to have multiple NUMA nodes, even if NUMA support is not configured in. That apparently works well much of the time, but it recently has been shown to cause crashes in the SLUB allocator, which is not prepared for the appearance of multiple NUMA nodes on a non-NUMA system.

There was a surprisingly acrimonious discussion on just whose fault this misunderstanding is and how to fix it. Options include changing DISCONTIGMEM so that it no longer "abuses" (in some people's view) the NUMA concept in this way; that might be a long-term solution, but the bug exists now and, as James Bottomley put it: "That has to be fixed in -stable. I don't really think a DISCONTIGMEM re-engineering effort would be the best thing for the -stable series." Another option is to force NUMA support to be configured in when DISCONTIGMEM is used; that could bloat the kernel on embedded systems and requires acceptance of the strange concept that uniprocessor systems can be NUMA. The kernel could be fixed to handle non-zero NUMA nodes at all times; that could involve a significant code audit, as the problems might not be limited to the SLUB allocator. The SLUB allocator could be disallowed on non-NUMA DISCONTIGMEM systems but, once again, there may be issues elsewhere. Or the process of escorting DISCONTIGMEM out of the kernel could be expedited - though that would not be suitable for the stable series.
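The mismatch can be modeled with a short userspace sketch (Python, purely illustrative; the function names and address ranges are invented, not kernel code): DISCONTIGMEM-style code hands out pages tagged with non-zero "node" numbers, while a !NUMA allocator assumes node 0 everywhere.

```python
# Illustrative model (not kernel code): DISCONTIGMEM represents each
# discontiguous memory range as its own "node", even on !NUMA systems.

MEMORY_RANGES = [(0x00000000, 0x0fffffff),   # node 0
                 (0x40000000, 0x4fffffff)]   # node 1 - a gap precedes it

def page_to_nid(addr):
    """Map a physical address to the virtual node holding it."""
    for nid, (start, end) in enumerate(MEMORY_RANGES):
        if start <= addr <= end:
            return nid
    raise ValueError("address in a memory hole")

def naive_slab_lookup(addr, per_node_caches):
    """A !NUMA allocator that wrongly assumes everything is node 0."""
    # The bug modeled here: indexing with a hard-coded node 0 while
    # the page actually belongs to node 1 - the analogue of the
    # crashes seen in SLUB.
    return per_node_caches[0]

caches = {0: "cache-node0"}       # !NUMA build: only node 0 exists
nid = page_to_nid(0x40001000)     # -> 1, surprising a !NUMA allocator
```

The sketch only shows the shape of the disagreement: memory above the gap reports a node number that the non-NUMA side of the kernel never expects to see.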

As of this writing the discussion continues; it's not clear what form the real solution will take. The problem is subtle and there do not appear to be any easy fixes at hand.

Comments (none posted)

Kernel development news

Rationalizing the ARM tree

By Jonathan Corbet
April 19, 2011
The kernel's ARM architecture support is one of the fastest-moving parts of a project which, as a whole, is anything but slow. Recent concerns about the state of the code in the ARM tree threaten to slow things down considerably, though, with some developers now worrying in public that support for new platforms could be delayed indefinitely. The situation is probably not that grim, but some changes will certainly need to be made to get ARM development back on track.

Top-level ARM maintainer Russell King recently looked at the ARM patches in linux-next and was not pleased with what he saw. About 75% of all the architecture-specific changes in linux-next were for the ARM architecture, and those changes add some 6,000 lines of new code. Some of this work is certainly justified by the fact that the appearance of new ARM-based processors and boards is a nearly daily event, but it is still problematic in an environment where there have been calls for the ARM code to shrink. So, Russell suggested: "Please take a moment to consider how Linus will react to this at the next merge window."

As it turns out, relatively little consideration was required; Linus showed up and told the ARM developers what to expect:

Hint for anybody on the arm list: look at the dirstat that rmk posted, and if your "arch/arm/{mach,plat}-xyzzy" shows up a lot, it's quite possible that I won't be pulling your tree unless the reason it shows up a lot is because it has a lot of code removed.

People need to realize that the endless amounts of new pointless platform code is a problem, and since my only recourse is to say "if you don't seem to try to make an effort to fix it, I won't pull from you", that is what I'll eventually be doing.

Exactly when I reach that point, I don't know.

A while back, most of the ARM subplatform maintainers started managing their own trees and sending pull requests directly to Linus. It was a move that made some sense; the size and diversity of the ARM tree makes it hard for a single top-level maintainer to manage everything. But it has also led to a situation where there seems to be little overall control, and that leads to a lot of duplicated code. As Arnd Bergmann put it:

Right now, every subarchitecture in arm implements a number of drivers (irq, clocksource, gpio, pci, iommu, cpufreq, ...). These drivers are frequently copies of other existing ones with slight modifications or (worse) actually are written independently for the same IP blocks. In some cases, they are copies of drivers for stuff that is present in other architectures.

The obvious solution to the problem is to pull more of the code out of the subplatforms, find the commonalities, and eliminate the duplications. It is widely understood that a determined effort along these lines could reduce the amount of code in the ARM tree considerably while simultaneously making it more generally useful and more maintainable. Some work along these lines has already begun; some examples include Thomas Gleixner's work to consolidate interrupt chip drivers, Rafael Wysocki and Kevin Hilman's work to unify some of the runtime power management code, and Sascha Hauer's "sanitizing crazy clock data files" patch.

Some of the ongoing work could benefit architectures beyond ARM as well. It has been observed, for example, that most GPIO drivers tend to look a lot alike. There are, after all, only so many ways that even the most imaginative hardware designers can come up with to control a wire with a maximum of two or three states. The kernel has an unbelievable number of GPIO drivers; if most of them could be reduced to declarations of which memory-mapped I/O bits need to be twiddled to read or change the state of the line, quite a bit of code could go away.
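The "declare the bits, not the driver" idea can be sketched in userspace Python (the class, register layout, and offsets are all invented for illustration): a GPIO driver reduces to a small table of register offsets and a bit position, here operating on a dict standing in for memory-mapped registers.

```python
# Illustrative sketch: a "generic" GPIO driver reduced to a
# declaration of which memory-mapped bits to twiddle. The register
# layout and names are invented.

class MMIOGpio:
    def __init__(self, regs, data_off, dir_off):
        self.regs = regs          # stands in for ioremap()ed registers
        self.data_off = data_off  # offset of the data register
        self.dir_off = dir_off    # offset of the direction register

    def direction_output(self, line, value):
        self.regs[self.dir_off] |= (1 << line)
        self.set(line, value)

    def set(self, line, value):
        if value:
            self.regs[self.data_off] |= (1 << line)
        else:
            self.regs[self.data_off] &= ~(1 << line)

    def get(self, line):
        return (self.regs[self.data_off] >> line) & 1

regs = {0x0: 0, 0x4: 0}           # fake register file
gpio = MMIOGpio(regs, data_off=0x0, dir_off=0x4)
gpio.direction_output(3, 1)
```

Everything chip-specific is in the two offsets passed to the constructor; that is the sense in which dozens of near-identical drivers could collapse into per-chip declarations.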

There is also talk of reorganizing the ARM tree so that most drivers no longer live in subplatform-specific directories. Once all of the drivers of a specific type can be found in the same place, it will be much easier to find duplicates and abstract out common functionalities.

All of this work takes time, though, and the next merge window is due to open in less than two months. Any work which is to be merged for 2.6.40 needs to be in a nearly-complete state by now; most of the work that satisfies that criterion will be business as usual: adding new platforms, boards, and drivers. Russell worries that this work is now unmergeable:

Will we ever be able to put John's code in the kernel? Honestly, I have no idea. What I do know is that unless we start doing something to solve the problem we have today with the quantity of code under arch/arm _and_ the constant churn of that code, we will _never_ be able to add new platform support in any shape or form to the kernel.

Russell has an occasional tendency toward drama that might cause readers to discount the above, but he's not alone in these worries. Mark Brown is concerned that ARM development will come to a halt for the next several months; he also has expressed doubts about the whole idea that the ARM tree must shrink before it can be allowed to grow again:

What we're telling people to do is work on random improvements to more or less tangentially related code. This doesn't seem entirely reasonable and is going to be especially offputting for new contributors (like the people trying to submit new platforms, many of them will be new to mainline work) as it's a pretty big jump to start working on less familiar code when you're still trying to find your feet and worried about stepping on people's toes or breaking things, not to mention justifying your time to management.

If these fears hold true, we could be looking at a situation where the kernel loses much of its momentum - both in support for new hardware and in getting more contributions from vendors. The costs of such an outcome could be quite high; it is not surprising that people are concerned.

In the real world, though, such an ugly course of events seems unlikely. Nobody expects the ARM tree to be fixed by the 2.6.40 merge window; even Linus, for all his strongly-expressed opinions, is not so unreasonable. Indeed, he is currently working on a patch to git to make ARM cleanup work not look so bad in the statistics. What is needed in the near future is not a full solution; it's a clear signal that the ARM development community is working toward that solution. Some early cleanup work, some pushback against the worst offenses, and a plan for following releases should be enough to defer the Wrath Of Linus for another development cycle. As long as things continue to head in the right direction thereafter, it should be possible to keep adding support for new hardware.

Observers may be tempted to view this whole episode as a black mark for the kernel development community. How can we run a professional development project if this kind of uncertainty can be cast over an entire architecture? What we are really seeing here, though, is an example of how the community tries to think for the long term. Cramming more ARM code into the kernel will make some current hardware work now, but, in the long term, nobody will be happy if the kernel collapses under its own weight. With luck, some pushback now will help to avoid much more significant problems some years down the line. Those of us who plan to still be working on (and using) Linux then will benefit from it.

Comments (5 posted)

ELC: Linaro power management work

By Jake Edge
April 20, 2011

There was a large Linaro presence at this year's Embedded Linux Conference with speakers from the organization reporting on its efforts to consolidate functionality from the various ARM architecture trees. One of those talks was by Amit Kucheria, technical lead for the power management working group (PMWG), who talked about what the working group has been doing since it began. That includes some work on tools like powertop, and the newly available PowerDebug, as well as some consolidation within the kernel tree. He also highlighted areas where Linaro plans to focus its efforts in the future.

Kucheria started with a look at what Linaro is trying to accomplish, part of which is to "take the good things in the BSP [board support package] trees and get them upstream". In addition, consolidating the kernel source, so that there is one kernel tree that can be used by all of the Linaro partners, is high on the list. There is a fair amount of architecture consolidation that is part of that, including things like reducing the "ten or twenty memcpy() functions" to one version optimized for all of the ARM processors. All of that work should result in patches that get sent upstream.

The PMWG has "existed for six to eight months now", Kucheria said, and has been focused on consolidation and tools. There has been a bit of kernel work, which includes ensuring that the clock tree is exported in the right place in debugfs for the five systems-on-chip (SoCs) that Linaro and its sponsors/partners have targeted (Freescale i.MX51, TI OMAP 3 and 4, Samsung Orion, and ST-Ericsson UX8500). In addition, work was done on cpufreq, cpuidle, and CPU hotplug for some of them. Some of that work is still in progress, but most of it has gone (or is working its way) upstream, he said.

[PowerDebug]

Beyond kernel work, the group has been working on tools, starting with getting powertop to work with ARM CPUs and pushing that work upstream. A new tool, PowerDebug, has been created to help look at the clock tree to see "what clocks are on, which are active, and at what frequency", Kucheria said. It also shows power regulators that have registered with the regulator framework by pulling information from sysfs. It shows which regulators are on and what voltages are currently being used. Other SoCs or architectures can use PowerDebug simply by exporting their clock tree into debugfs.
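What a PowerDebug-style tool does with the exported clock tree can be sketched as follows (Python; the tree here is a nested dict standing in for the debugfs directory layout, and the clock names and rates are invented):

```python
# Illustrative sketch: walk a clock tree as exported in debugfs and
# report which clocks are enabled and at what rate. The dict stands
# in for the debugfs hierarchy; names and rates are invented.

CLOCK_TREE = {
    "osc": {"rate": 26_000_000, "enabled": True, "children": {
        "mpu_clk":  {"rate": 600_000_000, "enabled": True,
                     "children": {}},
        "uart_clk": {"rate": 48_000_000, "enabled": False,
                     "children": {}},
    }},
}

def active_clocks(tree, prefix=""):
    """List (path, rate) for every enabled clock in the tree."""
    out = []
    for name, node in tree.items():
        path = f"{prefix}/{name}"
        if node["enabled"]:
            out.append((path, node["rate"]))
        out.extend(active_clocks(node["children"], path))
    return out
```

The real tool reads the same information from debugfs (and regulator state from sysfs), which is why any architecture that exports its clock tree there gets PowerDebug support essentially for free.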

PMWG has also been experimenting with thermal management and hotplug. In particular, it has been looking at what policies make sense when the CPU temperature gets too high. One possibility would be to hot-unplug a core to reduce the amount of heat generated. There is some inherent latency in plugging or unplugging a core, he said, which can range from 40-50ms in a simple case to several seconds if there are a lot of threads running. There is a notification chain that causes the latency, so it's possible that could be reduced by various means.

Complexity in power management

[Diagram]

With a slide showing the complexity of Linux power management (shown at right) today, Kucheria launched into a description of some of the problems that OEMs are faced with when trying to tune products for good battery life. In that diagram, he noted, there are "six or seven different knobs that you can twiddle" to adjust power usage. Those OEMs simply don't have the resources to deal with that complexity; some kind of simplification is required. In addition, the complexity is growing with more and more SoCs along with different power management schemes in the hardware.

In the "good old days" of five or six years ago, the OMAP 1 just used the Linux driver model suspend hooks to change the clock frequency. The clock framework was standard back then, but now there are 30 or 40 different clock frameworks in the ARM tree. CPU frequency scaling (cpufreq) was added after that, but it doesn't take into account the bus or coprocessor frequencies. Later on, several different frameworks were added, including the regulator framework, cpuidle to control idle states, and power management quality of service (pm_qos).

The quality of service controls are important for devices that need to bound the latency for coming out of idle states, for example for network drivers that cannot tolerate more than 300ms of latency. The cpuidle framework introduced some problems, though, Kucheria said, because it was created by Intel, which concentrated on its own platforms. The C-states (C0-C6) don't really exist for ARM processors, and various vendors have interpreted them differently for particular SoCs. In addition, some have added extra states (C7, C8).
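The interaction between pm_qos and cpuidle comes down to a simple rule, sketched here in Python (illustrative only; the state names and latencies are invented): pick the deepest, most power-saving idle state whose exit latency still satisfies the tightest quality-of-service request.

```python
# Illustrative sketch of the pm_qos / cpuidle interaction: choose the
# deepest idle state whose exit latency fits the current QoS bound.
# State names and latencies are invented.

IDLE_STATES = [                 # (name, exit_latency_us), shallow first
    ("shallow",   1),
    ("retention", 100),
    ("off-mode",  5000),
]

def pick_idle_state(qos_latency_us):
    """Return the deepest state whose exit latency fits the bound."""
    best = IDLE_STATES[0]       # the shallowest state always qualifies
    for name, latency in IDLE_STATES:
        if latency <= qos_latency_us:
            best = (name, latency)
    return best[0]
```

A driver registering a tight pm_qos request (say, 300µs) thus keeps the governor out of the deepest states until the request is dropped.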

Later still in the evolution of Linux power management, hotplug support was added, which can reduce the power consumption by unplugging CPU cores. There are a number of outstanding issues there, though, including latency and policy. Vendors have various "patches floating around", but there isn't a consistent approach. Coming up with policies, perhaps embodied in a hotplug governor, is something that needs to be done.

Runtime power management was the next component added in. PMWG would like to use it to reduce the need for drivers to talk directly to the clocks; instead, drivers would talk in a more general way to the runtime power management framework. Lots of code that is scattered around in various drivers can be centralized in bus drivers, which will make the device drivers much more portable because they don't refer to specific clocks. Vendors have started switching over to using the runtime power management framework, but "it's a painful process" to change all of the drivers, he said.

The latest piece of the power management puzzle is the addition of Operating Performance Points (OPP) support, which was added in 2.6.38. OPP is a way to describe frequency/voltage pairs that a particular SoC will support for its various sub-modules. OPP is very CPU/SoC-specific, but can also encapsulate the requirements for different buses and co-processors. The cpufreq framework can make use of the information as it changes the frequency characteristics of different parts of the hardware.
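An OPP table and the kind of lookup cpufreq would perform can be modeled in a few lines (Python; the frequency/voltage values are invented, and the helper merely echoes the spirit of the kernel's "ceiling" lookup, not its actual API):

```python
# Illustrative model of an OPP table: (frequency_hz, microvolts)
# pairs a SoC supports, values invented. A cpufreq-style consumer
# asks for the slowest OPP at or above a target frequency.

OPP_TABLE = sorted([
    (300_000_000,   975_000),
    (600_000_000, 1_075_000),
    (800_000_000, 1_200_000),
])

def find_freq_ceil(target_hz):
    """Return (freq, uV) of the slowest OPP meeting target_hz."""
    for freq, uv in OPP_TABLE:
        if freq >= target_hz:
            return freq, uv
    raise ValueError("no OPP satisfies the request")
```

Because each entry pairs a frequency with the minimum voltage that sustains it, raising or lowering the operating point always keeps frequency and voltage consistent, which is the point of encapsulating them together.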

As more dual-core and quad-core packages are being used, heat can be a problem. The existing thermal management framework is not being used by ARM vendors yet, and there are a number of issues to be resolved. Linaro wants to "figure it out once and for all", and that is one of its focuses in the coming months. One of the questions is what should be done when the system is overheating. Should it unplug one or more cores? Or reduce the frequency of the CPU clock? One of the "crazy things" PMWG has been thinking about is registering devices that can reduce their frequency as "cooling devices" (since they will generate less heat with a lower frequency).

PMWG's plans

The existing thermal management code works for desktop Linux, Ubuntu in particular, and also for Android, but there is still some experimenting that needs to be done to come up with an ARM-wide solution. Another area that PMWG will work on is adding scheduling domains for ARM so that you can "tweak your scheduler policy" regarding how processes and threads get spread around on multiple cores. Scheduling domains and sched_mc tunables could eliminate the need for hotplug in some cases, he said.

Rationalizing the names and abilities of the processor C-states is also something that PMWG will be working on. Kucheria said that PMWG wants to "start a conversation" with the relevant vendors and developers to make that happen. PowerDebug enhancements are also on the radar: "If you need stuff [in PowerDebug], let us know". There is lots of other consolidation work that could be done, but there are only enough developers to address the parts he described, at least in the near term.

At the end of the talk, Kucheria put the Linux power management diagram slide back up, noting that the complexity was "great for job security". There is clearly plenty of work to do in the ARM tree in the months ahead. Kucheria's talk just covered the work going on in the power management group, but there are four other groups within Linaro (kernel, toolchain, graphics, and multimedia) that are doing similar jobs inside and outside of the kernel. One gets the sense that the companies who founded Linaro were getting as tired of the chaotic ARM world as the kernel developers (e.g. Linus Torvalds) are. So far, the organization has made some strides, but there is a long way to go.

Comments (none posted)

Safely swapping over the net

By Jonathan Corbet
April 19, 2011
Swapping, like page writeback, operates under some severe constraints. The ability to write dirty pages to backing store is critical for memory management; it is the only way those pages can be freed for other uses. So swapping must work well in situations where the system has almost no memory to spare. But writing pages to backing store can, itself, require memory. This problem has been well solved (with mempools) for locally-attached devices, but network-attached devices add some extra challenges which have never been addressed in an entirely satisfactory way.

This is not a new problem, of course; LWN ran an article about swapping over network block devices (NBD) almost exactly six years ago. Various approaches were suggested then, but none were merged; it remains to be seen whether the latest attempt (posted by Mel Gorman based on a lot of work by Peter Zijlstra) will be more successful.

The kernel's page allocator makes a point of only giving out its last pages to processes which are thought to be working to make more memory free. In particular, a process must have either the PF_MEMALLOC or TIF_MEMDIE flag set; PF_MEMALLOC indicates that the process is currently performing memory compaction or direct reclaim, while TIF_MEMDIE means the process has run afoul of the out-of-memory killer and is trying to exit. This rule should serve to keep some memory around for times when it is needed to make more memory free, but one aspect of this mechanism does not work entirely well: its interaction with slab allocators.
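The watermark rule can be modeled in userspace (Python; the flag names mirror the kernel's, but the allocator itself is invented for illustration): the last pages below the reserve line go only to tasks flagged as working to free memory.

```python
# Illustrative model of the reserve rule: only tasks flagged as
# freeing memory (PF_MEMALLOC or TIF_MEMDIE in the kernel) may dip
# below the reserve watermark. The Allocator class is invented.

PF_MEMALLOC = 0x1
TIF_MEMDIE  = 0x2

class Allocator:
    def __init__(self, free_pages, reserve):
        self.free_pages = free_pages
        self.reserve = reserve        # pages kept for reclaimers

    def alloc_page(self, task_flags):
        urgent = task_flags & (PF_MEMALLOC | TIF_MEMDIE)
        limit = 0 if urgent else self.reserve
        if self.free_pages > limit:
            self.free_pages -= 1
            return True               # got a page
        return False                  # ordinary allocation fails
```

An ordinary task sees failure once only the reserve remains, while a reclaimer can still make progress - which is exactly the guarantee the slab interaction described next was undermining.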

The slab allocators grab whole pages and hand them out in smaller chunks. If a process marked with PF_MEMALLOC or TIF_MEMDIE requests an object from the slab allocator, that allocator can use a reserved page to satisfy the request. The problem is that the remainder of the page is then made available to any other process which may make a request; it could, thus, be depleted by processes which are making the memory situation worse, not better.

So one of the first things Mel's patch series does is to adapt a patch by Peter that adds more awareness to the slab allocators. A new boolean value (pfmemalloc) is added to struct page to indicate that the corresponding page was allocated from the reserves; the recipient of the page is then expected to treat it with due care. Both slab and SLUB have been modified to recognize this flag and reserve the rest of the page for suitably-marked processes. That change should help to ensure that memory is available where it's needed, but at the cost of possibly failing other memory allocations even though there are objects available.
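The effect of the pfmemalloc marking can be sketched as follows (Python; the structure and function names are invented, but the rule matches the change described above): once a slab page comes from the reserves, the objects carved from it are handed out only to tasks that are themselves entitled to reserve memory.

```python
# Illustrative model of the pfmemalloc slab change: the remainder of
# a reserve-backed slab page is off-limits to ordinary tasks.
# SlabPage and slab_alloc are invented names.

class SlabPage:
    def __init__(self, nr_objects, pfmemalloc):
        self.free_objects = nr_objects
        self.pfmemalloc = pfmemalloc  # page came from the reserves

def slab_alloc(page, task_is_memalloc):
    """Hand out one object, honoring the pfmemalloc restriction."""
    if page.free_objects == 0:
        return False
    if page.pfmemalloc and not task_is_memalloc:
        return False                  # remainder stays reserved
    page.free_objects -= 1
    return True
```

This is also where the cost noted above shows up: an ordinary task's allocation can fail even though free objects exist on the page.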

The next step is to add a __GFP_MEMALLOC GFP flag to mark allocation requests which can dip into the reserves. This flag separates the marking of urgent allocation requests from the process state - a change that will be useful later in the series, where there may be no convenient process state available. It will be interesting to see how long it takes for some developer to attempt to abuse this flag elsewhere in the kernel.

The big problem with network-based swap is that extra memory is required for the network protocol processing. So, if network-based swap is to work reliably, the networking layer must be able to access the memory reserves. Quite a bit of network processing is done in software interrupt handlers which run independently of any given process. The __GFP_MEMALLOC flag allows those handlers to access reserved memory, once a few other tweaks have been added as well.

It is not desirable to allow any network operation to access the reserves, though; bittorrent and web browsers should not be allowed to consume that memory when it is urgently needed elsewhere. A new function, sk_set_memalloc(), is added to mark sockets which are involved with memory reclaim. Allocations for those sockets will use the __GFP_MEMALLOC flag, while all other sockets have to get by with ordinary allocation priority. It is assumed that only sockets managed within the kernel will be so marked; any socket which ends up in user space should not be able to access the reserves. So swapping onto a FUSE filesystem is still not something which can be expected to work.

There is one other problem, though: incoming packets do not have a special "needed for memory reclaim" flag on them. So the networking layer must be able to allocate memory to hold all incoming packets for at least as long as it takes to identify the important ones. To that end, any network allocation for incoming data is allowed to dip into the reserves if need be. Once a packet has been identified and associated with a socket, that socket's flags can be checked; if the packet was allocated from the reserves and the destination socket is not marked as being used for memory reclaim, the packet will be dropped immediately. That change should allow important packets to get into the system without consuming too much memory for unimportant traffic.
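The receive-path decision reduces to a small rule, sketched here in Python (illustrative; sk_set_memalloc() is the real function named above, everything else is invented): a packet built from reserve memory survives only if its destination socket was marked for memory reclaim.

```python
# Illustrative model of the receive-path rule: packets may be built
# from reserve memory before anyone knows which socket they belong
# to; once matched, a reserve-backed packet is kept only for sockets
# marked via sk_set_memalloc(). Other names are invented.

class Socket:
    def __init__(self):
        self.memalloc = False

    def sk_set_memalloc(self):
        self.memalloc = True          # socket is used for reclaim

def deliver(packet_from_reserves, sock):
    """Return True if the packet is kept, False if dropped."""
    if packet_from_reserves and not sock.memalloc:
        return False    # unimportant traffic must not eat reserves
    return True
```

Dropping unimportant reserve-backed packets immediately is what keeps a flood of ordinary traffic from starving the swap traffic the reserves exist to protect.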

The result should be a system where it is safe to swap over a network block device. At least, it should be safe if the low watermark - which controls how much memory is reserved - is high enough. Systems which are swapping over the net may be expected to make relatively heavy use of the reserves, so administrators may want to raise the watermark (found in /proc/sys/vm/min_free_kbytes) accordingly. The final patch in the series keeps an eye on the reserves and starts throttling processes performing direct reclaim if they get too low; the idea here is to ensure that enough memory remains for a smaller number of reclaimers to actually get something done. Adjusting the size of the reserves dynamically might be the better solution in the long run, but that feature has been omitted for now in the interest of keeping the patch series from getting too large.

Comments (none posted)

Patches and updates

Kernel trees

Architecture-specific

Core kernel code

John Stultz: Posix Alarm Timers

Device drivers

Documentation

Filesystems and block I/O

Memory management

Networking

Benchmarks and bugs

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds