| |

Subscribe / Log in / New account

A JIT for packet filters

Benefits for LWN subscribers

The primary benefit from subscribing to LWN is helping to keep us publishing, but, beyond that, subscribers get immediate access to all site content and access to a number of extra site features. Please sign up today!

By Jonathan Corbet
April 12, 2011

The Berkeley Packet Filter (BPF) is a mechanism for the fast filtering of network packets on their way to an application. It has its roots in BSD in the very early 1990's, a history that was not enough to prevent the SCO Group from claiming ownership of it. Happily, that claim proved to be as valid as the rest of SCO's assertions, so BPF remains a part of the Linux networking stack. A recent patch from Eric Dumazet may make BPF faster, at least on 64-bit x86 systems.

The purpose behind BPF is to let an application specify a filtering function to select only the network packets that it wants to see. An early BPF user was the tcpdump, which used BPF to implement the filtering behind its complex command-line syntax. Other packet capture programs also make use of it. On Linux, there is another interesting application of BPF: the "socket filter" mechanism allows an application to filter incoming packets on any type of socket with BPF. In this mode, it can function as a sort of per-application firewall, eliminating packets before the application ever sees them.

The original BPF distribution came in the form of a user-space library, but the BPF interface quickly found its way into the kernel. When network traffic is high, there is a lot of value in filtering unwanted packets before they are copied into user space. Obviously, it is also important that BPF filters run quickly; every bit of per-packet overhead is going to hurt in a high-traffic situation. BPF was designed to allow a wide variety of filters while keeping speed in mind, but that does not mean that it cannot be made faster.

BPF defines a virtual machine which is almost Turing-machine-like in its simplicity. There are two registers: an accumulator and an index register. The machine also has a small scratch memory area, an implicit array containing the packet in question, and a small set of arithmetic, logical, and jump instructions. The accumulator is used for arithmetic operations, while the index register provides offsets into the packet or into the scratch memory areas. A very simple BPF program (taken from the 1993 USENIX paper [PDF]) might be:

	ldh	[12]
	jeq	#ETHERTYPE_IP, l1, l2
    l1:	ret	#TRUE
    l2:	ret	#0

The first instruction loads a 16-bit quantity from offset 12 in the packet to the accumulator; that value is the Ethernet protocol type field. It then compares the value to see if the packet is an IP packet or not; IP packets are accepted, while anything else is rejected. Naturally, filter programs get more complicated quickly. Header length can vary, so the program will have to calculate the offsets of (for example) TCP header values; that is where the index register comes into play. Scratch memory (which is the only place a BPF program can store to) is used when intermediate results must be kept.

The Linux BPF implementation can be found in net/core/filter.c; it provides "standard" BPF along with a number of Linux-specific ancillary instructions which can test whether a packet is marked, which CPU the filter is running on, which interface the packet arrived on, and more. It is, at its core, a long switch statement designed to run the BPF instructions quickly. This code has seen a number of enhancements and speed improvements over the years, but there has not been any fundamental change for a long time.

Eric Dumazet's patch is a fundamental change: it puts a just-in-time compiler into the kernel to translate BPF code directly into the host system's assembly code. The simplicity of the BPF machine makes the JIT translation relatively simple; every BPF instruction maps to a straightforward x86 instruction sequence. There are a few assembly language helpers which help to implement the virtual machine's semantics; the accumulator and index are just stored in the processor's registers. The resulting program is placed in a bit of vmalloc() space and run directly when a packet is to be tested. A simple benchmark shows a 50ns savings for each invocation of a simple filter - that may seem small, but, when multiplied by the number of packets going through a system, that difference can add up quickly.

The current implementation is limited to the x86-64 architecture; indeed, that architecture is wired deeply into the code, which is littered with hard-coded x86 instruction opcodes. Should anybody want to add a second architecture, they will be faced with the choice of simply replicating the whole thing (it is not huge) or trying to add a generalized opcode generator to the existing JIT code.

An obvious question is: can this same approach be applied to iptables, which is more heavily used than BPF? The answer may be "yes," but it might also make more sense to bring back the nftables idea, which is built on a BPF-like virtual machine of its own. Given that there has been some talk of using nftables in other contexts (internal packet classification for packet scheduling, for example), the value of a JIT-translated nftables could be even higher. Nftables is a job for another day, though; meanwhile, we have a proof of the concept for BPF that appears to get the job done nicely.

Index entries for this article
Kernel	BPF
Kernel	Networking/Packet filtering
Kernel	Nftables

(Log in to post comments)

A JIT for packet filters

Posted Apr 14, 2011 6:57 UTC (Thu) by jengelh (subscriber, #33263) [Link] (3 responses)

>An obvious question is: can this same approach be applied to iptables, which is more heavily used than BPF?

Since xtables modules are already handcrafted for a specific task, any interpreter module for arbitrary expressions (such as xt_u32 and nft) has a tendency to naturally run slower. But, if BPF can be JITed, it would seem it not being impossible to extend xt_u32.

A JIT for packet filters

Posted Apr 15, 2011 14:01 UTC (Fri) by Nelson (subscriber, #21712) [Link] (2 responses)

Someone beat me to the punch.. I have been toying with code that does this for a few months. Congrats, I hope the community accepts the patch.

I did some research on this for work about 2 years ago, with something like BPF there is a dramatic gain due to the nature of the interpreter. You can cut a lot of cruft out with JIT. As for generic firewall rules? It's not as dramatic but you can get a pretty general across the board improvement and on some architectures it's definitely in the interesting range (maybe 30%, maybe more, I guess it depends how far you go) If you simply recode firewall rules as binary, think about it this way: the firewall has a linked list of instructions, all the linked list code goes away (it's not much, but some memory reads, a few instructions here and there) and then as you execute the instructions the JIT ones basically turn memory reads into literals so you've can dump the loads and some other machinery. It's not warp speed but a nice bump, maybe in the 30%ish generally, it depends on the architecture and there are a lot of variables. Like I said, it's interesting and noticeable improvement.

Now where it can be interesting is if you coded up a more advanced compiler to optimize the rules. (tcpdump does this, it's ugly but check it out, look at the optimized output some time) a typical stream of rules might have 10 rules that all apply to TCP packets and then check various IP ranges and ports. xtables currently would execute each "instruction" until it reached a result (is packet TCP? does packet src IP match range Y from rule.. okay go to the next one, is the packet TCP?... it would check TCP 10 times) A good compiler can invert that logic and figure out better ways to do it, (if packet TCP? yes then see if it's in the ranges of IPs from these 10 rules and maybe those rules can be compressed in to just checking a couple bits because all the IPs are similar... no it's not TCP, then skip all ten rules and look at the next batch.) I wouldn't hazard a guess as to how much faster this makes the firewall but the potential is HUGE. So we could maybe replace iptables with some sort of LLVM based compiler that generated a bytecode "program" that contained the whole firewall.

Whether it's worth the complexity, the difficulty in debugging and porting is a different question.

A JIT for packet filters

Posted Apr 16, 2011 2:55 UTC (Sat) by wahern (subscriber, #37304) [Link] (1 responses)

The problem with the benchmark they did was that it's not a fair appraisal of interpretation versus compiling. The BPF switch interpreter isn't threaded. That is, at the end of every instruction it jumps back to the while loop, which does a conditional branch. Then there's the switch, which may or not may not do one or more conditional branches.

For fair comparison with a JIT compiler, the interpreter would instead jump directly from one instruction to the next using jump tables--indexing into a table of labels constructed using GCC's label address-of operator, &&.

On my own VM I can dramatically improve performance on many programs merely by threading the interpreter. If doing this gives the same performance, which it very well could given that BPF might be data bound and the ops are so simple, then it would be far preferable rather than adding hundreds of lines of new code for each architecture (or, conversely, having some architectures needlessly disadvantaged).

A JIT for packet filters

Posted Apr 18, 2011 18:33 UTC (Mon) by Nelson (subscriber, #21712) [Link]

That's a fair criticism, you can make the BPF VM more efficient, it's still a comparison of whats there to a JIT though. Even with those improvements, you can get a fairly consistent boost with a JIT, just from turning the loads in to literals. It might not be worth the complexity but if there was a more generic JIT framework such that the platform support was there it is an interesting optimization if you rely upon BPF stuff a lot.

A JIT for packet filters

Posted Apr 14, 2011 10:24 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

The obvious solution, of course, is to import LLVM into the kernel. Right?

A JIT for packet filters

Posted Apr 14, 2011 17:48 UTC (Thu) by fuhchee (guest, #40059) [Link]

One hopes that security/assurance concerns with a little JIT would be more manageable than with LLVM. Though you never know whether someone else's enthusiasm for importing <whatever> into the kernel might overrule such concerns.

A JIT for packet filters

Posted Apr 14, 2011 15:13 UTC (Thu) by trasz (guest, #45786) [Link] (3 responses)

Might be worth mentioning that FreeBSD has been using this approach for years.

A JIT for packet filters

Posted Apr 14, 2011 16:28 UTC (Thu) by wahern (subscriber, #37304) [Link] (2 responses)

Links? I can't find any JIT compiler in the FreeBSD kernel, at least skimming code in sys/net/bpf.c or bpf_filter.c.

I'd be interested in this because I'm looking for a tiny JIT compiler. MyJIT is the best I can find so far, but it can't recover from OOM errors and it's quite large, which means I'm too lazy to fix it.

A JIT for packet filters

Posted Apr 14, 2011 16:37 UTC (Thu) by trasz (guest, #45786) [Link]

See http://fxr.watson.org/fxr/source/net/bpf_jitter.c and http://fxr.watson.org/fxr/source/amd64/amd64/bpf_jit_mach....

It was introduced six years ago, with this commit:

r153151 | jkim | 2005-12-06 03:58:12 +0100 (Tue, 06 Dec 2005) | 17 lines

Add experimental BPF Just-In-Time compiler for amd64 and i386.

Use the following kernel configuration option to enable:

options BPF_JITTER

If you want to use bpf_filter() instead (e. g., debugging), do:

sysctl net.bpf.jitter.enable=0

to turn it off.

Currently BIOCSETWF and bpf_mtap2() are unsupported, and bpf_mtap() is
partially supported because 1) no need, 2) avoid expensive m_copydata(9).

Obtained from: WinPcap 3.1 (for i386)

A JIT for packet filters

Posted Apr 14, 2011 16:39 UTC (Thu) by wahern (subscriber, #37304) [Link]

Nevermind

A JIT for packet filters

Posted Apr 15, 2011 9:56 UTC (Fri) by rilder (guest, #59804) [Link] (4 responses)

Since generation of the JIT code is a one-time task, can't a usermode helper in form of something like LLVM be used by the kernel then or atleast be made pluggable ? It will also support multiple architectures and may generate more optimized code

A JIT for packet filters

Posted Apr 17, 2011 1:40 UTC (Sun) by jzbiciak (guest, #5246) [Link] (3 responses)

I can understand some folks being, erm, jittery about being able to load arbitrary code into the kernel. Then again, we have loadable kernel modules, so why not?

I agree that doing this in userspace seems to make much more sense than doing it in the kernel if optimized performance is your main careabout, since you can bring more resources to bear on the problem without bloating the kernel. It then comes down to managing the potential security issues, and trusting the correctness of the translator since you lose any sandboxing the interpreter might have offered.

(Yes, the translator can insert the required bounds checks, but nothing requires it to if you're loading raw machine code into the kernel.)

A JIT for packet filters

Posted Apr 17, 2011 21:08 UTC (Sun) by rilder (guest, #59804) [Link]

Good points.
My thought process for this was influenced by:
1. Text processing algorithms which use request_module() to load at runtime for algorithms which are not in kernel. Again, if proper case is exercised here -- not loading outside modprobe path etc. it should be fine.
2. Coming back to usermode helpers, we already allow modules to be modprbed through external helpers, so a similar approach can be used. If someone can write to a sysctl/procfs maliciously, then system is already compromised. I was thinking of reading from a pipe using a usermode helper similar to how core dumping function uses it to write instead.

A JIT for packet filters

Posted Apr 26, 2011 2:16 UTC (Tue) by welinder (guest, #4699) [Link]

People at CMU played with that years and years ago. The programs
from user space would only be accepted if they came with a proof
of correctness (which is easy to verify). The buzz words were
"proof-carrying code", I think.

A JIT for packet filters

Posted May 21, 2011 11:56 UTC (Sat) by snemarch (guest, #75085) [Link]

Have the usermode LLVM-compiler generate pf bytecode, and have a small JIT'er in the kernel? Sounds like a reasonably safe scheme to me.

why does this need a JIT compiler?

Posted Apr 28, 2011 6:07 UTC (Thu) by dlang (guest, #313) [Link] (3 responses)

my understanding of the definition of a JIT compiler is that the system starts to run the interpreted code and compiles it as it goes, then uses the compiled version in the future.

filters change very infrequently, so why do you need a JIT compiler instead of a normal compiler?

am I missing something on the definition here? or are they misusing the term JIT? or are they using a JIT setup when they could just as easily use a normal compiler?

why does this need a JIT compiler?

Posted May 21, 2011 12:00 UTC (Sat) by snemarch (guest, #75085) [Link] (2 responses)

JIT isn't a super precisely defined term.

In this context, the "just in time" means the code is not compiled into the kernel (or as a LKM), but generated from user data. You don't need the Java/.NET style "interpret until determined hotspot, then generate native" behavior in order to call something JIT :-)

why does this need a JIT compiler?

Posted Aug 17, 2011 2:42 UTC (Wed) by tao686 (guest, #77944) [Link] (1 responses)

I don't know if this JIT can avoid GPL licence? Since the filter sits on the userspace application, but finnally it is compiled and linked into kernel to execute. any idea on this? or it is a grey area, no one specify this?

From my point of view, this will not volidate the GPL licence. Linus can give some hint on this?

Also there is a trend, more and more code can be developed on userspace, such as UIO framework, it makes driver can be implemented in the userspace, this will avoid GPL licence issue. Then later on, more and more persons develop non-GPL licenced code, right?

why does this need a JIT compiler?

Posted Sep 29, 2011 13:54 UTC (Thu) by mcortese (guest, #52099) [Link]

I can hardly imagine why one should want to distribute his filters, be them interpreted, compiled, or JIT...

Copyright © 2011, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds