+FreeBSD Journal - 2014-01-01
BeagleBone Black
Getting Started
svn update
Source Tree Changes
CLANG in 10
A New Compiler & Library
FreeBSD10
MOVING FORWARD
Table of Contents • Jan/Feb 2014
FreeBSD Journal • Vol. 1, Issue No. 1

FEATURE ARTICLES

Clang in 10
FreeBSD 10 includes out-of-the-box support for the majority of the C11 and C++11 standards.

Implementing System Control

Managed Services
This white paper discusses the challenges associated with running an ISP, specifically managed services, and presents some of the unique solutions that FreeBSD provides. By Joseph Kong

BeagleBone Black: Getting Started with FreeBSD
Popular new ARM systems such as the BeagleBone and Raspberry Pi have generated a lot of developer interest in FreeBSD/ARM. By Tim Kientzle

The Z File System: The Future of Storage
ZFS is more than just a file system, as it combines the roles of RAID controller, Volume Manager, and File System. By Allan Jude

DEPARTMENTS & COLUMNS

Foundation Letter
Welcome to FreeBSD Journal. By FreeBSD Journal Board

Events Calendar
By Dru Lavigne

svn update
Here are the latest changes in the FreeBSD source tree for all supported release and development branches, including new features added to the FreeBSD operating system, bug fixes and enhancements, and driver updates for newly-supported hardware devices. By Glen Barber

This Month in FreeBSD
By Dru Lavigne
Jan/Feb 2014 1
® LETTER
from the Board
JOURNAL
Editorial Board
John Baldwin • Member of the
FreeBSD Core Team
Clang in 10
By David Chisnall
Some History
In 2007, GCC went to GPLv3. This license had
one or two clauses that some major down-
stream consumers found unacceptable and so
the decision was made not to import any
GPLv3 code into the base system. The version
of GCC stayed at 4.2.1.
In 2008, some developers at Apple released
clang, a C front end for the LLVM compiler
infrastructure. LLVM was originally offered to
the Free Software Foundation as a new back
end for GCC, but was turned down. Apple
started using the LLVM back end (and hired
the original developer of LLVM and a lot of
other people) and continued to improve it.
LLVM is far more than just a C compiler. It
provides a uniform intermediate representa-
tion that language front ends can generate,
optimisers can modify, and back ends can convert
into native code. One of its earliest uses was in
Apple’s OpenGL shader stack, where a naive
LLVM-based JIT compiler outperformed the
handwritten one by around 20% and worked
on all of Apple’s supported architectures.

In the last five years, the combination of clang and LLVM has become a mature product. It’s now the only compiler supported by Apple and is one of the standard compilers shipped with the Android NDK. Companies like ARM, Qualcomm, Apple, Google, AMD, Intel, and many others are contributing large amounts of code to it.

Meanwhile, our old GCC has begun to look quite dated. At the end of 2011, the C and C++ standards committees released specifications for new dialects of their respective languages. The extensions to C were relatively simple. The changes to C++ were huge. Both required changes to both the compiler and the standard library.

The C Standard Library
The C standard library is a core feature of FreeBSD. Various people have worked on improving this to implement the new C11 fea-

things, like the separator character for floating point values. The setlocale() function sets the locale globally, which means that it’s not safe for multiple locales in a multithreaded program. The POSIX2008 locale extensions provide a per-thread locale, but more importantly they provide a set of variants of the standard C functions that take an explicit locale as a parameter. Lots of things use these APIs, including new versions of GNOME, but the primary consumer that we’re interested in here is libc++.

Libc++ is another component of the stack to originate at Apple. Mostly developed by Howard Hinnant, it is a completely new implementation of the C++ standard library, designed from scratch for C++11. Implementing C++11 support in the standard library required quite invasive changes (which broke backwards compatibility) and so seemed like a good place to start from scratch. This also allowed all of the standard data structures to be redesigned in a way that makes more sense for modern hardware, for example focusing more on cache usage in std::string.
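The per-thread locale and the explicit-locale function variants described above can be sketched in a few lines of C. This is a minimal illustration, not code from FreeBSD or libc++: the helper names upper_in_c_locale() and upper_in_thread_locale() are hypothetical, and the sketch assumes a libc that exposes newlocale(), uselocale(), freelocale(), and toupper_l() when _POSIX_C_SOURCE is set to 200809L (some systems want <xlocale.h> as well).

```c
#define _POSIX_C_SOURCE 200809L
#include <ctype.h>
#include <locale.h>

/*
 * Uppercase one character under the "C" locale without touching the
 * process-wide locale that setlocale() controls. Returns -1 if the
 * locale object cannot be created. (Hypothetical helper.)
 */
int upper_in_c_locale(int c)
{
    locale_t loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (loc == (locale_t)0)
        return -1;
    /* Explicit-locale variant: the locale is just another argument. */
    int result = toupper_l(c, loc);
    freelocale(loc);
    return result;
}

/*
 * Same result through the per-thread locale: uselocale() changes the
 * locale for the calling thread only and returns the previous one so
 * that it can be restored afterwards. (Hypothetical helper.)
 */
int upper_in_thread_locale(int c)
{
    locale_t loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (loc == (locale_t)0)
        return -1;
    locale_t previous = uselocale(loc);
    int result = toupper(c);
    uselocale(previous);
    freelocale(loc);
    return result;
}
```

Neither helper calls setlocale(), so other threads are unaffected. The first form is the one a library such as a C++ standard library would likely prefer, because it does not depend on any ambient state.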
_Atomic(int) x;
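As a hedged sketch of how a declaration like the one above is used through stdatomic.h, the following counter contrasts an explicitly relaxed increment with a load that uses the default ordering. The names events, count_event(), and events_so_far() are invented for illustration, and the code assumes a C11 compiler with native atomics rather than the compatibility paths in FreeBSD's header.

```c
#include <stdatomic.h>

/* A shared event counter, declared with the C11 _Atomic() spelling. */
_Atomic(int) events;

/*
 * Count one event and return the new total. Relaxed ordering makes the
 * increment itself atomic but promises nothing about how it is ordered
 * against accesses to other variables, which is enough for a counter.
 */
int count_event(void)
{
    /* atomic_fetch_add_explicit() returns the value before the add. */
    return atomic_fetch_add_explicit(&events, 1, memory_order_relaxed) + 1;
}

/* Read the counter with the default, sequentially consistent ordering. */
int events_so_far(void)
{
    return atomic_load(&events);
}
```

The functions without the _explicit suffix always use sequentially consistent ordering, so choosing a weaker ordering is something the caller has to spell out.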
At the opposite extreme, atomic operations with relaxed ordering require that the operation be atomic with respect to that single variable, but not with respect to others. If x and y were relaxed, then it would be acceptable for some threads to see the new x and the old y, and some to see the new y and old x.

The stdatomic.h header contains a lot of functions that operate on atomic variables, and our implementation contains several code paths to make it work with old compilers. This is a pattern that we’ve replicated elsewhere and things like the _Thread_local storage qualifier and similar are implemented in our standard headers using extensions when using a compiler that does not support them natively.

One other addition in C11 has made it possible to clean up some of our headers. The standard adds _Generic() expressions, which are similar to switch statements selecting based on the type, rather than the value, of an expression. This is only useful in macros, but it’s useful in several standard macros that must be defined in C header files. In particular, there are several related to numerics that are defined for all floating-point types.

Two examples in this category are isinf() and isnan(), which return true if the argument is infinite or a not-a-number value, respectively. Our old code was determining the correct path to call depending on the size of the argument. This meant that if you passed a 32-bit int value, it would call the function that expected a 32-bit float. This would always return false (because no float that is created by casting an int can possibly be infinite or not-a-number), but almost certainly hid a logic bug because there’s no reason why you’d ever want to check these properties on an int.

We now use _Generic() for these and so they will always go to the correct function and you get an error if you try to call it with the wrong argument. We found a few buggy ports as a result, and some quite bizarre behavior. For example, both Google’s v8 and Mono had configure script checks that tested whether isnan(1) worked. In the Mono case, if they detected that isnan(int) didn’t work, then they declared their own isnan(double) to use, which then conflicted with the system one.

Challenges with Clang
Getting the system ready for clang, initially as the system compiler and then as the only compiler in the base system, has been an interesting experience. A lot of the initial experiences were simply from getting FreeBSD to compile without warnings. Clang gives a lot more warnings than our old gcc (gcc 4.8 is now at a similar quality, although still has a slight tendency toward false positives) and we try to ensure that all of our code builds without warnings. Somewhat amusingly, the worst offender in our tree for having compiler warnings was gcc itself.

In the ports tree, things were somewhat different. We do maintain local patches for a lot of programs, but ideally these should be small (and should be pushed upstream where possible). Most C code works fine with clang. The biggest issue that we faced in the ports tree was that clang defaults to C99 as the dialect when invoked as clang or cc, whereas gcc defaults to C89. It’s somewhat depressing that people still invoke a C compiler as cc in 2013, because cc was deprecated in the 1997 release of POSIX and defined as accepting an unspecified version of the C language. Back then, the choices were K&R C or C89. If you wanted C89, you were recommended to invoke the c89 utility. The next release of POSIX added a c99 utility. Presumably the next one to be published will also specify a c11 utility.

The C99 specification was carefully designed so that valid C89 programs were also valid C99 programs, so this shouldn’t have been a problem. Unfortunately, this didn’t quite work because few people wrote C89 code; instead they wrote C89 with GNU extensions. I said GCC defaulted to C89 mode, but that’s not quite correct: it defaulted to C89 with GNU extensions (gnu89, as it’s known on the command line).

There is only one significant incompatibility between C89 with GNU extensions and C99, and that’s the handling of the inline keyword. The differences meant that code that expected the GNU inline rules would end up with functions defined multiple times in C99 mode and so would fail to link. This is relatively easy to fix (just add -fgnu89-inline to the compiler flags), but it needed to be done for every port that had this kind of error. When you have over 20,000 ports, even simple fixes are a lot of work.

In C++, the problems were more pronounced. The rules for symbol resolution in C++ are incredibly complicated. This is especially true inside templates, where the standard calls for a two-stage lookup. Both GCC and the Microsoft C++ compiler managed to get this wrong. Of course, they did it in different wrong ways,
which was why it has traditionally been very difficult to move C++ code between compilers.

Clang benefited from all of this experience and wrote the C++ parser to the letter of the specification. This means that any standards-compliant C++98 code will compile with clang. These days, so does C++11 code, and some C++1y code (C++1y is the draft that will most likely become C++14). Unfortunately, when you refuse to compile a popular open source program, you don’t get much sympathy when you turn around and say, “well, the code is invalid.”

The Challenge of Migration
For C code, there is no difficulty migrating. The C ABI is defined by the target platform and both Clang and GCC generate entirely compatible code. This is also true for C++ code, if you’re only talking about C++, the language. Unfortunately, there is more to either language: there is also the standard library. In the case of C, this is FreeBSD libc. Again, this is shared between compilers.

In the case of C++, it’s actually two libraries. The smaller of the two implements the dynamic parts of the language such as exceptions and the dynamic_cast<> operator. The larger implements the standard template library (STL). In our old stack, these were implemented by libsupc++ and libstdc++, respectively. Originally, these two were statically linked together.

In the new stack, these are libcxxrt and libc++. As part of the migration path, we wanted to make it possible for programs to link against libraries that used both libc++ and libstdc++. This required modifying our libstdc++ to link against libsupc++ as a dynamic (shared) library, which then allowed libmap.conf to switch between them.

Unfortunately, life is never that simple. ELF symbol versioning associates the symbol with both the version and the library that it came from and so existing binaries would fail to link on symbols suddenly moved from libstdc++.so to libsupc++.so. The solution to this was to make libstdc++ a filter library. This allows it to, effectively, forward the symbol resolution on to libraries it linked against.

With this done, it became possible to link against both. The STL symbols have different names and so you will use the ones from whichever headers you included in the source code. Unfortunately, because they have different symbol names (and different binary layouts), you can’t use them interchangeably. If you have a library that has interfaces that use STL types (for example, std::string) then both the library and the things that call it must use the same STL implementation.

This causes some problems in the ports collection, because a few libraries won’t compile with clang and so can’t use libc++, whereas others require C++11 and so won’t compile with our base GCC. Using a GCC from ports doesn’t really address this either, as many of the old C++ programs also won’t compile with a new GCC, and the new libstdc++ is not binary-compatible with the one that we have in the base system either.

Debugging
One other unfortunate problem with the clang switch is that clang now emits debug information conforming to version 4 of the DWARF standard. Soon, it will default to DWARF 5, which includes support for much smaller debug info tables and for separating out the debug info into separate files during compilation so that they can be linked separately.

Unfortunately, the old version of the GNU debugger (GDB) that we include in the base system can only support DWARF 2. For 10, we’ve imported the LLVM Debugger (LLDB) and Ed Maste has been working (with FreeBSD Foundation funding) on the FreeBSD port.

LLDB, like the rest of LLVM, is very modular. It is intended as a set of libraries that allow debugging features to be added to various applications, ranging from command-line tools to IDEs. It is largely developed by Apple and so remote debugging was a core part of the design, allowing ARM devices to be debugged from x86 desktops.

All of these are nice features, but unfortunately LLDB isn’t quite ready for enabling by default in the 10 release. It’s in the tree, so feel free to upgrade your sources and try building the latest version. We expect to enable it by default in 10.1.
Architectural Problems
These days, FreeBSD has one tier 1 architecture: x86 (in 32-bit and 64-bit variants). ARMv6 and newer are very close to being tier 1 as well. These two are well supported by Clang and by the LLVM back end. Unfortunately, we also have a lot of tier 2 architectures, such as SPARC, PowerPC, and MIPS, with less good support.

LLVM has quite good support for 64-bit PowerPC, developed largely by Argonne National Laboratory, but not nearly as good support for 32-bit PowerPC. Since Apple switched to Intel, these architectures have been dead in the consumer PC market, but they’re still popular in a lot of places where people ship embedded FreeBSD-derived systems, especially in the automotive industry.

The MIPS back end is now able to compile LLVM itself, which is quite an achievement given the size and complexity of the code base, but it’s still a little way away from being able to compile the FreeBSD kernel.

SPARC and IA64 are two with less certain futures. The SPARC back end in LLVM is one of the oldest ones, yet it is still not production-ready. The Itanium back end was removed, after being unmaintained for a while. Intel doesn’t seem to be pushing Itanium very hard, and Oracle seems to regard SPARC as a platform for running Solaris-based Oracle appliances, so the future of these architectures is not that certain anyway, but it would be a shame to drop support for them in FreeBSD while there are still systems using them.

Going Forward
The goal in all of this was to make FreeBSD a modern development platform. We’ve achieved that. FreeBSD 10 shipped with the most complete C11 and C++11 (and C++1y) implementations of any system to date. We now have a modern compiler and C++ stack, with an active upstream community that is engaged with FreeBSD as a consumer, and a number of people (myself included) who contribute to both projects.

We still have a few missing pieces for a completely BSD-licensed toolchain, however. We currently ship a lot of GNU binutils. Some things, such as the GNU assembler, are easy to replace. The LLVM libraries contain all of the required functionality; they just require small tools to be written to implement them.

The one exception is the linker. Like compilers, linkers are quite complex pieces of software. We’re currently evaluating two linkers to replace GNU ld. The first, MCLinker, was originally developed by MediaTek using LLVM libraries and now has a larger community. It currently ships as one of the linkers in the Android SDK, and can link all of the base system, but lacks support for symbol versioning (this may have been finished by the time you’re reading this, as work is ongoing to implement it).

The other option is lld, the LLVM linker. This is a more complex design and is not yet as advanced, but does have some large corporate backers such as Sony (Sony is a FreeBSD consumer), and so might be a better long-term prospect.

Whichever we select, FreeBSD will continue to pick the best tools for the job. We hope to have a fully BSD licensed toolchain by default for 11.0, and as optional components in the 10.x series. Being BSD licensed is always nice, but we won’t switch until the tools are also better.
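Returning to the _Generic() dispatch described earlier for isinf() and isnan(), the idea can be sketched as a type-selecting macro. This is an illustrative reconstruction under stated assumptions, not the macro from FreeBSD's headers: my_isnan() and the nan_check_*() helpers are hypothetical names, and the self-comparison test assumes IEEE floating-point semantics (no fast-math flags).

```c
/* Per-type helpers: a NaN is the only value that differs from itself.
 * The names are illustrative, not FreeBSD's. */
int nan_check_float(float v)         { return v != v; }
int nan_check_double(double v)       { return v != v; }
int nan_check_ldouble(long double v) { return v != v; }

/*
 * Select a helper from the static type of the argument. There is no
 * association for int and no default branch, so my_isnan(1) is rejected
 * at compile time instead of silently returning 0.
 */
#define my_isnan(x) _Generic((x),               \
        float:       nan_check_float,           \
        double:      nan_check_double,          \
        long double: nan_check_ldouble)(x)
```

A call such as my_isnan(1.0f) resolves to nan_check_float() at compile time. The selection depends on the argument's type rather than its size, which is the property the older size-based dispatch lacked.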
IMPLEMENTING SYSTEM CONTROL NODES (sysctl)
BY JOHN BALDWIN
query the current state of the node, set the state of the node, or perform both operations.

The sysctlnametomib(3) function maps a node’s full name to its internal address. This operation uses an internal sysctl node and is a bit expensive, so a program that queries a control node frequently can use this routine to cache the address of a node. It can then query the node using sysctl(3) rather than sysctlbyname(3).

Some control nodes have a named prefix with unnamed leaves. An example of this is the “kern.proc.pid” node. It contains a child node for each process. The internal address of a given process’s node consists of the address of “kern.proc.pid” and a fourth number which corresponds to the pid of the process. Example 1 demonstrates using this to fetch information about the current process.

    struct kinfo_proc kp;
    int i, mib[4];
    size_t len;

    /* Fetch the address of the "kern.proc.pid" prefix. */
    len = 3;
    sysctlnametomib("kern.proc.pid", mib, &len);

    /* Fetch the process information for the current process. */
    len = sizeof(kp);
    mib[3] = getpid();
    sysctl(mib, 4, &kp, &len, NULL, 0);

Example 1

Simple Control Nodes
In the kernel, the <sys/sysctl.h> header provides several macros to declare control nodes. Each declaration includes the name of the parent node, a number to assign to this node, a name for the node, flags to control the node’s behavior, and a description of the node. Some declarations require additional arguments. The parent node is identified by its full name, but with a single underscore as a prefix and dots replaced by underscores. For example, the “foo.bar” parent node would be identified by “_foo_bar”. To declare a top-level node, use an empty parent name. The number should use the macro OID_AUTO to request that the system assign a unique number. (Some nodes use hardcoded numbers for legacy reasons, but all new nodes should use system-assigned numbers.) The flags argument must indicate which types of access the node supports (read, write, or both) and can also include other optional flags. The description should be a short string describing the node. It is displayed instead of the value when the “-d” flag is passed to sysctl(8).

Integer Nodes
The simplest and most common control node is a leaf node that controls a single integer. This type of node is defined by the SYSCTL_INT() macro. It accepts two additional arguments: a pointer and a value. If the pointer is non-NULL, it should point to an integer variable which will be read and written by the control node (as specified in the flags argument). If the pointer is NULL, then the node must be a read-only node that returns the value argument when read.

    SYSCTL_INT(_kern, OID_AUTO, one, CTLFLAG_RD,
        NULL, 1, "Always returns one");

    int frob = 500;
    SYSCTL_INT(_kern, OID_AUTO, frob, CTLFLAG_RW,
        &frob, 0, "The \"frob\" variable");

Example 2

Example 2 defines two integer sysctl nodes: “kern.one” is a read-only node that always returns the value one and “kern.frob” is a read-write node that reads and writes the value of the global “frob” integer.

Additional macros are available for several integer types including: SYSCTL_UINT() for unsigned integers, SYSCTL_LONG() for signed long integers, SYSCTL_ULONG() for unsigned long integers, SYSCTL_QUAD() for signed 64-bit integers, SYSCTL_UQUAD() for 64-bit unsigned integers, and SYSCTL_COUNTER_U64() for 64-bit unsigned integers managed by the counter(9) API. Only the SYSCTL_INT() and SYSCTL_UINT() macros may be used with a NULL pointer. The other macros require a non-NULL pointer and ignore the value parameter.

Other Node Types
The SYSCTL_STRING() macro is used to define a leaf node with a string value. This macro accepts two additional arguments: a pointer and a length. The pointer should point to the start of the string. If the length is zero, the string is assumed to be a constant string and attempts to write to the string will fail (even if the node allows write access). If the length is non-zero, it specifies the maximum length of the string buffer (including a terminating null character) and attempts to write a string longer than the buffer’s size will fail.

The SYSCTL_STRUCT() macro is used to define a leaf node whose value is a single C struct. This macro accepts one additional pointer argument which should point to the structure to be controlled. The size of the structure is inferred from the type.

The SYSCTL_OPAQUE() macro is used to define a leaf node whose value is a data buffer of unspecified type. The macro accepts three additional arguments: a pointer to the start of the data buffer, the length of the data buffer, and a string describing the format of the data buffer.

The SYSCTL_NODE() macro is used to define a branch node. This macro accepts one additional argument which is a pointer to a function handler. For a branch node with explicit leaf nodes (declared by other SYSCTL_*() macros) the pointer should be NULL. The macro may be prefixed with static to declare a branch node private to the current file. A public node can be forward declared in a header for use by other files via the SYSCTL_DECL() macro. This macro accepts a single argument which is the full name of the node specified in the format used for a parent node in the other macro invocations. Example 3 defines a top level node with three leaf nodes describing a string buffer, a structure, and an opaque data buffer.

    static SYSCTL_NODE(, OID_AUTO, demo, 0, NULL,
        "Demonstration tree");

    static char name_buffer[64] = "initial name";
    SYSCTL_STRING(_demo, OID_AUTO, name,
        CTLFLAG_RW, name_buffer,
        sizeof(name_buffer), "Demo name");

    static struct demo_stats {
        int demo_reads;
        int demo_writes;
    } stats;
    SYSCTL_STRUCT(_demo, OID_AUTO, status,
        CTLFLAG_RW, &stats, demo_stats,
        "Demo statistics");

    SYSCTL_OPAQUE(_demo, OID_AUTO, mi_switch,
        CTLFLAG_RD, &mi_switch, 64, "Code",
        "First 64 bytes of mi_switch()");

Example 3

Node Flags
Each node definition requires a flags argument. All leaf nodes and branch nodes with a non-NULL function handler must specify the permitted access (read and/or write) in the flags field. The flags field can also include zero or more of the flags listed in Table 1.

Table 1
FLAG               PURPOSE
CTLFLAG_ANYBODY    All users can write to this node. Normally only the superuser can write to a node.
CTLFLAG_SECURE     Can only be written if securelevel is less than or equal to zero.
CTLFLAG_PRISON     Can be written to by a superuser inside of a prison created by jail(2).
CTLFLAG_SKIP       Hides this node from iterative walks of the tree such as when sysctl(8) lists nodes.
CTLFLAG_MPSAFE     Handler routine does not require Giant. All of the simple node types set this flag already. It is only required explicitly for nodes that use a custom handler.
CTLFLAG_VNET       Can be written to by a superuser inside of a prison if that prison contains its own virtual network stack.

Complex Control Nodes
System control nodes are not limited to simply reading and writing existing variables. Each leaf node includes a pointer to a handler function that is invoked when the node is accessed. This function is responsible for returning the “old” value of a node as well as accepting “new” values assigned to a node. The standard node macros such as SYSCTL_INT() use predefined handlers in sys/kern/kern_sysctl.c.

A leaf node with a custom handler function is defined via the SYSCTL_PROC() macro. In addition to the standard arguments accepted by the other macros, SYSCTL_PROC() accepts a
pointer argument named “arg1”, an integer argument named “arg2”, a pointer to the handler function, and a string describing the format of the node’s value. The flags argument is also required to specify the type of the node’s value.

Node Types
The type of a node’s value is specified in a field in the node’s flags. The standard node macros all work with a specific type and adjust the flags argument to include the appropriate type. The SYSCTL_PROC() macro does not imply a specific type, so the type must be specified explicitly. Note that all nodes are allowed to return or accept an array of values and the type simply specifies the type of one array member. The standard node macros all return or accept a single value rather than an array. The available types are listed in Table 2.

Table 2
FLAG              MEANING
CTLTYPE_NODE      This node is a branch node and does not have an associated value.
CTLTYPE_INT       This node describes one or more signed integers.
CTLTYPE_UINT      This node describes one or more unsigned integers.
CTLTYPE_LONG      This node describes one or more signed long integers.
CTLTYPE_ULONG     This node describes one or more unsigned long integers.
CTLTYPE_S64       This node describes one or more signed 64-bit integers.
CTLTYPE_U64       This node describes one or more unsigned 64-bit integers.
CTLTYPE_STRING    This node describes one or more strings.
CTLTYPE_OPAQUE    This node describes an arbitrary data buffer.
CTLTYPE_STRUCT    This node describes one or more data structures.

Note that since SYSCTL_PROC() only defines leaf nodes, CTLTYPE_NODE should not be used. Branch nodes with custom handlers are described below.

Node Format Strings
Each node has a format string in addition to a type. The sysctl(8) utility uses this string to format the node’s value. As with the node type, most of the standard macros specify the format implicitly. The SYSCTL_OPAQUE and SYSCTL_PROC macros require the format to be specified explicitly. Most format strings are tied to a specific type and most types only have a single format string. The available format strings are listed in Table 3.

Table 3
FORMAT       MEANING
“A”          An ASCII string. Used with CTLTYPE_STRING.
“I”          A signed integer. Used with CTLTYPE_INT.
“IU”         An unsigned integer. Used with CTLTYPE_UINT.
“IK”         An integer whose value is in units of one-tenth of a degree Kelvin. The sysctl(8) utility will convert the value to Celsius before displaying. Used with CTLTYPE_UINT.
“L”          A signed long integer. Used with CTLTYPE_LONG.
“LU”         An unsigned long integer. Used with CTLTYPE_ULONG.
“Q”          A signed 64-bit integer. Used with CTLTYPE_S64.
“QU”         An unsigned 64-bit integer. Used with CTLTYPE_U64.
“S,<foo>”    A C structure of type struct foo. Used with CTLTYPE_STRUCT. The sysctl(8) utility understands a few structure types such as struct timeval and struct loadavg.

Handler Functions
A system control node handler can be used to provide additional behavior beyond reading and writing an existing variable. Handlers can be used to provide input validation such as range checks on new node values. Handlers can also generate temporary data structures to return to userland. This is commonly done for handlers which return a snapshot of system state such as a list of open network connections or the process table.

Handler functions accept four arguments and return an integer error code. The <sys/sysctl.h> header provides a macro to define the function arguments: SYSCTL_HANDLER_ARGS. It defines four arguments: “oidp”, “arg1”, “arg2”, and “req”. The
“oidp” argument points to the struct sysctl_oid structure that describes the node whose handler is being invoked. The “arg1” and “arg2” arguments hold the values assigned to the “arg1” and “arg2” arguments to the SYSCTL_PROC() invocation that defined this node. The “req” argument points to a struct sysctl_req structure that describes the specific request being made. The return value should be zero on success or an error number from <sys/errno.h> on failure. If EAGAIN is returned, then the request will be retried within the kernel without returning to userland or checking for signals.

Example 4 defines two integer nodes with a custom handler that rejects attempts to set an invalid value. It uses the predefined handler function sysctl_handle_int() that is used to implement SYSCTL_INT() to update a local variable. If the request attempts to set a new value, it validates the new value and only updates the associated variable if the new value is accepted.

    /*
     * 'arg1' points to the variable being exported, and 'arg2' specifies a
     * maximum value. This assumes that negative values are not permitted.
     */
    static int
    sysctl_handle_int_range(SYSCTL_HANDLER_ARGS)
    {
        int error, value;

Example 4

This example uses a predefined handler (sysctl_handle_int()) to publish the old value and accept a new value. Some custom handlers need to manage these steps directly. The macros SYSCTL_IN() and SYSCTL_OUT() are provided for this purpose. Both macros accept three arguments: a pointer to the current request (“req”) from SYSCTL_HANDLER_ARGS, a pointer to a buffer in the kernel’s address space, and a length. The SYSCTL_IN() macro copies data from the caller’s “new” buffer into the kernel buffer. The SYSCTL_OUT() macro copies data from the kernel buffer into the caller’s “old” buffer. These macros return zero if the copy is successful and an error number if it fails. In particular, if the caller buffer is too small, the macros will fail and return ENOMEM.

These macros can be invoked multiple times. Each invocation advances an internal offset into the caller’s buffer. Multiple invocations of SYSCTL_OUT() append the kernel buffers passed to the macro to the caller’s “old” buffer, and multiple invocations of SYSCTL_IN() will read sequential blocks of data from the caller’s “new” buffer.

One of the values returned to userland after a sysctl(3) invocation is the amount of data returned in the “old” buffer. The count is advanced by the full length passed to SYSCTL_OUT() even if the copy fails with an error. This can be used to allow userland to query the necessary length of an “old” buffer
for a node that returns a variable-sized buffer. If it is expensive to generate the data copied to the “out” buffer and a handler is able to estimate the amount of space needed, then the handler can treat this case specially. A caller queries length by using a NULL pointer for the “old” buffer. The handler can detect this case by comparing req->oldptr against NULL. The handler can then make a single call to SYSCTL_OUT() passing NULL as the kernel buffer and the total estimated length as the length. If the size of the data changes frequently, then the handler should overestimate the size of the buffer so that the caller is less likely to get an ENOMEM error on the subsequent call to query the node’s state.

The SYSCTL_OUT() and SYSCTL_IN() macros can access memory in a user process. These accesses can trigger page faults if a user page is not currently mapped. For this reason, non-sleepable locks such as mutexes and reader/writer locks cannot be held when invoking these macros. Some control nodes return an array of state objects that correspond to a list of objects inside the kernel where the list is protected by a non-sleepable lock. One option such handlers can use is to allocate a temporary buffer in the kernel that is large enough to hold all of the output. The handler can populate the kernel buffer while it walks the list

the SCTL_MASK32 flag in req->flags. For example, a node that returns a long value should return a 32-bit integer in this case. A node that returns an array of structures corresponding to an internal list of objects may need to return an array of structures with an alternate 32-bit layout.

If a node allows the caller to alter its state via a “new” value, the handler should compare req->newptr against NULL to determine if a “new” value is supplied. A handler should only invoke SYSCTL_IN() and attempt to set a new value if req->newptr is non-NULL.

An example of a custom node handler that uses many of these features is the implementation of the “kern.proc.proc” node. The in-kernel implementation is more complex, but a simplified version is provided in Example 5.

Complex Branch Nodes
A branch node declared via SYSCTL_NODE() can specify a custom handler. If a handler is specified, then it is always invoked when any node whose address begins with the address of the branch node is accessed. The handler functions similarly to the custom handlers described above. Unlike SYSCTL_PROC(), the “arg1” and “arg2” parameters are not configurable. Instead, “arg1” points to an integer
under the lock and then pass the populated array containing the address of the node being
buffer to SYSCTL_OUT() at the end after accessed, and “arg2” contains the length of
releasing the lock. Another option is to drop the address. Note that the address specified by
the lock around each invocation of “arg1” and “arg2” is relative to the branch
SYSCTL_OUT() while walking the list. Some node whose handler is being invoked. For
handlers may not want to allocate a temporary example, if a branch node has the address 1.2
kernel buffer because it would be too large, and node 1.2.3.4 is accessed, the handler for
and they may wish to avoid dropping the lock the branch node will be invoked with “arg1”
because the resulting races are too painful to pointing to an array containing “3, 4” and
handle. The system provides a third option for “arg2” set to 2. A simplified version of the
these handlers: the “old” buffer of a request “kern.proc.pid” handler is given below as
can be wired by calling Example 6. Recall that this is the node invoked
sysctl_wire_old_buffer(). Wiring the by Example 1.
buffer guarantees that no accesses to the
buffer will fault allowing SYSCTL_OUT() to Dynamic Control Nodes
be used while holding a non-sleepable lock. The control nodes described previously are
Note that this option is only available for the static control nodes. Static nodes are defined
“old” buffer. There is no corresponding func- in a source file with a fixed name and are cre-
tion for the “new” buffer. The ated either when the kernel initializes the sys-
sysctl_wire_old_buffer() function tem control node subsystem or when a kernel
returns zero if it succeeds and an error number module is loaded. Static nodes in a kernel
if it fails. module are removed when the kernel module
If a sysctl node wishes to work properly in a is unloaded. The arguments passed to handlers
64-bit kernel when it is acccessed by a 32-bit for static nodes are also resolved at link time.
process, it can detect this case by checking for This means that static nodes generally operate
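The length accounting described for SYSCTL_OUT() can be modeled in userland. This is only a sketch under stated assumptions, not the kernel's implementation: struct model_req and model_out() are hypothetical stand-ins for the request structure and for SYSCTL_OUT(), and memcpy() takes the place of the copy into user memory. The only behavior assumed is what the text states: the running count always advances by the full length, nothing is copied when the “old” pointer is NULL, and a copy that does not fit reports ENOMEM.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userland stand-in for the kernel's request structure. */
struct model_req {
	char	*oldptr;	/* caller's "old" buffer, or NULL to query length */
	size_t	 oldlen;	/* capacity of the "old" buffer */
	size_t	 oldidx;	/* running count of bytes the handler produced */
};

/*
 * Model of SYSCTL_OUT(): always advance the running count by the full
 * length, copy only what fits, and report ENOMEM on a short copy.
 */
static int
model_out(struct model_req *req, const void *src, size_t len)
{
	size_t start, room;

	assert(req != NULL);
	start = req->oldidx;
	req->oldidx += len;		/* advanced even if the copy fails */
	if (req->oldptr == NULL)
		return (0);		/* length query: nothing is copied */
	room = (req->oldlen > start) ? req->oldlen - start : 0;
	if (room > len)
		room = len;
	if (room > 0 && src != NULL)
		memcpy(req->oldptr + start, src, room);
	return (room < len ? ENOMEM : 0);
}
```

In this model, a caller that passes a NULL “old” buffer finds the total required size in oldidx after the handler runs, and a caller whose buffer is too small still learns the full size along with the ENOMEM error.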
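The relative addressing used for branch node handlers amounts to stripping the branch node's own address from the front of the full address. The helper below is a hypothetical userland illustration (struct rel_addr and branch_relative() do not exist in the kernel); it only reproduces the arithmetic from the 1.2 / 1.2.3.4 example in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: the relative address a branch handler would see. */
struct rel_addr {
	const int	*arg1;	/* first component below the branch node */
	size_t		 arg2;	/* number of remaining components */
};

/*
 * Given the full address of the node being accessed and the length of the
 * branch node's own address, strip the branch prefix.
 */
static struct rel_addr
branch_relative(const int *mib, size_t miblen, size_t branchlen)
{
	struct rel_addr r;

	assert(mib != NULL && branchlen <= miblen);
	r.arg1 = mib + branchlen;
	r.arg2 = miblen - branchlen;
	return (r);
}
```

For the full address {1, 2, 3, 4} and a branch node of length 2, the result points at the components 3 and 4 with a length of 2, matching the description of “arg1” and “arg2” in the text.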
Example 5
static int
sysctl_kern_proc_proc(SYSCTL_HANDLER_ARGS)
{
#ifdef COMPAT_FREEBSD32
	struct kinfo_proc32 kp32;
#endif
	struct kinfo_proc kp;
	struct proc *p;
	int error;

	if (req->oldptr == NULL) {
#ifdef COMPAT_FREEBSD32
		if (req->flags & SCTL_MASK32)
			return (SYSCTL_OUT(req, NULL, (nprocs + 5) *
			    sizeof(struct kinfo_proc32)));
#endif
		return (SYSCTL_OUT(req, NULL, (nprocs + 5) *
		    sizeof(struct kinfo_proc)));
	}
Example 6
static int
sysctl_kern_proc_pid(SYSCTL_HANDLER_ARGS)
{
	struct kinfo_proc kp;
	struct proc *p;
	int *mib;
as when a driver attaches to a device). Only a reference to the context has to be maintained. A single call to sysctl_ctx_free() during teardown (such as when a driver detaches from a device) is sufficient to remove the entire group of control nodes.
Example 7
static struct sysctl_ctx_list ctx;

static int
load(void)
{
	static int value;
	int error;

	error = sysctl_ctx_init(&ctx);
	if (error)
		return (error);
	if (SYSCTL_ADD_INT(&ctx, SYSCTL_STATIC_CHILDREN(_debug), OID_AUTO,
	    "dynamic", CTLFLAG_RW, &value, 0, "An integer") == NULL)
		return (ENXIO);
	return (0);
}

static int
unload(void)
{
	return (sysctl_ctx_free(&ctx));
}
the parsed value. Note that overflows are silently ignored. If the tunable is not found or contains invalid characters, the integer variable is left unchanged. The macros provided for integers are: TUNABLE_INT() for signed integers, TUNABLE_LONG() for signed long integers, TUNABLE_ULONG() for unsigned long integers, and TUNABLE_QUAD() for signed 64-bit integers.
The string value of an integer tunable is parsed in the same manner as strtol(3) with a base of zero. Specifically, a string that begins with “0x” is interpreted as a hexadecimal value, a string that begins with “0” is interpreted as an octal value, and all other strings are interpreted as a decimal value. In addition, the string may contain an optional single character suffix that specifies a unit. The value is scaled by the size of the unit. The unit is case-insensitive. Supported units are described in Table 4.
String tunables are also supported by the TUNABLE_STR() macro. This macro accepts three arguments: the name of the tunable, a pointer to a character buffer, and the length of the character buffer. If the tunable does not exist in the kernel environment, the character buffer is left unchanged. If the tunable does exist, its value is copied into the buffer. The string in the buffer is always terminated with a null character. The value will be truncated if it is too long to fit into the buffer.
The TUNABLE_*_FETCH() macros accept the same arguments as the corresponding TUNABLE_*() macro. They also have the same semantics with one additional behavior. These macros return an integer value of zero if the tunable is found and successfully parsed, and non-zero otherwise.
System control nodes that have a corresponding tunable should use either the CTLFLAG_RDTUN or CTLFLAG_RWTUN flag to specify the allowed access to the node. Note that this does not cause the system to implicitly fetch a tunable based on the node’s name. The tunable must be fetched explicitly. However, it does provide a hint to the sysctl(8) utility that is used in diagnostic messages.
Example 8 demonstrates the use of a tunable in a device driver to fetch a default parameter. The parameter is available as a read-only control node that can be queried by the user (this is helpful for the user when determining the default value). It also includes a portion of the attach routine where the global tunable is used to set the initial value of a per-device control variable. A dynamic sysctl is created for each device to allow the variable to be changed for each device independently. The sysctl is stored in the per-device sysctl tree created by the new-bus subsystem. It also uses the per-device sysctl context so that the sysctl is automatically destroyed when the device is detached.
Example 8
static SYSCTL_NODE(_hw, OID_AUTO, foo, CTLFLAG_RD, NULL,
    "foo(4) parameters");

static int
foo_attach(device_t dev)
{
	struct foo_softc *sc;
	char descr[64];

	sc = device_get_softc(dev);
	sc->widgets = foo_widgets;
	snprintf(descr, sizeof(descr), "Number of widgets for %s",
	    device_get_nameunit(dev));
	SYSCTL_ADD_INT(device_get_sysctl_ctx(dev),
	    SYSCTL_CHILDREN(device_get_sysctl_tree(dev)), OID_AUTO,
	    "widgets", CTLFLAG_RW, &sc->widgets, 0, descr);
	...
}
The interface for system control nodes is defined in <sys/sysctl.h> and the implementation can be found in sys/kern/kern_sysctl.c. It may be particularly useful to examine the implementation of the predefined handlers. First, they demonstrate typical uses of SYSCTL_IN() and SYSCTL_OUT(). Second, they can be used to marshal data in custom handlers.
The interface for tunables is defined in <sys/kernel.h> and the implementation can be found in sys/kern/kern_environment.c.
John Baldwin joined the FreeBSD Project as a committer in 1999. He has worked in several areas of the system, including SMP infrastructure, the network stack, virtual memory, and device driver support. John has served on the Core and Release Engineering teams and organizes an annual FreeBSD developer summit each spring.
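The parsing rules for integer tunables described above can be tried outside the kernel. The following is a minimal userland sketch, not the code in kern_environment.c: parse_tunable() is a hypothetical helper, its return convention (0 on success, -1 on failure) is its own, and the k/m/g/t suffixes with 1024-based scaling are an assumption, since Table 4 is not reproduced in this part of the article.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/*
 * Parse an integer tunable string. Returns 0 and stores the value on
 * success; returns -1 and leaves *out unchanged on invalid input.
 * The k/m/g/t suffixes and their 1024-based scaling are assumptions.
 * Overflow is not checked; callers are assumed to pass small values.
 */
static int
parse_tunable(const char *s, long long *out)
{
	char *end;
	long long v;

	assert(s != NULL && out != NULL);
	v = strtoll(s, &end, 0);	/* base 0: "0x" hex, "0" octal, else decimal */
	if (end == s)
		return (-1);		/* no digits at all */
	switch (tolower((unsigned char)*end)) {
	case 't':
		v *= 1024;
		/* FALLTHROUGH */
	case 'g':
		v *= 1024;
		/* FALLTHROUGH */
	case 'm':
		v *= 1024;
		/* FALLTHROUGH */
	case 'k':
		v *= 1024;
		end++;			/* consume the single unit character */
		break;
	case '\0':
		break;
	default:
		return (-1);		/* unknown suffix */
	}
	if (*end != '\0')
		return (-1);		/* trailing characters after the unit */
	*out = v;
	return (0);
}
```

With this helper, the string “0x10” parses as 16, “010” as 8, and “64k” as 65536, while a string such as “12abc” is rejected and the caller’s variable is left untouched.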
FreeBSD and Commercial Workloads: Managed Services
BY JOSEPH KONG
Figure 1: A basic PF, CARP, and pfsync setup.4
Managed Services at NYI
Figure 3: Demonstrating how HAProxy works.6
In Figure 3, HAProxy accepts a request from the external network (that is, the Internet) and forwards it to the least-used web server within the internal network.
CARP, as mentioned previously, is what allows a backup system to assume the identity of a primary. To ensure high availability for its three largest managed setups (Men’s Journal, Rolling Stone, and Us Magazine), NYI employs a pair of machines running HAProxy with CARP. If the primary load-balancing machine fails, the backup will assume the identity of the primary and take over.
From this example, we can see that CARP can provide failover redundancy for systems beyond just firewalls. As another example, NYI employs CARP with some of its managed Internet Protocol Security (IPsec) virtual private networks (VPNs).
Disaster Recovery
Backup and data recovery have long been standard data center disciplines and are equally important for providers of managed services. Any data loss has the potential to significantly impact the profitability of a company. NYI takes advantage of FreeBSD’s GEOM mirroring to mitigate this risk.
GEOM mirroring is FreeBSD’s way of implementing RAID 1, which creates reliable data storage by generating an exact copy (or mirror) of a data set on two or more disk drives. When a drive fails, the data remains available because it can be provided by the other functioning drives, allowing administrators to replace the failed drive without interrupting their users.
One interesting feature of GEOM mirroring is that it can also be used to quickly clone servers. At NYI the process is as follows:
1. Remove a drive from the mirror.
2. Execute fsck(8) on the drive; this checks the consistency and repairs any damaged file systems on the drive.
3. Mount the drive in order to adjust any settings; mounting a drive makes it accessible through the operating system’s file system.
4. Adjust settings as needed.
5. Unmount the drive.
6. Put the drive into a new server.
Steps three through five can be omitted if no settings need to be adjusted. In addition to GEOM mirroring, NYI uses FreeNAS, ZFS, and rsync to perform offsite backups in order to mitigate the risk of data loss.
FreeNAS is based on an embedded version of FreeBSD and provides an open source network-attached storage (NAS) solution. NAS systems provide data storage to other devices on a network and communicate in terms of files, rather than in disk blocks.
In Figure 4, the end users read and write files to the NAS system over an Ethernet network. NYI employs multiple FreeNAS machines with over 20 TB (terabytes) of storage to house offsite backups.
ZFS is the file system used by FreeNAS (and optionally by FreeBSD). Its features include support for high storage capacities, protection against data corruption, continuous integrity checking, automatic repair of data, software RAID (RAID-Z), instantaneous file system snapshots, and more. In short, ZFS is designed for data integrity from top to bottom, which is desirable when managing backups.
rsync is the network protocol that NYI uses to back up a machine’s entire file system offsite to a FreeNAS machine. rsync minimizes data transfer by using delta encoding, which transmits data in the form of differences rather than complete files. After the first full backup, rsync will only transfer the differences between the local and backed-up copy.
Compartmentalization
The desire to establish a clean and clear-cut separation between services, for security purposes, has always been a challenge for system administrators. Traditional Unix systems provide chroot(2); however, chroot(2) has a number of limitations (for example, it does not defend against intentional tampering by the root user). FreeBSD has modified and improved on the traditional chroot(2) concept with jails.
FreeBSD jails compartmentalize the system. Each jail is a virtual environment running on the host with its own set of files, processes, users, and root user. Unlike a chroot(2) environment, which restricts processes to a particular view of the file system, jails restrict what a process can do in relation to the rest of the system. Jailed processes are sandboxed.8
Here is an example usage for jails: A small-scale managed customer of NYI, Expand the Room, requested a staging and production environment for a website they were developing. The solution was a FreeBSD machine with two jails that were identical in every way except for IP address and hostname.
NYI also uses jails internally for hardware-efficient Domain Name System (DNS) servers. For example, a FreeBSD machine may contain a jail for recursive DNS, a second jail for authoritative customer DNS, and a third jail for authoritative NYI DNS. Each of these jailed DNS servers then uses CARP within a cluster of machines to ensure failover redundancy. Figure 5 demonstrates this.
In Figure 5, two servers (A and B) are used to provide three distinct services. Each service is contained within its own jail (jail0, jail1, or jail2) and uses CARP to ensure high availability.
Figure 5: Jailed services with CARP.
Customers
Managing customer expectations is always a challenge. Customers expect and demand things to just work with (near) 100% uptime. This is exacerbated by the fact that managed services providers cannot control the client software that their customers employ, which makes compatibility an issue. FreeBSD’s open source nature helps address this challenge.
For example, one of NYI’s largest managed customers used a particularly buggy SSH client, which expected the challenge-response password prompt to be “Password: ” (note the space). However, the prompt in FreeBSD is “Password:” (without a space after the colon) and this caused the SSH client to fail at authenticating. Since FreeBSD is open source, NYI could easily patch FreeBSD’s SSH server to include a space in the password prompt, allowing their customer to continue using the client of their choice.
Conclusion
In 1996, when NYI was founded, FreeBSD was the only viable open source Unix. Today, FreeBSD continues to drive NYI for the reasons outlined in this article and more. Boris Kochergin, NYI’s Chief Rigor Officer, had these additional things to say about why NYI uses FreeBSD:
“FreeBSD has excellent documentation. The FreeBSD Handbook, which covers the day-to-day use of FreeBSD, is clear, concise, and provides an easy means for administrators to learn the system. FreeBSD is open source and the code is well organized, so it’s easy and possible to fully understand it. Finally, FreeBSD continues the BSD legacy of empowering the Internet!”9
Joseph Kong is a self-taught computer enthusiast who dabbles in the fields of exploit development, reverse code engineering, rootkit development, and systems programming (FreeBSD, Linux, and Windows). He is the author of the critically acclaimed Designing BSD Rootkits and FreeBSD Device Drivers. For more information about Joseph Kong visit www.thestackframe.org or follow him on Twitter @JosephJKong.
About NYI
Established in 1996, NYI is headquartered in the heart of Wall Street. Its core services include colocation, dedicated servers, web and email hosting, managed services, turnkey disaster recovery, and business continuity solutions. NYI owns and maintains its own data centers and with its high-bandwidth connectivity partners (Zayo, Verizon Business, Optimum Lightpath, AT&T, Level 3, and GTT), NYI specializes in mission-critical data services for the financial, architectural, fashion, law, life sciences, media, and real estate industries. NYI is SSAE 16 Type II-compliant as well as being PCI and HIPAA compliant. For more information about NYI, visit www.nyi.net.
The FreeBSD Foundation is a 501(c)(3) nonprofit organization dedicated to supporting the FreeBSD Project. The Foundation gratefully accepts donations from individuals and businesses, using them to fund projects that further the development of the FreeBSD operating system. In addition, the Foundation can represent the FreeBSD Project in executing contracts, license agreements, and other legal arrangements that require a recognized legal entity. The FreeBSD Foundation is entirely supported by donations. For more information about the Foundation visit www.freebsdfoundation.org.
FOOTNOTES
1. “Managed services,” last modified August 01, 2013, http://en.wikipedia.org/wiki/Managed_services.
2. “Symantec Internet Security Threat Report: Trends for July–December ’07,” April, 2008, p. 29.
3. “PF: Firewall Redundancy with CARP and pfsync,” last modified May 01, 2013, http://www.openbsd.org/faq/pf/carp.html.
4. Figure 1 is adapted from “Firewall Failover with pfsync and CARP,” accessed August 14, 2013, http://www.countersiege.com/doc/pfsync-carp/.
5. Figure 2 is adapted from “Firewall Failover with pfsync and CARP,” accessed August 14, 2013, http://www.countersiege.com/doc/pfsync-carp/.
6. Figure 3 is adapted from “HAProxy,” accessed August 14, 2013, http://haproxy.1wt.eu/.
7. Figure 4 is adapted from “Big data meets big storage,” accessed August 14, 2013, http://arstechnica.com/business/2011/05/isilon-overview/2/.
8. “FreeBSD jail,” last modified June 08, 2013, http://en.wikipedia.org/wiki/FreeBSD_jail.
9. The first widely-used TCP/IP implementation was from BSD.
BeagleBone Black
BY TIM KIENTZLE
The BeagleBone Black (BBB) is a $45 PC that fits in an Altoids tin. It is built on the same Texas Instruments “Sitara” chip as the earlier BeagleBone (which had a white circuit board), but is now much cheaper and significantly more capable.
As you can see from the BeagleBoard.org website, the potential of the BeagleBone Black (BBB) is immense. The sidebar below shows a few examples of possible applications. Best of all, the BBB can run a standard FreeBSD system with all the tools and support you’ve come to expect.
Possible Applications
• MICRO-SERVER: The 1GHz ARM Cortex A8 processor, 512MB of RAM, 10/100 Ethernet, and 2GB of flash are plenty to run a personal or small-office web or mail server.
• EMBEDDED: The BBB includes extensive GPIO and hardware expansion support and requires only 2.5 watts for basic operation. (USB requires additional power.)
• EDUCATION: The low price and expandability make BBB a good choice for learning about software and hardware development.
FreeBSD on BBB
Popular new ARM systems such as the BeagleBone and Raspberry Pi have generated a lot of developer interest in FreeBSD/ARM. In the last year, most parts of FreeBSD (boot loaders, kernel, toolchain, drivers, userland, and ports) have seen significant improvement on ARM platforms.
Right now, the development version of FreeBSD supports the BBB reasonably well:
• Serial console (requires an adapter cable similar to Adafruit #954).
• Full FreeBSD boot loader, including boot-time module loading and Forth scripting.
• Device-tree based kernel.
• Micro-SD and eMMC support.
• USB host support.
• Experimental USB client support (the BBB can act like a USB device).
• 10/100 Ethernet.
• FreeBSD/ARM now uses clang as the default compiler.
• FreeBSD/ARM now uses the EABI calling convention, which offers slightly better performance and better compatibility with other compilers; if you have binaries or libraries that were compiled before this change, you will have to recompile them.
• FreeBSD can rebuild and upgrade natively on the BBB.
• A growing number of ports build and run on the BBB.
Author’s Warning: I’m writing this in September 2013 based on the current status of the FreeBSD development branch. Much of the following will have changed by the time FreeBSD 10 is finally released. Ask on the FreeBSD/ARM or FreeBSD current mailing lists for more up-to-date information.
There are a few areas that still need improvement, however.
• The FreeBSD package team has plans for a public ARM package repository, but it is not yet available.
• Video and audio drivers have yet to be written.
• Expansion capes are not supported.
Your First FreeBSD Boot
To boot FreeBSD, you first build a micro-SD card with FreeBSD installed, and then boot the BBB from the micro-SD card.
What you’ll need:
• BeagleBone Black.
• 5v power supply or Mini-USB cable.
• Micro-SDHC card 4GB or larger.
• Serial cable such as Adafruit #954 or FTDI TTL-232R-3V3 (optional but highly recommended).
1. Build or Download a FreeBSD Image
Below, I’ll explain how you can build your own FreeBSD image. To get started, you can download an image from the FreeBSD.org website:
ftp://ftp.freebsd.org/pub/FreeBSD/snapshots/
http://ftp.freebsd.org/pub/FreeBSD/snapshots/
Caveat: “Snapshot” images are built from whatever source happened to be current in the FreeBSD development branch that day. Stable images will be released as soon as FreeBSD 10 is finalized, which is expected to occur before the end of 2013. Downloaded images are usually compressed; you’ll need to uncompress yours before you can copy it onto a micro-SD card.
the output device (of), and the block size to use when copying (bs). You don’t have to specify a block size, but the default setting results in very slow operation.
For example, if your micro-SD is connected as ‘da7’, then the full command will look like this:
$ dd if=FreeBSD-BeagleBone.img of=/dev/da7 bs=8m
Depending on how your system security is set up, you will probably have to run this command as root using ‘sudo’ or similar.
Once the micro-SD card is imaged, you can insert it into the BBB.
3a. (Optional but Recommended) Attach Serial Cable
Once you have the SD card built, you’re ready to hook up the BeagleBone Black and boot FreeBSD. Since FreeBSD doesn’t yet support the HDMI output on the BBB, you should consider using a serial cable so you can see what’s going on. Without a serial cable, you can wait until it boots and try to connect over SSH, but it’s much harder to diagnose if anything goes wrong.
The BBB has a low-voltage serial interface that requires a special adapter cable. Make certain you are using a 3.3v adapter, since similar cables come in 5v and 1.8v versions that will not work with BBB (Figure 1).
3b. Open a Terminal Window
The serial adapter is powered by USB from the host system, so it starts working as soon as it is plugged into the USB, even before the BBB has power.
To use it from FreeBSD, use the ‘cu’ utility, specifying the line speed of 115200 baud and the appropriate “tty” device:
$ sudo cu -s 115200 -l /dev/ttyU0
Of course, you won’t see anything until you actually apply power.
4. Hold the Boot Switch and Apply Power
At this point, you should NOT have any power connected to your BBB. If you’ve already connected a 5v power supply or a mini-USB cable, then unplug it and read the following carefully. (The detailed logic for when the BBB boots from eMMC or micro-SD is a little complicated. I’ve been confused many times when the BBB booted from the wrong source.)
The “boot switch” determines whether the BBB boots from eMMC (the default) or from micro-SD (Figure 2).
Figure 2. The boot switch is just above the micro-SD slot.
To boot from micro-SD reliably, you must:
* Hold down the boot switch
* Apply power
* Count to 3
* Release the boot switch
The BBB power chip remembers the boot switch status, so it will continue to boot and reboot from micro-SD until you disconnect the power supply entirely.
Hint: If you need to reboot, leave the power connected and tap the reset switch (Figure 3), which will reboot from the same source.
Figure 3. The reset switch is in the corner of the board at the Ethernet adapter end.
Hint: If you get random shutdowns and are powering with a mini-USB cable, try getting a separate 5v power supply. The BBB power requirements are just at the edge of what standard USB ports will provide.
Hint: If you see the four LEDs start flashing rapidly, you’ve booted the Linux image from eMMC. Remove power, hold the boot switch, and try again.
What You Should See When You Boot
If you’re familiar with how FreeBSD boots on i386 or amd64 PCs, then the BBB boot process will look very familiar, although there are a couple of differences. Most obviously, the initial boot stages are handled by “U-Boot”, a GPL boot loader project that supports a wide variety of hardware.
1. MLO/SPL: U-Boot First Stage
When the TI Sitara chip first initializes, it does not have access to the main RAM. As a result, the very first boot stage must fit into 128k of on-chip memory.
U-Boot SPL 2013.04 (Aug 03 2013 - 21:27:30)
OMAP SD/MMC: 0
reading bb-uboot.img
reading bb-uboot.img
U-Boot provides a small program called SPL which the TI Sitara chip loads from a file called “MLO”. This program is just enough to initialize the DRAM chip and load the main U-Boot program from the micro-SD card.
2. U-Boot Main Loader
U-Boot is a GPL-licensed boot loader that supports a wide variety of hardware. Although originally developed for Linux, U-Boot’s robust hardware support, scriptability, and active community make it a good choice for booting FreeBSD as well.
U-Boot starts by initializing the USB, network, boot FreeBSD on i386/amd64, but with a few
and MMC/SD interfaces. changes so that it works with U-Boot instead of
U-Boot 2013.04 (Aug 03 2013 - the PC BIOS (hence the name “ubldr” for “U-
21:27:30) Boot compatible LoaDeR”).
... other messages ... Consoles: U-Boot console
reading bb-uEnv.txt Compatible API signature found
reading bbubldr @8f246240
240468 bytes read in 33 ms (6.9 MiB/s) Card did not respond to voltage
reading bboneblk.dtb select!
14210 bytes read in 7 ms (1.9 MiB/s) Number of U-Boot devices: 2
Booting from mmc ... FreeBSD/armv6 U-Boot loader, Revision
## Starting application at 0x88000054 1.2
... (root@fci386.localdomain, Fri Aug 16
Once it has the MMC/SD initialized, it reads 12:59:51 PDT 2013)
three files into memory. DRAM: 256MB
* bb-uEnv.txt is empty by default, but you can Device: disk
edit this to redefine the U-Boot startup Loading /boot/defaults/loader.conf
functions. /boot/kernel/kernel
* bbubldr is the FreeBSD boot loader that will data=0x449864+0x17d3c8
be run next. syms=[0x4+0x82890+0x4+0x4ec85]
* bboneblk.dtb is the DTB file described Hit [Enter] to boot immediately, or
below. any other key for command prompt.
Booting [/boot/kernel/kernel]...
3. About the DTB File Using DTB provided by U-Boot.
Operating systems for newer embedded proces- Kernel entry at 0x80200100...
sors are increasingly using a “device tree” file— Kernel args: (null)
sometimes called a “flattened device tree”
(fdt)—to initialize the kernel. This file lists all the 5. Load loader.rc, loader.conf
peripherals and helps the kernel decide which Ubldr pulls in a lot of standard FreeBSD configu-
drivers to enable. Device trees are compiled: The ration. In particular, it reads loader.conf and pos-
source version is called DTS and the binary com- sibly loader.rc. These can be used to load kernel
piled version is called a DTB file. modules into memory so they are available when
The U-Boot initialization checks which hard- the kernel first boots.
ware you are currently running and then loads
the appropriate DTB file into memory. This data is 6. Load FreeBSD Kernel
not directly used by U-Boot or by ubldr, but is Ubldr can now load the FreeBSD kernel proper.
eventually passed to the FreeBSD kernel. The key
advantage of this arrangement: The exact same 7. Start FreeBSD Kernel
kernel can run on both BeagleBone and Once everything is ready, ubldr actually starts the
BeagleBone Black since key configuration such as FreeBSD kernel. The last lines printed by ubldr
the amount of RAM and number of drives is pro- indicate how it is going to launch the kernel:
vided by the DTB. Booting [/boot/kernel/kernel]...
Eventually, the FreeBSD/ARM developers hope Using DTB provided by U-Boot.
to have a single GENERIC kernel that boots on a Kernel entry at 0x80200100...
number of boards. This requires more work on Kernel args: (null)
the kernel to ensure that the various board sup-
port routines can coexist. It also requires more 8. Initialize FreeBSD Kernel
work on the boot loader side to ensure that all of Unlike ubldr, which relies heavily on U-Boot, the
the various loaders correctly provide a DTB file to FreeBSD kernel runs completely on its own.
the kernel. So it must first set up its own memory man-
agement and console handling. Once that is
done, the kernel can show its first message:
KDB: debugger backends: ddb
KDB: current backend: ddb
Copyright (c) 1992-2013 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 10.0-CURRENT #0 r254265: Fri Aug 16 12:58:43 PDT 2013
root@fci386.localdomain:/usr/..../src/sys/BEAGLEBONE arm
The kernel then proceeds to use the device tree data to identify each system that needs to be initialized.

4. FreeBSD Ubldr
U-Boot knows a lot about the BBB hardware and how to initialize it, but does not know anything about the FreeBSD kernel and modules.
So the BBB uses U-Boot to load “ubldr”. This is essentially the same as “BTX loader” used to

9. Start FreeBSD userland
After the FreeBSD kernel has finished initializing everything, it mounts the root filesystem so that it can load the first programs from the SD filesystem.
Here are the last messages printed by the kernel:
Trying to mount root from ufs:/dev/mmcsd0s2a [rw,noatime]...
warning: no time-of-day clock registered, system time will not be set accurately
(In particular, the warning here is expected, since the BBB does not have a battery-backed RTC.)
If you’ve used FreeBSD or Linux or any similar system before, the remaining boot steps should be quite familiar: The rc system runs a bunch of scripts to set up various standard systems, including network services such as SSHd and NTPd. The very first time you boot, this can take a little while, since some of these services need to set up their initial configurations. Most obviously, the SSHd service needs to create encryption keys for this particular machine.
Finally, the system is ready to accept logins.
Wed Sep 4 00:46:40 UTC 2013
FreeBSD/arm (beaglebone) (ttyu0)
login:
Most BBB images are set up to automatically configure the Ethernet port and start sshd. So you should be able to connect remotely using SSH at this point as well.

Using FreeBSD on the BeagleBone Black
The BBB runs a completely standard FreeBSD system, so if you’re comfortable with FreeBSD on i386 or amd64, then you should feel right at home.
Here are a few notes to help get you started:
Ethernet: The network interface is “cpsw0”. You can configure it with the ifconfig command or edit /etc/rc.conf to set it up on every boot. Most FreeBSD images should have DHCP enabled by default.
Time: Since the BBB does not have a battery-backed clock, you’ll need to either set the time manually on boot-up or use NTP to set the time from the network.
Disk: The external micro-SD interface is called “mmcsd0”. The standard FreeBSD images for BBB are formatted with two partitions:
* mmcsd0s1 is the FAT slice with U-Boot and other boot files
* mmcsd0s2 is the slice used by FreeBSD
The root partition on mmcsd0s2a is generally formatted with Soft Updates + Journaling (SU+J). SU+J allows the system to reboot quickly when power is removed and reapplied.
eMMC: The 2GB built-in eMMC chip is available as “mmcsd1”. By supporting 8-bit transfers, it is significantly faster than the micro-SD interface. The BBB ships with a Linux distribution installed on the eMMC, but you can easily reformat this and use it as an extra drive for FreeBSD. Soon, we expect to be able to install FreeBSD and boot it directly from eMMC.
SU+J: The BBB doesn’t have an Off button; you usually just remove power. This does lead to data loss if you have software running when you disconnect power. Using “UFS Soft Updates with Journaling” (UFS SU+J) does not prevent data from being lost, but does seem to do a good job of avoiding fatal filesystem corruption.
Swap: Although 512MB RAM is sufficient for many purposes, you will probably want to enable some swap. For a number of reasons, people are generally using a swap file on the root partition rather than a separate swap partition.
You can use “swapctl -l” to find out if the image you are using already has swap configured. If not, it’s easy to add a swap file:
1) Create the file: dd if=/dev/zero of=/usr/swap0 bs=1m count=768
2) Add the following line to /etc/fstab and reboot:
md none swap sw,file=/usr/swap0 0 0
Ports: If you have network access, then installing a ports tree is quite simple:
$ portsnap fetch
$ portsnap extract
You can then build and install ports as usual. For example, to install the Apache web server:
$ cd /usr/ports/www/apache24
$ make
$ make install
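The two swap-file steps in the Swap note above can be collected into a small script. The following is only a sketch: it prints the commands instead of running them, the function names are made up for illustration, and the chmod step is an addition not mentioned in the article (it is a common precaution so that other users cannot read the swap file).

```sh
#!/bin/sh
# Sketch: print the commands for adding a file-backed swap area.
# SWAPFILE and SIZE_MB default to the values used in the article.
SWAPFILE="${SWAPFILE:-/usr/swap0}"
SIZE_MB="${SIZE_MB:-768}"

# Print the /etc/fstab entry for a given swap file path.
fstab_swap_line() {
    printf 'md none swap sw,file=%s 0 0\n' "$1"
}

# Print the setup commands; nothing is executed here.
swap_setup_commands() {
    # Step 1: create the backing file (size in megabytes)
    printf 'dd if=/dev/zero of=%s bs=1m count=%s\n' "$1" "$2"
    # Addition (not in the article): restrict access to the swap file
    printf 'chmod 0600 %s\n' "$1"
    # Step 2: register the file in /etc/fstab, then reboot
    printf "echo '%s' >> /etc/fstab\n" "$(fstab_swap_line "$1")"
}

swap_setup_commands "$SWAPFILE" "$SIZE_MB"
```

Review the printed commands, then run them as root on the BBB. After the reboot, “swapctl -l” should list the new swap area.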
Packages: The FreeBSD package team does plan to provide ARM packages compatible with the new package-management tool ‘pkg’. As of September 2013 this hasn’t yet been implemented.
A number of individuals have had good success using Poudriere to automatically build their own package sets.
USB: USB generally works well on BBB. USB drives, USB network adapters, and printers have all been used successfully. There is one caveat, though: You should not plug any USB peripherals into the BBB unless the BBB is connected to a separate power supply. If you are powering the BBB from a mini-USB cable and try to connect any USB device, the BBB will most likely shut off.

Updating FreeBSD
Once you have FreeBSD up and running, you can download the FreeBSD source code and rebuild directly on the BBB.
Caveats:
* A full system rebuild on the BBB can take as much as two days, depending on a number of factors.
* A full source checkout is over 2G, so won’t fit on the eMMC.
* FreeBSD-CURRENT (also called the ‘head’ branch) is the current development branch; it has the newest features and the newest bugs.
You can use the ‘svnlite’ command (which is a standard part of FreeBSD now) to check out the source code from the FreeBSD project’s Subversion repository:
$ svnlite co http://svn.freebsd.org/base/head /usr/src
$ cd /usr/src
Read /usr/src/UPDATING, especially the summary information near the end that outlines common upgrade scenarios. Generally, a full upgrade from source looks like the following:
$ cd /usr/src
$ make buildworld
$ make kernel
<reboot>
$ cd /usr/src
$ mergemaster -p
$ make installworld
$ mergemaster
<reboot>
The UPDATING file also explains how to do partial updates, kernel-only updates, and some techniques for doing partial upgrades.

Building Your Own FreeBSD Image
If you are comfortable with the process for building and upgrading FreeBSD from source code, you can use the Crochet tool to build a custom BBB image on a fast i386 or amd64 machine. In particular, this makes it easy to track the most recent changes to FreeBSD as the support for BBB continues to improve.
Detailed instructions are at: https://github.com/kientzle/crochet-freebsd; the following is a quick summary:
1) Get Crochet. You’ll need the devel/git package installed, and then you can get a copy of the Crochet scripts:
$ git clone https://github.com/kientzle/crochet-freebsd
To update, use the “git pull” command from inside the source directory.
2) Create a configuration file beagleblack.sh with the following contents:
board_setup BeagleBone
option ImageSize 3900mb
option UsrSrc
option UsrPorts
FREEBSD_SRC=${TOPDIR}/src
The ‘option’ lines here preinstall a full FreeBSD source tree in /usr/src and a full ports tree in /usr/ports. Omitting those lines will result in a smaller image.
3) Build the image:
$ sudo ./crochet.sh -c beagleblack.sh
The script first checks whether you have all the necessary source code and tools. If any are missing, it will print instructions for obtaining them. Once it has all the pieces, a fast PC can compile a complete FreeBSD system and assemble the image in about an hour.

Tim Kientzle has been a FreeBSD committer for 10 years and a FreeBSD user for much longer than that. Most recently, he’s been working on image-building tools and boot support for BeagleBone and Raspberry Pi.
THE FUTURE OF STORAGE
BY ALLAN JUDE
The key difference is that ZFS is, in fact, more than just a file system, as it combines the roles of RAID controller, Volume Manager, and File System.
Most previous file systems were designed to be used on a single device. To overcome this, RAID controllers and volume managers would combine a number of disks into a single logical volume that would then be presented to the file system. A good deal of the power of ZFS comes from the fact that the file system is intimately aware of the physical layout of the underlying storage devices and, as such, is able to make more informed decisions about how to reliably store data and manage I/O operations.
ZFS was originally released as part of OpenSolaris. When Sun Microsystems was later acquired by Oracle, it was decided that continued development of ZFS would happen under a closed license. This left the community with ZFS v28 under the original Common Development and Distribution License (CDDL). In order to continue developing and improving this open source fork of ZFS, the OpenZFS project was created as a joint effort between FreeBSD, IllumOS, the ZFS-On-Linux project, and many other developers and vendors. This new OpenZFS (included in FreeBSD 8.4 and 9.2 or later) changed the version number to “v5000 - Feature Flags”, to avoid confusion with the continued proprietary development of ZFS at Oracle (currently at v34), and to ensure compatibility and clarity between the various open source versions of ZFS. Rather than continuing to increment the version number, OpenZFS has switched to “Feature Flags” as new features are added. The pools are marked with a property, feature@featurename, so that only compatible versions of ZFS will import the pool. Some of these newer properties are read-only backwards compatible, meaning that an older implementation can import the pool and read from it, but not write to it, because it lacks support for the newer features.
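To see how the feature@featurename properties described above appear on a real system, the following sketch prints two inspection commands and explains the state a feature can report. The pool name “tank” and the function names are placeholders. The three states (disabled, enabled, active) are taken from the zpool-features manual page rather than from this article.

```sh
#!/bin/sh
# Sketch: inspect feature flags on a pool. POOL is a hypothetical pool name.
POOL="${POOL:-tank}"

# Print the inspection commands instead of running them.
feature_flag_commands() {
    # Features supported by the installed ZFS implementation
    printf 'zpool upgrade -v\n'
    # All pool properties, filtered down to the feature@... entries
    printf 'zpool get all %s | grep feature@\n' "$1"
}

# Explain the value reported for a feature@ property.
describe_feature_state() {
    case "$1" in
        disabled) echo "not enabled on this pool" ;;
        enabled)  echo "enabled, but no on-disk changes made yet" ;;
        active)   echo "in use; needs an implementation that supports it" ;;
        *)        echo "unknown state" ;;
    esac
}

feature_flag_commands "$POOL"
```

A feature in the active state is the case that matters for compatibility: an implementation without that feature can import the pool read-only at best, and only if the feature is read-only compatible.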
What Makes ZFS Different?
The most important feature sets in ZFS are those designed to ensure the integrity of your data. ZFS is a copy-on-write (COW) file system, which means that data is never overwritten in place, but rather the changed blocks are written to a new location on the disk and then the metadata is updated to point to that new location. This ensures that in the case of a shorn write (where a block was being written and was interrupted before it could finish) the original version of the data is not lost or corrupted, as it would be in a traditional file system, where a power failure or system crash can leave the file in an inconsistent state containing a mix of new and old data. Copy-on-write also enables another powerful feature: snapshots. ZFS allows you to instantly create a consistent point-in-time snapshot of a dataset (and optionally of all its child datasets). The new snapshot takes no additional space (aside from a minuscule amount of metadata) and is read-only. Later, when a block is changed, the older block becomes part of the snapshot, rather than being reclaimed as free space. There are now two distinct versions of the file system, the snapshot (what the file system looked like at the time the snapshot was taken) and the live file system (what it looks like now). The only additional space consumed is that of the blocks that have been changed; the unchanged blocks are shared between the snapshot and the live file system until they are modified. These snapshots can be mounted to recover the older versions of the files that they contain, or the live file system can be rolled back to the time of the snapshot, discarding all modifications since the snapshot was taken. Snapshots are read-only, but they can be used to create a clone of a file system. A clone is a new live file system that contains all the data from its parent while consuming no additional space until it is written to.
These features protect your data from the usual problems: crashes, power failures, accidental deletion/overwriting, etc. However, what about the cases where the problem is less obvious? Disks can suffer from silent corruption, flipped bits, bad cables, and malfunctioning controllers. To solve these problems, ZFS calculates a checksum for every block it writes and stores that along with the metadata. When a block is read, the checksum is again calculated and then compared to the stored checksum; if the two values do not match, something has gone wrong. A traditional file system would have no way of knowing there was a problem, and would happily return the corrupted data. ZFS, on the other hand, will attempt to recover the data from the various forms of redundancy supported by ZFS. When an error is encountered, ZFS increments the relevant counters displayed by the zpool status command. If redundancy is available, ZFS will attempt to correct the problem and continue normally; otherwise, it will return an error instead of corrupted data. The checksum algorithm defaults to fletcher, but the SHA256 cryptographic hashing algorithm is also available, offering a much smaller chance of a hash collision in exchange for a performance penalty.

Future-proof Storage
ZFS is designed to overcome the arbitrary limits placed on previous file systems. For example, the maximum size of a single file on an EXT3 file system is 2^41 bytes (2 TiB), while on EXT4 the limit is 2^44 (16 TiB), compared to 2^55 (32 PiB) on UFS2, and 2^64 (16 EiB) on ZFS. EXT3 is limited to 32,000 subdirectories, with EXT4 limited to 64,000, while ZFS can contain up to 2^48 entries (files and subdirectories) in each directory. The limits in ZFS are designed to be so large that they will never be encountered, rather than just being good enough for the next few years.
Owing to the fact that ZFS is both the volume manager and the file system, it is possible to add additional storage devices to a live system and have the new space available on all the existing file systems in that pool immediately. Each top level device in a zpool is called a vdev, which can be a simple disk or a RAID transform, such as a mirror or RAID-Z array. ZFS file systems (called datasets) each have access to the combined free space of the entire pool. As blocks are allocated, the free space available to the pool (and file system) is decreased. This approach avoids the common pitfall with extensive partitioning where free space becomes fragmented across the partitions.

Doing It in Software Is Better?
Best practices dictate that ZFS be given unencumbered access to the raw disk drives, rather than a single logical volume created by a hardware RAID controller. RAID controllers will generally mask errors and attempt to solve them rather than reporting them to ZFS, leaving ZFS unaware that there is a problem. If a hardware RAID controller is used, it is recommended it be set to IT "Target" or JBOD mode, rather than providing RAID functionality. ZFS includes its
own RAID functionality that is superior.
When creating a ZFS Pool (zpool) there are a number of different redundancy levels to choose from: striping (RAID0, no redundancy), mirroring (RAID1 or better with n-way mirrors), and RAID-Z. ZFS mirrors work very much the same as traditional RAID1 (except you can place 3 or more drives into a single mirror set for additional redundancy). However, RAID-Z has some important differences compared to the analogous traditional RAID configurations (RAID5/6/50/60). Compared to RAID5, RAID-Z offers better distribution of parity and eliminates the “RAID5 write hole” in which the data and parity information become inconsistent after an unexpected restart. When data is written to a traditional RAID5 array, the parity information is not updated atomically, meaning that the parity must be written separately after the data has been updated. If something (like a power failure) interrupts this process, then the parity data is actually incorrect, and if the drive containing the data fails, the parity will restore incorrect data. ZFS provides 3 levels of RAID-Z (Z1 through Z3) which provide increasing levels of redundancy in exchange for decreasing levels of usable storage. The number of drive failures the array can withstand corresponds to the name, so a RAID-Z2 array can withstand two drives failing concurrently.
If you create multiple vdevs, for example, two separate mirror sets, ZFS will stripe the data across the two mirrors, providing increased performance and IOPS. Creating a zpool of two or more RAID-Z2 vdevs will effectively create a RAID60 array, striping the data across the redundant vdevs.
ZFS also supports the dataset property copies, which controls the number of copies of each block that is stored. The default is 1, but by increasing this value, ZFS will store each block multiple times, increasing the likelihood it can be recovered in the event of a failure or data corruption.

Faster Is Always Better!
In addition to providing very effective data integrity checks, ZFS is also designed with performance in mind. The first layer of performance is provided by the Adaptive Replacement Cache (ARC), which is resident entirely in RAM. Traditional file systems use a Least Recently Used (LRU) cache, which is simply a list of items in the cache sorted by when each object was most recently used. New items are added to the top of the list, and once the cache is full, items from the bottom of the list are evicted to make room for more active objects. An ARC consists of four lists: the Most Recently Used (MRU) and Most Frequently Used (MFU) objects, plus a ghost list for each. These ghost lists track recently evicted objects to prevent them from being added back to the cache. This increases the cache hit ratio by avoiding objects that have a history of only being used occasionally. Another advantage of using both an MRU and MFU is that scanning an entire file system would normally evict all data from an MRU or LRU cache in favor of this freshly accessed content. In the case of ZFS, since there is also an MFU that only tracks the most frequently used objects, the cache of the most commonly accessed blocks remains. The ARC can detect memory pressure (when another application needs memory) and will free some of the memory reserved for the ARC. On FreeBSD, the ARC defaults to a maximum of all RAM less 1 GB, but can be restricted using the vfs.zfs.arc_max loader tunable.
The ARC can optionally be augmented by a Level 2 ARC (L2ARC). This is one or more SSDs that are used as a read cache. When the ARC is full, other commonly used objects are written to the L2ARC, where they can be more quickly read back than from the main storage pool. The rate at which data is added to the cache devices is limited to prevent prematurely wearing out the SSD with too many writes. Writing to the L2ARC is limited by vfs.zfs.l2arc_write_max, except for during the “Turbo Warmup Phase”; until the L2ARC is full (the first block has been evicted to make room for something new), the write limit is increased by the value of
vfs.zfs.l2arc_write_boost. OpenZFS also features L2ARC compression controlled by the secondarycachecompress dataset property. This increases the effective size of the L2ARC by the compression ratio, but also increases read performance as data is read as quickly as possible but then decompressed, resulting in an even higher effective read speed. L2ARC compression only uses the LZ4 algorithm because of its extremely high decompression performance.

Fine-Grained Control
A great deal of the power of ZFS comes from the fact that each dataset has a set of properties that control how it behaves, and are inherited by its children. A common best practice is to set the atime property (which tracks the last access time for each file) to "off". This prevents having to write an update to the metadata of a file each time it is accessed. Another powerful feature of ZFS is transparent compression. It can be enabled and tuned per dataset, so one can compress /usr/src and /usr/ports but disable compression for /usr/ports/distfiles. OpenZFS includes a selection of different compression algorithms including: LZJB (modest compression, modest CPU usage), GZIP1-9 (better compression, but more CPU usage, adjustable), ZLE (compresses runs of 0s, useful in specific cases), and LZ4 (added in v5000, greater compression and less CPU usage than LZJB). LZ4 is a new BSD-licensed high-performance, multi-core scalable compression algorithm. In addition to better compression in less time, it also features extremely fast decompression rates. Compared to the default LZJB compression algorithm used by ZFS, LZ4 is 50% faster when compressing compressible data and over three times faster when attempting to compress incompressible data. The performance on incompressible data is a large improvement; this comes from an “early abort” feature. If ZFS detects that the compression savings is less than 12.5%, then compression is aborted and the block is written uncompressed. In addition, decompression is approximately 80% faster; on a modern CPU, LZ4 is capable of compression at 500 MB/s and decompression at 1500 MB/s per CPU core. These numbers mean that for some workloads, compression will actually give increased performance, even with the CPU usage penalty, because data can be read from the disks at the same speed as uncompressed data, but then once decompressed, provides a much higher effective throughput. This also means it is now possible to use dataset compression on file systems that are storing databases, without a heavy latency penalty. LZ4 decompression at 1.5 GB/s on 8k blocks means the additional latency is only 5 microseconds, which is an order of magnitude faster than even the fastest SSDs currently available.
ZFS also provides very fast and accurate dataset, user and group space accounting in addition to quotas and space-reservations. This gives the administrator fine grained control over how space is allocated and allows critical file systems to reserve space to ensure other file systems do not take all of the free space.
On top of all of this, ZFS also features a full suite of delegation features. Delegating various administrative functions such as quota control, snapshotting, replication, ACL management, and control over a dataset’s ZFS properties can increase security and flexibility and decrease an administrator’s workload. Using these features, it is possible to take consistent backups based on snapshots without root privileges. An administrator could also choose to use a separate dataset for each user’s home directory, and delegate control over snapshot creation and compression settings to that user.

Replication: Redundancy Beyond the Node
ZFS also features a powerful replication system. Using the zfs send and zfs receive commands it is possible to send a dataset (and optionally its children) to another dataset, another pool, or another system entirely. ZFS replication also supports incremental sends, sending only the blocks that have changed between a pair of snapshots. OpenZFS includes enhancements to this feature that provide an estimate of how much data will need to be sent, as well as feedback while data is being transferred. This is the basis of PCBSD’s Life Preserver feature. A planned feature for the future will also allow resumption of interrupted ZFS send/receive operations.

Harnessing the Power of Solid State Drives
In addition to the L2ARC read-cache discussed earlier, ZFS supports optional log devices, also known as ZFS Intent Log (ZIL). Some workloads, especially databases, require an assurance that the data they have written to disk has actually reached “stable storage.” These are called synchronous writes, because the system call does not return until the data has been safely written
to the disk. This additional safety traditionally comes at the cost of performance, but with ZFS it doesn’t have to. The ZIL accelerates synchronous transactions by using storage devices (such as SSDs) that are faster and have less latency compared to those used for the main pool. When data is being written and the application requests a guarantee that the data has been safely stored, the data is written to the faster ZIL storage, and then later flushed out to the regular disks, greatly reducing the latency of synchronous writes. In the event of a system crash or power loss, when the ZFS file system is mounted again, the incomplete transactions from the ZIL are replayed, ensuring all of the data is safely in place in the main storage pool. Log devices can be mirrored, but RAID-Z is not supported. When specifying multiple log devices, writes will be load balanced across all devices, further increasing performance. The ZIL is only used for synchronous writes, so will not increase the performance of (nor be busied by) asynchronous workloads.
OpenZFS has also gained TRIM support. Solid State Disks (SSDs) work a bit differently than traditional spinning disks. Due to the way that flash cells wear out over time, the SSD’s Flash Translation Layer (FTL), which makes the SSD appear to the system like a typical spinning disk, often moves data to different physical locations in order to wear the cells evenly, and to work around worn-out cells. In order to do this effectively, the SSD’s FTL needs to know when a block has been freed (the data stored on it can be overwritten). Without information as to which blocks are no longer in use, the SSD must assume that any block that has ever been written is still in use, and this leads to fragmentation and greatly diminished performance.
When initializing new pools and adding a device to an existing pool, ZFS will perform a whole-device TRIM, erasing all blocks on the device to ensure optimum starting performance. If the device is brand new or has previously been erased, setting the vfs.zfs.vdev.trim_on_init sysctl to 0 will skip this step. Statistics about TRIM operations are exposed by the kstat.zfs.misc.zio_trim sysctl. In order to avoid excessive TRIM operations and increasing wear on the SSD, ZFS queues the TRIM command when a block is freed, but waits (by default) 64 transaction groups before sending the command to the drive. If a block is reused within that time, it is removed from the TRIM list. The L2ARC also supports TRIM, but based on a time limit instead of number of transaction groups.

OpenZFS: Where Is It Going Next?
The recently founded OpenZFS project (open-zfs.org) was created with the expressed goals of raising awareness about open source ZFS, encouraging open communication between the various implementations and vendors, and ensuring consistent reliability, functionality, and performance among all distributions of ZFS. The project also has a number of ideas for future improvements to ZFS, including: resumable send/receive, ZFS channel programs to allow multiple operations to be completed atomically, device removal, unified ashift handling (for 4k sector “advanced format” drives), increase maximum record size from 128KB to 1MB (preferably in a way compatible with Oracle ZFS v32), platform agnostic encryption, and improvements to deduplication.

Allan Jude is VP of operations at ScaleEngine Inc., a global HTTP and Video Streaming Content Distribution Network, where he makes extensive use of ZFS on FreeBSD. He is also the on-air host of the video podcasts “BSD Now” with Kris Moore, and “TechSNAP” on JupiterBroadcasting.com. Previously he taught FreeBSD and NetBSD at Mohawk College in Hamilton, Canada, and has 12 years of BSD unix sysadmin experience.
The First PORTS REPORT by Thomas Abthorpe
elcome to the inaugural Ports review for completeness. Do not get frustrated;
NEW PORTS COMMITTERS
It is a long-standing joke that if you submit too many PRs, fix too many ports, and contribute on the mailing lists in a helpful manner, you get punished with a commit bit. So in recent months we have punished the following: John Marino (marino@), a contributor to many BSD projects, notably DragonflyBSD, in which he is responsible for DPorts; Rusmir Dusko (nemysis@), who has concurrently been working with both FreeBSD ports and PC-BSD PBIs; David Chisnall (theraven@), who has spent recent years as a src committer, and with his wealth of experience will be instrumental in getting ports working in the upcoming FreeBSD 10 release cycle; and Danilo Gondolfo, a long-time contributor to the ports tree.

TIPS FOR PROSPECTIVE PORTERS
If you are fortunate enough to maintain a port that just builds with little or no manipulation, then you are quite lucky. This is not the case with all ports. You will often need to patch snippets of code to make it run on FreeBSD. One of the most tedious aspects of this task is maintaining the list of patches in the files subfolder of your port. Instead of running diff manually to generate your patches, run “make makepatch” from your port, which will assemble all the patches for you. Please also remember to share your patches with the developer of your port, as this will ensure ongoing compatibility and portability.
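As a rough illustration of the “make makepatch” tip, the sketch below prints one possible sequence of steps. The port directory and file name are hypothetical. Saving a .orig copy before editing reflects how makepatch is understood to find changed files; the tip itself does not spell this out, so check the Porter's Handbook for your ports tree.

```sh
#!/bin/sh
# Sketch of a makepatch workflow. PORTDIR and FILE are hypothetical examples.
PORTDIR="${PORTDIR:-/usr/ports/category/portname}"
FILE="${FILE:-src/main.c}"

# Print the steps; run them by hand inside a real ports tree.
makepatch_steps() {
    portdir=$1
    file=$2
    printf 'cd %s\n' "$portdir"
    printf 'make patch\n'                      # extract and apply existing patches
    printf 'cd "$(make -V WRKSRC)"\n'          # move into the unpacked source
    printf 'cp %s %s.orig\n' "$file" "$file"   # keep a pristine copy for the diff
    printf '${EDITOR:-vi} %s\n' "$file"        # make the FreeBSD-specific change
    printf 'cd %s\n' "$portdir"
    printf 'make makepatch\n'                  # regenerate patches under files/
}

makepatch_steps "$PORTDIR" "$FILE"
```

After the last step, review the generated files in the port's files directory before submitting them upstream or in a PR.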
Thomas Abthorpe is a server administrator with over 20 years in the industry. He got his Ports commit bit
August 2007, joined the Ports Management Team in March 2010, and was elected to FreeBSD Core Team in
July 2012. When he is not busy doing FreeBSD business, he volunteers as an apprentice bicycle mechanic with
Bicycles for Humanity.
svn update by Glen Barber
The FreeBSD 10-RELEASE cycle is in high gear, and with 9.2-RELEASE officially available, 10 is the primary focus of the Release Engineering team.
In addition to bug fixes and stability enhancements, FreeBSD 10-RELEASE will contain a number of exciting new features.

VIRTIO SUPPORT
The VirtIO module provides a shared memory transport between the virtual machine and the hypervisor. This shared memory transport is called the "virtqueue."
The VirtIO PCI driver creates an emulated PCI device that is then made available to the virtual machine. The emulated PCI devices use the virtqueue to directly access memory allocated to the device, resulting in a performance gain within the virtualized environment.
VirtIO was originally developed for the Linux KVM, but has since been adapted to other virtual machine hypervisors, such as BHyVe, VirtualBox, and Qemu.
VirtIO support was added in revision (Link: r227652).

BHyVe
BHyVe is the BSD Hypervisor, developed by Peter Grehan and Neel Natu. The design goal of BHyVe is to offer a lightweight paravirtualization environment on FreeBSD.
BHyVe requires Intel CPUs with VT-x and Extended Page Table (EPT) support. These features are on all Nehalem CPUs and newer, but not available on Atom CPUs.
BHyVe appeared in FreeBSD 10-CURRENT in revision (Link: r245652).

RDRAND
RDRAND is the Intel CPU instruction used to access the hardware random number generator.
Software random number generators use seeded entropy obtained from various sources. For example, Ethernet interfaces and software interrupt handlers can be used as sources for entropy to seed random number generation. Hardware random number generators gather their entropy through physical means, such as thermal "noise" within the device. By using such unpredictable physical entropy sources, the hardware random number generator can gather a
CURRENT in revision (Link: r240135).

MULTI-PROCESSOR SUPPORT IN PF
Since originally being imported from OpenBSD, one of the performance limitations of PF (Packet Filter) was that it could only run bound to a single CPU. This meant that on multi-processor systems, PF could not take advantage of the additional CPUs, so PF would not necessarily show any performance gain when run on 2- or 24-core machines.
Work done on FreeBSD 10-CURRENT introduces multi-processor support to PF through fine-grain locking. This allows PF to take advantage of multiple CPUs on the system, which significantly improves performance.
Multi-processor support for PF was introduced in revision (Link: r240233).
The pf firewall, originally from OpenBSD, got upgraded to support fine-grain locking and better utilization on multi-cpu machines, which allows it to perform significantly faster.

UNMAPPED IO IN DISK DRIVERS
The FreeBSD kernel maps I/O buffers in the kernel page table. On multi-core systems, the mapping must be flushed on all TLBs (translation lookaside buffers) due to this global mapping. When the number of cores on the system increases, there is a performance bottleneck, since during buffer creation and destruction, the
initiating thread must wait for all other cores on the system to execute.
FreeBSD 10 introduces unmapped I/O buffers, which eliminate the need to perform translation lookaside buffer shootdown for buffer creation and destruction, eliminating up to 30% of system time on I/O-intensive workloads.
Unmapped I/O support was initially introduced in revision (Link: r248508) for the ahci(4) and md(4) drivers. Support for additional drivers fol-

RASPBERRY PI AND BEAGLEBONE SUPPORT
FreeBSD 10 runs on the Raspberry Pi, BeagleBone, and several other embedded platforms. Although
and a number of other platforms. Crochet can be found here (Link: https://github.com/kientzle/crochet-freebsd).
Raspberry Pi support was introduced in revision (Link: r239922).

CLANG AS THE DEFAULT COMPILER
GCC is no longer part of the default base system on most architectures. The FreeBSD Project has switched from GCC to CLANG as the default compiler. This provides FreeBSD with a more modern, actively-developed default compiler.
Although GCC is not built by default, it is still available in the FreeBSD 10 base system.
The change to disable GCC by default was concluded with revision (Link: r255348).

engineering in the Project. Glen lives in Pennsylvania, USA.
BSDCAN 2014
The 11th Annual BSDCan!
THE TECHNICAL BSD CONFERENCE. High Value, Low
Cost, Something for Everyone! BSDCan, a BSD conference
held in Ottawa, Canada, has quickly established itself as the
technical conference for people working on and with 4.4BSD-based
operating systems and related projects. The organizers
have found a fantastic formula that appeals to a wide range
of people from extreme novices to advanced developers.
Jan. 2012
FreeBSD 9.0-RELEASE was announced on January 12, 2012. Being a “dot zero” release, this one was chock-full of new features. One of the interesting aspects of this release is that a number of the larger frameworks, which take significant developer time to design, implement, and test, were sponsored by the FreeBSD Foundation, often in collaboration with other organizations. For example, the Capsicum framework for application sandboxing is the result of collaboration between the University of Cambridge Computer Laboratory, Google, and the FreeBSD Foundation in which FreeBSD became the reference implementation for new research in application security.
The pluggable congestion control framework, along with five new congestion control algorithms, is the result of collaboration between the Foundation and the Swinburne University of Technology’s Centre for Advanced Internet Architectures. It represents the cutting edge of TCP congestion control research, which is needed in this ever-changing world of networking technologies.
The Foundation collaborated with OMCnet Internet Service GmbH and TransIP BV to implement the Highly Available Storage (HAST) framework, which allows for synchronous block-level replication of any storage media over a TCP/IP network. The Foundation also sponsored the porting of userland DTrace.
For better or worse, this release finally replaced the “interim” sysinstall framework that Jordan Hubbard introduced for 2.0.5-RELEASE in mid-1995.
The FreeBSD Project dedicated this release to the memory of Dennis M. Ritchie, one of the founding fathers of the UNIX operating system and creator of the C programming language.
Jan. 2009
FreeBSD 7.1-RELEASE was announced on January 5, 2009. Being the second release in the 7.x series, it didn’t introduce too many new features. However, some of the changes it did introduce remind us how far computing has moved along since the turn of the century: the ability to boot from USB devices, the ability to boot from GPT, the ability to use the VESA BIOS for DPMS during suspend and resume, and the ability for traceroute(8) to display an AS number.
2008
The first hour of Marshall Kirk McKusick's course on FreeBSD kernel internals, based on his book, The Design and Implementation of the FreeBSD Operating System, was recorded and downloaded in 2008. This course has been given at BSD Conferences and technology companies around the world. http://www.youtube.com/watch?v=nwbqBdghh6E
Jan. 2004
FreeBSD 5.2-RELEASE was announced on January 12, 2004. While many of us remember waiting with bated breath for a very long time for the much-anticipated 5.0 (SMP release) in 2003, the other releases in this branch averaged one every six months as the fledgling SMP support matured. While 5.2-RELEASE contained a number of significant stability and performance improvements over FreeBSD 5.1, it was still advertised as “a New Technology release that might not be suitable for all users.” That is a testament to both the cautious, let’s-not-break-production-usage philosophy of the Project and the amount of work and testing needed to move a code base from its uniprocessor assumptions to the new SMP world.
Jan. 1994
This was an interesting time for the newly minted FreeBSD Project. Its first 1.0-RELEASE had moved from EPSILON status and had been unleashed to the world on November 1, 1993. Its future was in a state of flux as the USL vs. BSDI lawsuit marched toward the settlement that was finally announced, minus most of the terms of the agreement, on February 6, 1994.
The settlement allowed the Project to continue its work on FreeBSD 1.1-RELEASE, which was announced on May 6, 1994. That announcement includes this text:
The FreeBSD team is very pleased to announce FreeBSD 1.1 Release, our second full distribution of the FreeBSD Operating System.
FreeBSD 1.1 represents a milestone in our free software efforts, both technically and legally. For quite some time, the future of BSD has been somewhat in doubt due to the UCB/USL lawsuit, and all Net/2 derived distributions have rested on uncertain legal ground. With the resolution of the lawsuit, and subsequent clarification and agreements from USL on our distribution terms, we can bring you this distribution without legal ambiguity, and with clear plans for a fully unencumbered future.
NEED AN EDGE?
BSD Certification can make all the difference.
Today's Internet is complex. Companies need individuals with proven skills to work on some of the most advanced systems on the Net. With BSD Certification YOU’LL HAVE WHAT IT TAKES!
BSD operating systems requires a serious level of knowledge and expertise.
SHOW YOUR STUFF!
Your commitment and dedication to achieving the BSD ASSOCIATE CERTIFICATION can bring you to the attention of companies that need your skills.
BSDCERTIFICATION.ORG
Providing psychometrically valid, globally affordable exams in BSD Systems Administration