JFFS : The Journalling Flash File System

David Woodhouse
Red Hat, Inc.
dwmw2@cambridge.redhat.com

Abstract

Until recently, the common approach to using Flash memory technology in embedded devices has been to use a pseudo-filesystem on the flash chips to emulate a standard block device and provide wear levelling, and to use a normal file system on top of that emulated block device.

JFFS is a log-structured file system designed by Axis Communications AB in Sweden specifically for use on flash devices in embedded systems, which is aware of the restrictions imposed by flash technology and which operates directly on the flash chips, thereby avoiding the inefficiency of having two journalling file systems on top of each other.

This paper will give an overview of the restrictions imposed by flash technology and hence the design aims of JFFS, and the implementation of both JFFS and the improvements made in version 2, including compression and more efficient garbage collection.

1 Introduction

1.1 Flash

Flash memory is an increasingly common storage medium in embedded devices, because it provides solid state storage with high reliability and high density, at a relatively low cost.

Flash is a form of Electrically Erasable Read Only Memory (EEPROM), available in two major types: the traditional NOR flash, which is directly accessible, and the newer, cheaper NAND flash, which is addressable only through a single 8-bit bus used for both data and addresses, with separate control lines.

These types of flash share their most important characteristics: each bit in a clean flash chip will be set to a logical one, and can be set to zero by a write operation.

Flash chips are arranged into blocks which are typically 128 KiB on NOR flash and 8 KiB on NAND flash. Resetting bits from zero to one cannot be done individually, but only by resetting (or “erasing”) a complete block. The lifetime of a flash chip is measured in such erase cycles, with the typical lifetime being 100,000 erases per block. To ensure that no one erase block reaches this limit before the rest of the chip, most users of flash chips attempt to ensure that erase cycles are evenly distributed around the flash; a process known as “wear levelling”.

Aside from the difference in erase block sizes, NAND flash chips also have other differences from NOR chips. They are further divided into “pages” which are typically 512 bytes in size, each of which has an extra 16 bytes of “out of band” storage space, intended to be used for metadata or error correction codes. NAND flash is written by loading the required data into an internal buffer one byte at a time, then issuing a write command. While NOR flash allows bits to be cleared individually until there are none left to be cleared, NAND flash allows only ten such write cycles to each page before leakage causes the contents to become undefined until the next erase of the block in which the page resides.

1.2 Flash Translation Layers

Until recently, the majority of applications of flash for file storage have involved using the flash to emulate a block device with standard 512-byte sectors, and then using standard file systems on that emulated device.

The simplest method of achieving this is to use a
simple 1:1 mapping from the emulated block device to the flash chip, and to simulate the smaller sector size for write requests by reading the whole erase block, modifying the appropriate part of the buffer, erasing and rewriting the entire block. This approach provides no wear levelling, and is extremely unsafe because of the potential for power loss between the erase and subsequent rewrite of the data. However, it is acceptable for use during development of a file system which is intended for read-only operation in production models. The mtdblock Linux driver provides this functionality, slightly optimised to prevent excessive erase cycles by gathering writes to a single erase block and only performing the erase/modify/writeback procedure when a write to a different erase block is requested.

To emulate a block device in a fashion suitable for use with a writable file system, a more sophisticated approach is required.

To provide wear levelling and reliable operation, sectors of the emulated block device are stored in varying locations on the physical medium, and a “Translation Layer” is used to keep track of the current location of each sector in the emulated block device. This translation layer is effectively a form of journalling file system.

The most common such translation layer is a component of the PCMCIA specification, the “Flash Translation Layer” [FTL]. More recently, a variant designed for use with NAND flash chips has been in widespread use in the popular DiskOnChip devices produced by M-Systems.

Unfortunately, both FTL and the newer NFTL are encumbered by patents, not only in the United States but also, unusually, in much of Europe and Australia. M-Systems have granted a licence for FTL to be used on all PCMCIA devices, and allow NFTL to be used only on DiskOnChip devices.

Linux supports both of these translation layers, but their use is deprecated and intended for backwards compatibility only. Not only are there patent issues, but the practice of using a form of journalling file system to emulate a block device, on which a “standard” journalling file system is then used, is unnecessarily inefficient.

A far more efficient use of flash technology would be permitted by the use of a file system designed specifically for use on such devices, with no extra layers of translation in between. It is precisely such a filesystem which Axis Communications AB released in late 1999 under the GNU General Public License.

2 JFFS Version 1

The design goals of JFFS are largely determined by the characteristics of flash technology and of the devices in which it is expected to be used: as embedded and battery-powered devices are often treated by users as simple appliances, we must ensure reliable operation when the system is uncleanly shut down.

2.1 Storage Format

The original JFFS is a purely log-structured file system [LFS]. Nodes containing data and metadata are stored on the flash chips sequentially, progressing strictly linearly through the storage space available.

In JFFS v1, there is only one type of node in the log; a structure known as struct jffs_raw_inode. Each such node is associated with a single inode. It starts with a common header containing the inode number of the inode to which it belongs and all the current file system metadata for that inode, and may also carry a variable amount of data.

There is a total ordering between all the nodes belonging to any individual inode, which is maintained by storing a version number in each node. Each node is written with a version higher than all previous nodes belonging to the same inode. The version is an unsigned 32-bit field, allowing for 4 milliard (2^32, about 4.3 × 10^9) nodes to be written for each inode during the life span of the file system. Because the limited lifetime of flash chips means this number is extremely unlikely to be reached, this limitation is deemed to be acceptable.

Similarly, the inode number is stored in a 32-bit field, and inode numbers are never reused. The same logic applies to the acceptability of this limitation, especially as it is possible to remove this restriction without breaking backwards compatibility of JFFS file systems, if it becomes necessary.

In addition to the normal inode metadata such as
uid, gid, mtime, atime, ctime etc., each JFFS v1 raw node also contains the name of the inode to which it belongs and the inode number of the parent inode (footnote 1).

Footnote 1: The lack of distinction between directory entries and inodes means that the original JFFS cannot support hard links.

Each node may also contain an amount of data, and if data are present the node will also record the offset in the file at which these data should appear. For reasons which are discussed later, there is a restriction on the maximum size of physical nodes, so large files will have many nodes associated with them, each node carrying data for a different range within the file.

Nodes containing data for a range in the inode which is also covered by a later node are said to be obsoleted, as are nodes which contain no data, where the metadata they contain has been outdated by a later node. Space taken by obsoleted nodes is referred to as “dirty space”.

Special inodes such as character or block devices and symbolic links which have extra information associated with them represent this information (the device numbers or symlink target string) in the data part of the JFFS node, in the same manner as regular files represent their data, with the exception that there may be only one non-obsolete node for each such special inode at any time. Because symbolic links and especially device nodes have small amounts of such data, and because the data in these inodes are always required all at once rather than by reading separate ranges, it is simpler to ensure that the data are not fragmented into many different nodes on the flash.

Inode deletion is performed by setting a deleted flag in the inode metadata. All later nodes associated with the deleted inode are marked with the same flag, and when the last file handle referring to the deleted inode is closed, all its nodes become obsolete.

2.2 Operation

The entire medium is scanned at mount time, each node being read and interpreted. The data stored in the raw nodes provide sufficient information to rebuild the entire directory hierarchy and a complete map for each inode of the physical location on the medium of each range of data.

JFFS v1 stores all this information at all times while the file system is mounted. Each directory lookup can be satisfied immediately from data structures held in-core, and file reads can be performed by reading immediately from the appropriate locations on the medium into the supplied buffer.

Metadata changes such as ownership or permissions changes are performed by simply writing a new node to the end of the log recording the appropriate new metadata. File writes are similar; differing only in that the node written will have data associated with it.

2.3 Garbage Collection

The principles of operation so far are extremely simple. The JFFS code happily writes out new jffs_raw_inode structures to the medium to mark each change made to the filesystem... until, that is, it runs out of space.

At that point, the system needs to begin to reclaim the dirty space which contains old nodes which have been obsoleted by subsequent writes.

The oldest node in the log is known as the head, and new nodes are added to the tail of the log. In a clean filesystem on which garbage collection has never been triggered, the head of the log will be at the very beginning of the flash. As the tail approaches the end of the flash, garbage collection will be triggered to make space.

Garbage collection will happen either in the context of a kernel thread which attempts to make space before it is actually required, or in the context of a user process which finds insufficient free space on the medium to perform a requested write. In either case, garbage collection will only continue if there is dirty space which can be reclaimed. If there is not enough dirty space to ensure that garbage collection will improve the situation, the kernel thread will sleep, and writes will fail with -ENOSPC errors.

The goal of the garbage collection code is to erase the first flash block in the log. At each pass, the node at the head of the log is examined. If the node is obsolete, it is skipped and the head moves on to
the next node (footnote 2). If the node is still valid, it must be rendered obsolete. The garbage collection code does so by writing out a new data or metadata node to the tail of the log.

Footnote 2: Actually, if the node was obsoleted the reference to it would already have been removed from the linked list of nodes which JFFS stores. The head pointer only ever points to a valid node. This is an implementation detail, though. The point is that valid nodes are obsoleted, and obsoleted nodes are ignored, either explicitly or implicitly.

The new node written will contain the currently valid data for at least the range covered by the original node. If there is sufficient free space, the garbage collection code may write a larger node than the one being obsoleted, in order to improve storage efficiency by merging many small nodes into fewer, larger nodes.

If the node being obsoleted is already partially obsoleted by later nodes which cover only part of the same range of data, some of the data written to the new node will obviously differ from the data contained in the original.

In this way, the garbage collection code progresses the head of the log through the flash until a complete erase block is rendered obsolete, at which point it is erased and becomes available for reuse by the tail of the log.

2.4 Housekeeping

The JFFS file system requires a certain amount of space to be available between the head and the tail of the log at all times, in order to ensure that it is always possible to proceed with garbage collection by writing out new nodes as described above.

A simplified analysis of this situation is as follows:

In order to be able to erase the next block from the head of the log, there must be sufficient space to write out new nodes to obsolete all the nodes in that block. The worst case is that all nodes in the block are valid, the first node starts at the very beginning of the block, and the final node starts just before the end of the block and extends into the subsequent block.

By restricting the maximum size of a data node to half the size of the flash erase sector, we limit the amount of free space required in this worst case scenario to one and a half times the size of the flash sectors in use.

In fact, the above is only an approximation: it ignores the fact that a name is stored with each node on the flash, and that renaming a file to a longer name will cause all nodes belonging to that file to grow when they are garbage collected (footnote 3).

Footnote 3: An attempt was made to limit this growth by counting the number of valid nodes containing the current name of each file, and writing out a name with a new node only if there were fewer than two such nodes. This attempt was abandoned because the initial implementation was buggy and could lead to a situation with no valid copies of a file name, and because it would not have solved the problem properly even if the hard-to-find bugs were located and fixed.

The precise amount of space which is required in order to ensure that garbage collection can continue is not formally proven and may not even be bounded with the current algorithms.

Empirical results show that a value of four flash sectors seems to be sufficient, while the previous default of two flash sectors would occasionally lead to the tail of the log reaching the head and complete deadlock of the file system.

2.5 Evolution

The original version of JFFS was used by Axis in their embedded devices in a relatively limited fashion, on the 2.0 version of the Linux kernel.

After the code was released, it was ported to the 2.3 development kernels by a developer in Sweden. Subsequently, Red Hat, Inc. were asked to port it to the 2.2 series and provide commercial support for a contract customer.

Although the design of the file system was impressive, certain parts of the implementation appeared not to have been tested by its use in Axis’ products. Writing data anywhere other than at the end of a file did not work, and deleting a file while a process had a valid file descriptor for it would cause a kernel oops.

After some weeks of reliability and compliance testing, JFFS reached stability. It is a credit to the clarity and quality of the original code that it was possible to become familiar with it and bring it to the current state in a relatively short space of time.
However, during this time it became apparent that there were a few serious flaws in the original implementation of the filesystem:

• Garbage collection would proceed linearly through the medium, writing out new nodes to allow it to erase the oldest block in the log, even if the block being garbage collected contained only clean nodes.

  In the relatively common situation where a 16 MiB file system contained 12 MiB of static data (libraries and program executables), 2 MiB of slack space and 2 MiB of dynamic data, the garbage collection would move the 12 MiB of static data from one place on the flash to another on every pass through the medium. JFFS provided perfect wear levelling (each block was erased exactly the same number of times) but this meant that the blocks were also erased more often than was necessary.

  Wear levelling must be provided, by occasionally picking on a clean block and moving its contents. But that should be an occasional event, not the normal behaviour.

• Compression was not supported by JFFS. Because of the cost of flash chips and the constant desire to squeeze more functionality into embedded devices, compression was a very important requirement for a large proportion of potential users of JFFS.

• Hard links were also not supported by the original version of the filesystem. While this lack was not particularly limiting, it was annoying, as was the fact that file names were stored with each jffs_raw_inode, potentially leading to unbounded space expansion upon renames.

3 JFFS2

In January of 2001, another customer required compression support in JFFS to be provided as part of a contract undertaken. After a period of discussion on the mailing list, it was concluded that the most appropriate course of action would be a complete reimplementation, allowing all of the above-mentioned deficiencies in the original implementation to be addressed.

The JFFS2 code was intended to be portable, in particular to eCos, Red Hat’s embedded operating system targeted at consumer devices [eCos]. For this reason, JFFS2 is released under a dual licence, both under GPL and the MPL-style “Red Hat eCos Public License”, to be compatible with the licence of the remainder of the eCos source.

Although portability was intended, no ports have yet been completed, and the current code is only usable with the 2.4 series of Linux kernels.

3.1 Node Format and Compatibility

While the original JFFS had only one type of node on the medium, JFFS2 is more flexible, allowing new types of node to be defined while retaining backward compatibility through use of a scheme inspired by the compatibility bitmasks of the ext2 file system.

Every type of node starts with a common header containing the full node length, node type and a cyclic redundancy checksum (CRC). The common node structure is shown in Figure 1.

    MSB                                         LSB
    +----------------------+----------------------+
    | Magic Bitmask        | Node Type            |
    | 0x19 0x85            |                      |
    +----------------------+----------------------+
    | Total Node Length                           |
    +---------------------------------------------+
    | Node Header CRC                             |
    +---------------------------------------------+

Figure 1: JFFS2 Common Node Header.

In addition to a numeric value uniquely identifying the node structure and meaning, the node type field also contains a bitmask in the most significant two bits which indicates the behaviour required by a kernel which does not support the node type used:

JFFS2_FEATURE_INCOMPAT: on finding a node with this feature mask which is not explicitly supported, a JFFS2 implementation must refuse to mount the file system.

JFFS2_FEATURE_ROCOMPAT: a node with this feature mask may be safely ignored by an implementation which does not support it, but the file system must not be written to.

JFFS2_FEATURE_RWCOMPAT_DELETE: an unsupported node with this mask may be safely ignored and the file system may be written to. Upon garbage collecting the sector in which it is found, the node should be deleted.

JFFS2_FEATURE_RWCOMPAT_COPY: an unsupported node with this mask may be safely ignored and the file system may be written to. Upon garbage collecting the sector in which it is found, the node should be copied intact to a new location.

It is an unfortunate matter of record that this compatibility bitmask was in fact the reason why it was necessary to break compatibility with older JFFS2 file systems. Originally, the CRC was omitted from the common node header, and it was discovered that because the INCOMPAT feature mask has more bits set than the other bitmasks, it is relatively easy, by interrupting erases, to accidentally generate a structure on the medium which looks like an unknown node with the INCOMPAT feature bit set. For this reason, a CRC on the node header was added, in addition to the existing CRCs on the node contents and on the data or name field if present.

3.2 Log Layout and Block Lists

Aside from the differences in the individual nodes, the high-level layout of JFFS2 also changed from a single circular log format, because of the problem caused by strictly garbage collecting in order. In JFFS2, each erase block is treated individually, and nodes may not overlap erase block boundaries as they did in the original JFFS.

This means that the garbage collection code can work with increased efficiency by collecting from one block at a time and making intelligent decisions about which block to garbage collect from next.

Each erase block may be in one of many states, depending primarily on its contents. The JFFS2 code keeps a number of linked lists of structures representing individual erase blocks. During the normal operation of a JFFS2 file system, the majority of erase blocks will be on the clean list or the dirty list, which represent blocks full of valid nodes and blocks which contain at least one obsoleted node, respectively. In a new filesystem, many erase blocks may be on the free list, and will contain only one valid node: a marker which is present to show that the block was properly and completely erased.

As mentioned previously, the garbage collection code uses the lists to choose a sector for garbage collection. A very simple probabilistic method is used to determine which block should be chosen, based on the jiffies counter. If jiffies % 100 is non-zero, a block is taken from the dirty list. Otherwise, on the one-in-one-hundred occasions that the formula is zero, a block is taken from the clean list. In this way, we optimise the garbage collection to re-use blocks which are already partially obsoleted, but over time, we still move data around on the medium sufficiently well to ensure that no one erase block will be worn out before the others.

3.3 Node Types

The third major change in JFFS2 is the separation between directory entries and inodes, which allows JFFS2 to support hard links and also removes the problem of repeating name information which was referred to in footnote 3.

At the time of writing there are three types of nodes defined and implemented by JFFS2. These are as follows:

JFFS2_NODETYPE_INODE: this node is most similar to the struct jffs_raw_inode from JFFS v1. It contains all the inode metadata, as well as potentially a range of data belonging to the inode. However, it no longer contains a file name or the number of the parent inode. As with traditional UNIX-like file systems, inodes are now entirely distinct entities from directory entries. An inode is removed when the last directory entry referring to it has been unlinked, and it has no open file descriptors.

    Data attached to these nodes may be compressed using one of many compression algorithms which can be plugged into the JFFS2 code. The simplest types are “none” and “zero”, which mean that the data are uncompressed, or that the data are all zero, respectively. Two compression algorithms were developed specifically for use in JFFS2, and also the JFFS2 code can contain yet another copy of the zlib compression library which is already present in at least three other places in the Linux kernel source (footnote 4).

    Footnote 4: This duplication is scheduled to be fixed fairly early during the 2.5 development series.

    In order to facilitate rapid decompression of data upon readpage() requests, nodes contain no more than a single page of data, according to the hardware page size on the target platform. This means that in some cases JFFS2 filesystem images are not portable between hosts, but this is not a serious problem because the nature of the flash storage medium makes transportation between devices unlikely. JFFS2 is also entirely host-endian in its storage of numbers larger than a single byte.

JFFS2_NODETYPE_DIRENT: this node represents a directory entry, or a link to an inode. It contains the inode number of the directory in which the link is found, the name of the link and the inode number of the inode to which the link refers. The version number in a dirent node is in the sequence of the parent inode. A link is removed by writing a dirent node with the same name but with target inode number zero, and obviously a higher version.

    POSIX requires that upon renaming, for example “passwd.new” to “passwd”, the replacement of the passwd link should be atomic: there should not be any time at which a lookup of that name shall fail to return either the old or the new target. JFFS2 meets that requirement, although as with many other file systems, the entire rename operation is not atomic.

    Renaming is performed in two stages. First a new dirent node is written, with the new name and the inode number of the inode being renamed. This atomically replaces the link to the original inode with the new one, and is identical to the way in which a hard link is created. Next, the original name is unlinked, by writing a dirent node with the original name and target inode number zero.

    This two-stage process means that at some point during the rename operation, the inode being renamed into place is accessible through both the old and the new names. This behaviour is permitted by POSIX: the atomicity guarantee required is for the behaviour of the target link only.

JFFS2_NODETYPE_CLEANMARKER: this node is written to a newly erased block to show that the erase operation has completed successfully and the block may safely be used for storage.

    The original JFFS simply assumed that any block which appeared at first scan to contain 0xFF in every byte was free, and would select the longest run of apparently free space at mount time to be the space between the head and tail of the log. Unfortunately, extensive power fail testing on JFFS proved this to be unwise. For many types of flash chips, if power is lost during an erase operation, some bits may be left in an unstable state, while most are reset to a logical one. If the initial scan happens to read all ones and treat a block containing such unstable bits as usable, then data may be lost, and such data loss may not even be avoidable by the naïve method of verification by reading back data immediately after writing, because the bit may just happen to return the correct value when read back for verification.

    Empirical results showed that even rereading the entire contents of the block multiple times in an attempt to detect unstable bits was not sufficiently reliable to avoid data loss, so an alternative approach was required. The accepted solution was to write the marker node to the flash block immediately after successful completion of an erase operation. Upon encountering flash blocks which do not appear to contain any valid nodes, JFFS2 will trigger an erase operation and subsequently write the appropriate marker node to the erased block.

    This node type was introduced after JFFS2 had started to be used in real applications, and uses the RWCOMPAT_DELETE feature bitmask to signify that an older JFFS2 implementation may safely ignore the node.

3.4 Operation

The operation of JFFS2 is at a fundamental level very similar to that of the original JFFS: nodes, albeit now of various types, are written out sequentially until a block is filled, at which point a new block is taken from the free list and writing continues from the beginning of the new block.
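The write path described above can be sketched in a few lines of C. This is an illustration only, not the JFFS2 implementation: the names `toy_fs`, `toy_init` and `toy_write_node`, the 64 KiB block size and the array used as a free list are all assumptions made for the example. It shows nodes being placed one after another in the current erase block, and a fresh block being taken from the free list as soon as the next node would not fit, since JFFS2 nodes may not cross an erase block boundary.

```c
#include <stdint.h>

#define BLOCK_SIZE 0x10000u   /* assumed erase block size (64 KiB) */
#define NR_BLOCKS  4          /* assumed number of erase blocks */

/* Hypothetical in-memory state for this sketch. */
struct toy_fs {
    int free_list[NR_BLOCKS]; /* indices of erased blocks, used as a stack */
    int nr_free;
    int cur_block;            /* block currently being filled, -1 if none */
    uint32_t cur_used;        /* bytes already written in cur_block */
};

static void toy_init(struct toy_fs *fs)
{
    for (int i = 0; i < NR_BLOCKS; i++)
        fs->free_list[i] = NR_BLOCKS - 1 - i; /* block 0 is handed out first */
    fs->nr_free = NR_BLOCKS;
    fs->cur_block = -1;
    fs->cur_used = 0;
}

/* Reserve room for one node of totlen bytes.
 * Returns the flash offset of the node, or -1 if no space is left. */
static long toy_write_node(struct toy_fs *fs, uint32_t totlen)
{
    totlen = (totlen + 3u) & ~3u;             /* nodes are 4-byte aligned */
    if (totlen == 0 || totlen > BLOCK_SIZE)
        return -1;

    /* Nodes never straddle two blocks: move on when this one will not fit. */
    if (fs->cur_block < 0 || fs->cur_used + totlen > BLOCK_SIZE) {
        if (fs->nr_free == 0)
            return -1;                        /* a real file system would garbage collect here */
        fs->cur_block = fs->free_list[--fs->nr_free];
        fs->cur_used = 0;
    }

    long ofs = (long)fs->cur_block * (long)BLOCK_SIZE + (long)fs->cur_used;
    fs->cur_used += totlen;
    return ofs;
}
```

In this sketch the space left at the end of a block when a node does not fit is simply abandoned, and running out of free blocks is reported as an error; both are simplifications for illustration.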
When the size of the free list reaches a heuristic threshold, garbage collection starts, moving nodes from an older block into the new block until space can be reclaimed by erasing the older one.

However, JFFS2 does not keep all inode information in core memory at all times. During mount, the full map is built as before, but the structures kept in memory are strictly limited to the information which cannot be recreated quickly on-demand. For each inode on the medium, there is a struct jffs2_inode_cache which stores its inode number, the number of current links to the inode, and a pointer to the start of a linked list of the physical nodes which belong to that inode. These structures are stored in a hash table, with each hash bucket containing a linked list. The hash function is a very primitive one, merely the inode number modulo the size of the hash table. The distribution of inode numbers means this should be well-distributed (footnote 5).

Footnote 5: The size of the hash table is variable at compile time, and in all cases is currently only one entry, which effectively means that all inode cache structures are stored in a single linked list. If and when this becomes noticeably suboptimal, it will be simple to correct.

Each physical node on the medium is represented by a smaller struct jffs2_raw_node_ref, also shown in Figure 2, which contains two pointers to other raw node references (the next one in the physical erase block and the next one in the per-inode list) and also the physical offset and total length of the node. Because of the number of such structures and the limited amount of RAM available on many embedded systems, this structure is extremely limited in size.

[Figure 2: Raw Node Reference Lists. The diagram shows three struct jffs2_raw_node_ref entries, each with the fields next_in_ino, next_phys, flash_offset (carrying the obsolete flag and the unused flag) and totlen, together with a struct jffs2_inode_cache with the fields NULL, next, nodes, ino and nlink.]

Because all nodes on the medium are aligned to a granularity of four bytes, the least significant two bits of the flash_offset field are redundant. They are therefore available for use as extra flags. The least significant bit is set to indicate that the node represented is an obsolete node, and the other is not yet used.

For garbage collection, it is necessary to find, given a raw node reference, the inode to which it belongs. It is preferable not to add four bytes containing this information to every such structure, so instead we play even more evil games with the pointers. Rather than having a NULL-terminated linked list for the next_in_ino list, the last raw node reference actually contains a pointer to the struct jffs2_inode_cache for the relevant inode. Because that structure has a NULL at the offset where the struct jffs2_raw_node_ref would have the pointer to the next node in the inode, the code traversing the list knows it has reached the end, at which point the pointer can be cast to the appropriate type and the inode number and other information can be read from the structure.

The NULL field shown in the inode cache structure is used only during the initial scan of the filesystem for temporary storage, and hence can be guaranteed to be NULL during normal operation.

During normal operation, the file system’s read_inode() method is passed an inode number and is expected to populate a struct inode with appropriate information. JFFS2 uses the inode number to look up the appropriate struct jffs2_inode_cache in a hash table, then uses the list of nodes to directly read each node which
belongs to the required inode, thereby building up evil hack referred to earlier, for detecting that the
a complete map of the physical locations of each end of the next_in_ino list has been reached.
range of the inode’s data, similar to the information
which JFFS would have kept in memory even while
it was unused. 3.6 Garbage Collection
Once the full inode structure has been populated in
this manner, it remains in memory until the kernel In JFFS2, garbage collection moves data nodes by
later tries to prune its inode cache under memory determining the inode to which the node to be
pressure, at which point the extra information is garbage collected belongs, and calling the Linux ker-
freed, leaving only the raw node references and the nel’s iget() function for the inode in question. Of-
minimal JFFS2 inode cache structure which were ten, the inode will be in the kernel’s inode cache —
originally present. but sometimes, this will cause a call to the JFFS2
read inode() function as described above.

3.5 Mounting Once the full inode structure is obtained, a replace-


ment node can be written to obsolete the original
node. If it was a data node, the garbage collection
Mounting a JFFS2 file system involves a four-stage routine calls the standard readpage() function for
operation. First, the physical medium is scanned, the page for which the node contains data — again
the CRCs on all the nodes are checked for validity, using the existing file system caching mechanism be-
and the raw node references are allocated. During cause the required page may already be in the page
this stage, the inode cache structures are also allo- cache. Then, as much of the page as possible is re-
cated and inserted into the hash table for each inode compressed and written out in a new node. A par-
for which valid nodes are found. tial page may be written if the node to be garbage
collected is small and there is not sufficient slack
Extra information from the nodes on the flash is space to allow a full page to be written, or if the
cached, such as the version and the range of data page being garbage collected is at the end of the
covered by the node, to prevent the subsequent inode.
stages of the mount process from having to read
anything again from the physical medium. One of the features which is strongly desired for
JFFS2 is a formal proof of correctness of the garbage
After the physical scan is complete, a first pass collection algorithm. The current empirical method
through all the physical nodes is made, building a is not sufficient. The compression, however, gives
full map of the map of data for each inode so that rise to a serious potential problem with this proof.
obsoleted nodes can be detected as such, and in- If a full page is written which compresses extremely
creasing the nlink field in the inode cache of the well, and later a single byte is written in the mid-
linked inode for each valid directory entry node. dle of the page which reduces the compressibility of
the page, then when garbage collecting the original
A second pass is then made to find inodes which page we may find that the new node written out
have no remaining links on the file system and delete is larger than the original. Thus, there would be
them. Each time a directory inode is deleted, the no way to place an upper bound on the amount of
pass is restarted, as other inodes may have been or- space required to garbage collect an erase block full
phaned. In future, this behaviour may be modified of data.
to store orphaned inodes in a lost+found directory
instead of just removing them. The proposed solution to this is to allow the total
ordering of the version field to be relaxed to a par-
Finally, a third pass is made to free the temporary tial ordering. We allow two nodes to have the same
information which was cached for each inode; leav- version field as long as they have identical data.
ing only the information which is normally kept in Thus, when garbage collection finds a node which
the struct jffs2 inode cache during operation. would expand, yet insufficient slack space to allow
In doing so, the field in the inode cache which cor- it to do so, it may copy the original node intact, pre-
responds to the next in ino field of the raw node serving the original version so that the nodes which
reference is set to NULL, thereby enabling the slightly overlay the data contained therein will still continue
to do so.

3.7 Truncation and File Holes

A problem which arose during the design stage for JFFS2, and
which had not already been addressed for the original
version, involved truncation of files. The sequence of
events which could be problematic was a truncation followed
by a write to an offset larger than the truncation point,
leaving a “hole” in the file which should return all zeroes
upon being read.

On truncation, the original JFFS merely wrote out a new node
giving the new length, and marked (in memory) the older
nodes containing data beyond the truncation point as
obsolete. Later writes would occur as normal.

During a scan of the file system on remounting, the
sequential nature of the garbage collection ensured that all
the old nodes containing actual data for the ranges which
should be “holes” were garbage collected before the
truncation node. As nodes were interpreted in version order
after the physical scan, correct behaviour could be
guaranteed, because the evidence of the truncation was still
present at all times until the old data were erased.

For JFFS2, where blocks can be garbage collected out of
order, it was necessary to ensure that old data could never
“show through” the holes caused by truncation and subsequent
extension of a file.

For this reason, it was decided that there should be no
holes in the proper sense, that is, no complete absence of
information for the range of bytes in question. Instead,
upon receiving a request to write to an offset greater than
the current size of a file, or a request to truncate to a
larger size, JFFS2 inserts a data node with the
previously-mentioned compression type JFFS2_COMPR_ZERO,
meaning that no actual data are contained within the node,
and the entire range represented by the node should be set
to zero upon being read.

In the case where a file contains a very large hole, it is
preferable to represent that hole by only a single physical
node on the medium, rather than a “hole” node for each page
in the range affected. Therefore, such hole nodes are a
special case of data node: the only type of data node which
may cover a range of more than one page.

This special case is itself the reason for further
complication, because of concerns about expansion during
garbage collection. If a single byte is written to a page
which was previously part of a hole, it is necessary to
ensure that garbage collection of either the original hole
node or the node containing the new byte of data does not
require more space than is taken by the original.

The solution to this problem is the same as for compressed
nodes which may expand when merged: if garbage collection
would cause an expansion, and there is insufficient slack
space to accommodate such growth, then the original node is
copied exactly, retaining the original version number.

4 Future Development

One oft-requested feature which is currently not planned for
development in JFFS2 is eXecute In Place (XIP)
functionality. When programs are run from JFFS2, the
executable code is copied from the flash into RAM before the
CPU can execute it. Likewise, even when the mmap() system
call is used, data are not accessed directly from the flash
but are copied into RAM when required.

XIP functionality in JFFS2 is not currently planned because
it is fairly difficult to implement and because the
potential benefits of XIP are not clearly sufficient to
justify the effort required to do so.

For obvious reasons, XIP and compression are mutually
exclusive: if data are compressed, they cannot be used
directly in place. Given a prototype platform with
sufficient quantities of both RAM and flash that neither XIP
nor compression is required, and the desire to save money on
the hardware, a choice can be made between halving the
amount of RAM and using XIP, or halving the amount of flash
and using compression.

Choosing the latter option will generally give the greater
cost saving, because flash is more expensive than RAM. The
operating system is also able to be more flexible in its use
of the available RAM, discarding file buffers during periods
of high memory pressure. Furthermore, because write
operations to flash chips are so slow, compressing the data
may actually be faster for many workloads.
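The claim that compression can make writes faster can be made
concrete with a simple timing model. The sketch below is not
taken from JFFS2: the function names and the parameters
write_bps, compress_bps and ratio are assumptions introduced
here for illustration, and the model ignores erase time, node
overhead and competition for the CPU. Under those assumptions,
compressing before writing is faster whenever compress_bps
exceeds write_bps / (1 - ratio).

```c
#include <stdbool.h>

/*
 * Illustrative timing model only; nothing here is JFFS2 code.
 *   write_bps:    sustained flash write throughput (bytes/second)
 *   compress_bps: compressor throughput (input bytes/second)
 *   ratio:        compressed size / original size, in (0, 1]
 */
static double raw_write_seconds(double nbytes, double write_bps)
{
    return nbytes / write_bps;
}

static double compressed_write_seconds(double nbytes, double write_bps,
                                       double compress_bps, double ratio)
{
    /* Time to compress the input, plus time to write the smaller output. */
    return nbytes / compress_bps + (nbytes * ratio) / write_bps;
}

static bool compression_is_faster(double write_bps, double compress_bps,
                                  double ratio)
{
    /* The byte count cancels out, so compare the cost of a single byte. */
    return compressed_write_seconds(1.0, write_bps, compress_bps, ratio)
         < raw_write_seconds(1.0, write_bps);
}
```

With a hypothetical flash device that accepts 1 MB/s and data
that compresses to half its size, the model favours compression
as soon as the compressor can process more than 2 MB/s.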
The main problem with XIP, however, is the interaction with
memory management hardware. Firstly, for all known memory
management units, each page of data must be exactly
page-aligned on the flash chip in order for it to be mapped
into a process’s address space, which makes such a file
system even more wasteful of space than the mere absence of
compression already implies. Secondly, while a flash chip is
processing write or erase commands, it may return status
words on all read cycles; therefore all currently valid
mappings of the pages of the chip would have to be found and
invalidated for the duration of the operation. These two
limitations make a writable filesystem with XIP
functionality extremely difficult to implement, and it is
unlikely that JFFS2 could support XIP without fundamental
changes to its design.

A read-only XIP filesystem would be a more reasonable
request, and an entirely separate file system providing this
functionality, based on the existing ROMFS file system, is
likely to be developed at some time in the near future.

4.1 Improved Fault Tolerance

The main area where JFFS2 still requires development is in
fault tolerance. There are still areas where, although
designed to be resilient, JFFS2 may exhibit a more serious
failure mode than is absolutely necessary given a physical
error.

In particular, JFFS2 will need more sophisticated methods of
dealing with single-bit errors in flash chips. Currently,
the node contains a 32-bit CRC, but this only gives error
detection; it does not allow the file system to correct
errors. Error correction is an absolute requirement for
operation on NAND flash chips, which have lower tolerances.
It is desirable even on NOR flash.

JFFS2 already has a primitive method of dealing with blocks
for which errors are returned by the hardware driver: it
files them on a separate bad_list and refuses to use them
again until the next time the file system is remounted. This
should be developed.

4.2 Garbage Collection Space Requirements

A major annoyance for users is the amount of space currently
required to be left spare for garbage collection. It is
hoped that a formal proof of the amount of required space
can be produced, but in the meantime a very conservative
approach is taken: five full erase blocks must be available
for writing before new writes will be permitted from user
space.

It should be possible to reduce this figure significantly,
hopefully to a single block for NOR flash and to two or
three blocks in the case of NAND flash, where extra space
should always be available to copy away data from bad
blocks.

The approach to this problem in JFFS1 was to evaluate and
attempt to prove an upper bound on the amount of space
required. This appeared to fail, as no such upper bound
seemed to exist. For JFFS2, it is suspected that a more
useful approach may be to define a reasonable upper bound,
such as a single erase block, and to modify the code to make
it true.

4.3 Transaction Support

For storing database information in JFFS2 file systems, it
may be desirable to expose transactions to user space. It
has been argued that user space can implement transactions
itself, using only the file system functionality required by
POSIX. This is true, but implementing a transaction-based
system on top of JFFS2 would be far less efficient than
using the existing journalling capability of the file
system, for the same reason that emulating a block device
and then using a standard journalling file system on top of
that was considered inadequate.

Little work, and relatively little thought, has gone into
this subject with respect to JFFS2, yet at first
consideration it seems that to implement this in JFFS2 would
not be particularly difficult or obtrusive. It is an
interesting avenue for future research.
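For comparison, the user-space approach referred to above is
commonly built on a write, fsync(), rename() sequence. The
following is a minimal sketch assuming a POSIX environment; the
names atomic_replace and tmp_path are hypothetical, and this is
one well-known pattern rather than a method described by the
JFFS2 design. Note that it rewrites the whole file for every
update, which illustrates the inefficiency mentioned above.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Replace the contents of `path` with `len` bytes from `buf` so that
 * readers see either the old contents or the new, never a mixture.
 * `tmp_path` is a hypothetical scratch name on the same file system.
 * Returns 0 on success, -1 on failure (`path` is left untouched).
 * Simplification: an interrupted write() is treated as a failure.
 */
static int atomic_replace(const char *path, const char *tmp_path,
                          const void *buf, size_t len)
{
    const char *p = buf;
    size_t left = len;
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n <= 0)
            goto fail;
        p += n;
        left -= (size_t)n;
    }

    /* Push the new data to the medium before it becomes visible. */
    if (fsync(fd) < 0)
        goto fail;
    if (close(fd) < 0) {
        unlink(tmp_path);
        return -1;
    }

    /* POSIX: the name refers to the old or the new file at all times. */
    if (rename(tmp_path, path) < 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp_path);
    return -1;
}
```

What survives a power failure still depends on the underlying
file system, and a careful implementation would also fsync() the
containing directory; both points are outside this sketch.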
5 Conclusion

Although JFFS2 is extremely young, it is relatively mature,
because it is developed from the excellent start given by
the design of JFFS v1.

The frequency of bugs being reported has reached a fairly
stable low level, and the majority of recent problems
reported with JFFS2 have actually turned out to be errors in
the physical flash drivers or with other parts of kernel
code, although sometimes this has highlighted an area where
JFFS2 should be more fault-tolerant.

Both versions of JFFS are now in active use in a reasonable
number of embedded systems, and JFFS2 has been included as a
fundamental part of the “Familiar” distribution of Linux for
the Compaq iPAQ handheld computer, replacing the read-only
CRAMFS filesystem which was previously used on those
devices.

The existence of a fully-functional writable file system for
this class of device is an exciting development, and was
absolutely essential to the progress of the Familiar
distribution, allowing files to be overwritten individually
without having to reset the device and use the bootloader to
program a complete replacement CRAMFS.

Commercial support for JFFS2 is available from Red Hat,
Inc., for customers wishing to use it in production systems
with full backup from the developers.

6 Acknowledgements

The author would like to thank Björn Wesen and the staff of
Axis Communications AB for designing the original JFFS and
releasing it under the GNU General Public License, and in
particular for then answering a stream of silly questions
about it.

The author is also grateful to Red Hat, Inc., who for some
reason took it upon themselves to actually pay him for
playing with this stuff.

Also deserving of a special mention is Vipin Malik, who has
done a wonderful job of testing JFFS and JFFS2, often
managing to break the latter when it finally seemed to have
reached stability.

7 Availability

JFFS was merged into the Linux kernel prior to the 2.4.0
release. The current JFFS2 code is also, at the time of
writing, in Alan Cox’s 2.4-ac kernels. The latest code for
the 2.4 version of each, and for the 2.2 version of JFFS v1,
is available from the Linux-MTD CVS repository.
Instructions for accessing this, along with links to
snapshot tarballs for the firewall-challenged, are available
from:

http://www.linux-mtd.infradead.org/

The original web site for JFFS and the current code for the
2.0 kernels, along with a link to the jffs-dev mailing list
which is used for discussion of both JFFS and JFFS2, is at:

http://developer.axis.com/software/jffs/

At the time of writing, a web site specific to JFFS2 is
intended to appear “shortly” at:

http://sources.redhat.com/jffs2/

References

[FTL] Intel Corporation, Understanding the Flash Translation
Layer (FTL) Specification, (1998).
http://developer.intel.com/design/flcomp/applnots/297816.htm

[LFS] Mendel Rosenblum and John K. Ousterhout, The Design
and Implementation of a Log-Structured File System, ACM
Transactions on Computer Systems 10(1) (1992) pp. 26–52.
ftp://ftp.cag.lfs.mit.edu/dm/papers/rosenblum:lfs.ps.gz

[eCos] Red Hat, Inc., eCos: Embedded Configurable Operating
System.
http://sources.redhat.com/ecos/
