Sun's Network File System (NFS)

[Figure: A generic client/server architecture. Several clients (Client 0
through Client 3) communicate over a network with a single file server,
which stores its data on a RAID.]
49.2 On To NFS
One of the earliest and quite successful distributed systems was devel-
oped by Sun Microsystems, and is known as the Sun Network File Sys-
tem (or NFS) [S86]. In defining NFS, Sun took an unusual approach: instead
of building a proprietary and closed system, Sun developed an open
protocol that simply specified the exact message formats that
clients and servers would use to communicate. Different groups could
develop their own NFS servers and thus compete in an NFS marketplace
while preserving interoperability. It worked: today there are many com-
panies that sell NFS servers (including Oracle/Sun, NetApp [HLM94],
EMC, IBM, and others), and the widespread success of NFS is likely
attributable to this "open market" approach.
It gets even worse when you consider the fact that a stateful server has
to deal with client crashes. Imagine, for example, a client that opens a file
and then crashes. The open() uses up a file descriptor on the server; how
can the server know it is OK to close a given file? In normal operation, a
client would eventually call close() and thus inform the server that the
file should be closed. However, when a client crashes, the server never
receives a close(), and thus has to notice the client has crashed in order
to close the file.
For these reasons, the designers of NFS decided to pursue a stateless
approach: each client operation contains all the information needed to
complete the request. No fancy crash recovery is needed; the server just
starts running again, and a client, at worst, might have to retry a request.
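To make this concrete, here is a minimal sketch (in C) of the kind of
self-contained arguments a stateless read request might carry; the structure
names and the 32-byte handle size are illustrative assumptions, not the
actual NFSv2 wire format:

    #include <stdint.h>

    #define FH_SIZE 32                  /* opaque file-handle size (assumed)  */

    struct nfs_fh {                     /* opaque handle naming the file      */
        uint8_t data[FH_SIZE];
    };

    struct read_args {                  /* everything one READ needs:         */
        struct nfs_fh fh;               /*   which file                       */
        uint32_t      offset;           /*   where (the client tracks this)   */
        uint32_t      count;            /*   how many bytes                   */
    };

Because the handle, offset, and count all travel with every request, the
server keeps no per-client open-file state, and any request can be serviced
(or retried) in isolation.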
Client                                        Server
fd = open("/foo", ...);
  Send LOOKUP (rootdir FH, "foo")
                                              Receive LOOKUP request
                                              look for "foo" in root dir
                                              return foo's FH + attributes
  Receive LOOKUP reply
  allocate file desc in open file table
  store foo's FH in table
  store current file position (0)
  return file descriptor to application

close(fd);
  Just need to clean up local structures
  Free descriptor "fd" in open file table
  (No need to talk to server)
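As a rough sketch of the client side of this flow (not Sun's actual
implementation), open() might be handled as follows; nfs_lookup() stands in
for the LOOKUP RPC, the 256-entry table is an arbitrary choice, and the
nfs_fh type is the one sketched earlier. Note that all of this state is
client-local:

    struct open_file {                  /* client-local open-file table entry */
        struct nfs_fh fh;               /* handle returned by LOOKUP          */
        uint32_t      pos;              /* current file offset                */
        int           used;
    };
    static struct open_file open_files[256];

    /* Assumed stub for the LOOKUP RPC: resolve 'name' in directory 'dir'. */
    int nfs_lookup(const struct nfs_fh *dir, const char *name,
                   struct nfs_fh *out);

    int nfs_open(const struct nfs_fh *rootdir, const char *name) {
        struct nfs_fh fh;
        if (nfs_lookup(rootdir, name, &fh) != 0)
            return -1;                            /* no such file              */
        for (int fd = 0; fd < 256; fd++) {
            if (!open_files[fd].used) {
                open_files[fd].used = 1;
                open_files[fd].fh   = fh;         /* remember the handle       */
                open_files[fd].pos  = 0;          /* offset starts at zero     */
                return fd;                        /* nothing created on server */
            }
        }
        return -1;                                /* local table full          */
    }

The matching close() simply marks the table slot free; as the figure shows,
no message to the server is needed.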
TIP: IDEMPOTENCY IS POWERFUL
Idempotency is a useful property when building reliable systems. When
an operation can be issued more than once, it is much easier to handle
failure of the operation; you can just retry it. If an operation is not
idempotent, life becomes more difficult.
In this way, the client can handle all timeouts in a unified way. If a
WRITE request was simply lost (Case 1 above), the client will retry it, the
server will perform the write, and all will be well. The same will happen
if the server happened to be down while the request was sent, but back
up and running when the second request is sent, and again all works
as desired (Case 2). Finally, the server may in fact receive the WRITE
request, issue the write to its disk, and send a reply. This reply may get
lost (Case 3), again causing the client to re-send the request. When the
server receives the request again, it will simply do the exact same thing:
write the data to disk and reply that it has done so. If the client this time
receives the reply, all is again well, and thus the client has handled both
message loss and server failure in a uniform manner. Neat!
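A minimal sketch of this unified retry logic on the client, assuming a
hypothetical rpc_call() transport helper that returns 0 once a reply arrives
and -1 on timeout:

    #include <stddef.h>

    /* Assumed transport stub: send a request, wait up to timeout_ms for a reply. */
    int rpc_call(const void *req, size_t reqlen,
                 void *reply, size_t replylen, int timeout_ms);

    void call_until_answered(const void *req, size_t reqlen,
                             void *reply, size_t replylen) {
        /* A timeout may mean the request was lost (Case 1), the server was
           down (Case 2), or the reply was lost (Case 3); because the operation
           is idempotent, re-sending the identical request is safe in all three. */
        while (rpc_call(req, reqlen, reply, replylen, 1000) != 0)
            ;   /* just try again */
    }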
A small aside: some operations are hard to make idempotent. For
example, when you try to make a directory that already exists, you are
informed that the mkdir request has failed. Thus, in NFS, if the file server
receives a MKDIR protocol message, executes it successfully, but the reply
is lost, the client may re-send the request and be told that it failed,
even though the operation actually succeeded the first time and only the
retry failed. Thus, life is not perfect.
[Figure 49.7: The Cache Consistency Problem. Client C1 caches F[v1], client
C2 caches F[v2], and client C3's cache is empty; server S's disk holds F[v1]
at first, and F[v2] eventually.]
Imagine that client C1 has read a file F and cached a copy, and that a
second client, C2, then overwrites F with new contents; let's call the new
version of the file F[v2] (version 2) and the old version F[v1], so we can
keep the two distinct (but of course the file has the same name, just
different contents). Finally, there is a third client, C3, which has not
yet accessed the file F.
You can probably see the problem that is upcoming (Figure 49.7). In
fact, there are two subproblems. The first subproblem is that the client C2
may buffer its writes in its cache for a time before propagating them to the
server; in this case, while F[v2] sits in C2’s memory, any access of F from
another client (say C3) will fetch the old version of the file (F[v1]). Thus,
by buffering writes at the client, other clients may get stale versions of the
file, which may be undesirable; indeed, imagine the case where you log
into machine C2, update F, and then log into C3 and try to read the file,
only to get the old copy! Certainly this could be frustrating. Thus, let us
call this aspect of the cache consistency problem update visibility; when
do updates from one client become visible at other clients?
The second subproblem of cache consistency is a stale cache; in this
case, C2 has finally flushed its writes to the file server, and thus the server
has the latest version (F[v2]). However, C1 still has F[v1] in its cache; if a
program running on C1 reads file F, it will get a stale version (F[v1]) and
not the most recent copy (F[v2]), which is (often) undesirable.
NFSv2 implementations solve these cache consistency problems in two
ways. First, to address update visibility, clients implement what is some-
times called flush-on-close (a.k.a., close-to-open) consistency semantics;
specifically, when a file is written to and subsequently closed by a client
application, the client flushes all updates (i.e., dirty pages in the cache)
to the server. With flush-on-close consistency, NFS ensures that a subse-
quent open from another node will see the latest file version.
Second, to address the stale-cache problem, NFSv2 clients first check
to see whether a file has changed before using its cached contents. Specif-
ically, before using a cached block, the client-side file system will issue a
GETATTR request to the server to fetch the file’s attributes. The attributes,
importantly, include information as to when the file was last modified on
the server; if the time-of-modification is more recent than the time that the
file was fetched into the client cache, the client invalidates the file, thus
removing it from the client cache and ensuring that subsequent reads will
go to the server and retrieve the latest version of the file. If, on the other
hand, the client sees that it has the latest version of the file, it will go
ahead and use the cached contents, thus increasing performance.
When the original team at Sun implemented this solution to the stale-
cache problem, they ran into a new problem: suddenly, the NFS server
was flooded with GETATTR requests. A good engineering principle to
follow is to design for the common case, and to make it work well; here,
although the common case was that a file was accessed only from a sin-
gle client (perhaps repeatedly), the client always had to send GETATTR
requests to the server to make sure no one else had changed the file. A
client thus bombarded the server, constantly asking "has anyone changed
this file?", when most of the time no one had.
To remedy this situation (somewhat), an attribute cache was added
to each client. A client would still validate a file before accessing it, but
most often would just look in the attribute cache to fetch the attributes.
The attributes for a particular file were placed in the cache when the file
was first accessed, and would then time out after a certain amount of time
(say 3 seconds). Thus, during those three seconds, all file accesses would
determine that it was OK to use the cached file and thus do so with no
network communication with the server.
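A simplified sketch of this client-side validation path appears below; the
structures, the server_getattr() stub (standing in for the GETATTR RPC), and
the 3-second timeout are illustrative rather than Sun's actual code:

    #include <time.h>

    #define ATTR_TIMEOUT 3              /* seconds an attribute entry stays fresh */

    struct cached_file {
        int    has_data;                /* do we hold cached contents?            */
        time_t data_fetched_at;         /* when contents entered the cache        */
        time_t mtime;                   /* last known server modification time    */
        time_t attr_fetched_at;         /* when mtime was last refreshed          */
    };

    /* Assumed stub for the GETATTR RPC: returns the server's current mtime. */
    time_t server_getattr(const struct cached_file *cf);

    int cached_copy_usable(struct cached_file *cf) {
        time_t now = time(NULL);
        if (now - cf->attr_fetched_at > ATTR_TIMEOUT) {
            cf->mtime = server_getattr(cf);  /* attribute cache expired: ask server */
            cf->attr_fetched_at = now;
        }
        if (cf->mtime > cf->data_fetched_at) {
            cf->has_data = 0;                /* stale: invalidate, re-read later    */
            return 0;
        }
        return cf->has_data;                 /* otherwise the cached copy is usable */
    }

Flush-on-close handles the other half of the problem: when the writing
client closes the file, its dirty blocks are pushed to the server, so a
later GETATTR from another client will observe the newer modification time.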
NFS servers, which are typically well-equipped machines with plenty of
memory, have caching concerns as well. When data (and metadata) is read
from disk, NFS servers will keep it in memory, and subsequent reads of said
data (and metadata) will not go to disk, a potential (small) boost in
performance.
More intriguing is the case of write buffering. NFS servers absolutely
may not return success on a WRITE protocol request until the write has
been forced to stable storage (e.g., to disk or some other persistent device).
While they can place a copy of the data in server memory, returning suc-
cess to the client on a WRITE protocol request could result in incorrect
behavior; can you figure out why?
The answer lies in our assumptions about how clients handle server
failure. Imagine the following sequence of writes as issued by a client:

    write(fd, a_buffer, size); // fill first block with a's
    write(fd, b_buffer, size); // fill second block with b's
    write(fd, c_buffer, size); // fill third block with c's

These writes overwrite the three blocks of a file with a block of a's,
then b's, and then c's. Thus, if the file initially looked like this:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
We might expect the final result after these writes to be like this, with the
x's, y's, and z's overwritten with a's, b's, and c's, respectively:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
cccccccccccccccccccccccccccccccccccccccccccc
Now let’s assume for the sake of the example that these three client
writes were issued to the server as three distinct WRITE protocol mes-
sages. Assume the first WRITE message is received by the server and
issued to the disk, and the client is informed of its success. Now assume
the second write is just buffered in memory, and the server reports
success to the client before forcing it to disk; unfortunately, the server
crashes before writing it to disk. The server quickly restarts and receives
the third write request, which also succeeds.
Thus, to the client, all the requests succeeded, but we are surprised
that the file contents look like this:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy <--- oops
cccccccccccccccccccccccccccccccccccccccccccc
Yikes! Because the server told the client that the second write was
successful before committing it to disk, an old chunk is left in the file,
which, depending on the application, might be catastrophic.
To avoid this problem, NFS servers must commit each write to stable
(persistent) storage before informing the client of success; doing so en-
ables the client to detect server failure during a write, and thus retry until
it finally succeeds. Doing so ensures we will never end up with file con-
tents intermingled as in the above example.
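On the server side, the rule boils down to forcing the data to disk before
acknowledging the request. A minimal sketch using standard POSIX calls
(mapping the NFS file handle to an open file descriptor is assumed to have
happened already):

    #include <unistd.h>
    #include <sys/types.h>

    int serve_write(int fd, const void *buf, size_t count, off_t offset) {
        if (pwrite(fd, buf, count, offset) != (ssize_t)count)
            return -1;                  /* report failure; the client will retry  */
        if (fsync(fd) != 0)             /* commit data (and metadata) to the disk */
            return -1;
        return 0;                       /* only now is it safe to reply success   */
    }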
This requirement raises a problem for NFS server implementations: without
great care, write performance can become the major performance bottleneck.
Indeed, some companies (e.g., Network Appliance) came into existence with
the simple objective of building an NFS server that can perform writes
quickly; one trick they use is to first put writes in battery-backed memory,
thus enabling the server to reply quickly to WRITE requests without fear of
losing the data and without the cost of having to write to disk right away;
the second trick is to use a file system specifically designed to write to
disk quickly when one finally needs to do so [HLM94, RO91].
49.12 Summary
We have seen the introduction of the NFS distributed file system. NFS
is centered around the idea of simple and fast recovery in the face of
server failure, and achieves this end through careful protocol design:
idempotent operations are central, because a client can safely retry a
failed request whether or not the server has already performed it.
References
[AKW88] “The AWK Programming Language” by Alfred V. Aho, Brian W. Kernighan, Peter
J. Weinberger. Pearson, 1988 (1st edition). A concise, wonderful book about awk. We once had the
pleasure of meeting Peter Weinberger; when he introduced himself, he said “I’m Peter Weinberger, you
know, the ’W’ in awk?” As huge awk fans, this was a moment to savor. One of us (Remzi) then said,
“I love awk! I particularly love the book, which makes everything so wonderfully clear.” Weinberger
replied (crestfallen), “Oh, Kernighan wrote the book.”
[C00] “NFS Illustrated” by Brent Callaghan. Addison-Wesley Professional Computing Series,
2000. A great NFS reference; incredibly thorough and detailed per the protocol itself.
[ES03] “New NFS Tracing Tools and Techniques for System Analysis” by Daniel Ellard and
Margo Seltzer. LISA ’03, San Diego, California. An intricate, careful analysis of NFS done via
passive tracing. By simply monitoring network traffic, the authors show how to derive a vast amount
of file system understanding.
[HLM94] “File System Design for an NFS File Server Appliance” by Dave Hitz, James Lau,
Michael Malcolm. USENIX Winter 1994. San Francisco, California, 1994. Hitz et al. were greatly
influenced by previous work on log-structured file systems.
[K86] "Vnodes: An Architecture for Multiple File System Types in Sun UNIX" by Steve R.
Kleiman. USENIX Summer ’86, Atlanta, Georgia. This paper shows how to build a flexible file
system architecture into an operating system, enabling multiple different file system implementations
to coexist. Now used in virtually every modern operating system in some form.
[NT94] “Kerberos: An Authentication Service for Computer Networks” by B. Clifford Neu-
man, Theodore Ts’o. IEEE Communications, 32(9):33-38, September 1994. Kerberos is an early
and hugely influential authentication service. We probably should write a book chapter about it some-
time...
[O91] “The Role of Distributed State” by John K. Ousterhout. 1991. Available at this site:
ftp://ftp.cs.berkeley.edu/ucb/sprite/papers/state.ps. A rarely referenced dis-
cussion of distributed state; a broader perspective on the problems and challenges.
[P+94] “NFS Version 3: Design and Implementation” by Brian Pawlowski, Chet Juszczak, Peter
Staubach, Carl Smith, Diane Lebel, Dave Hitz. USENIX Summer 1994, pages 137-152. The small
modifications that underlie NFS version 3.
[P+00] “The NFS version 4 protocol” by Brian Pawlowski, David Noveck, David Robinson,
Robert Thurlow. 2nd International System Administration and Networking Conference (SANE
2000). Undoubtedly the most literary paper on NFS ever written.
[RO91] “The Design and Implementation of the Log-structured File System” by Mendel Rosen-
blum, John Ousterhout. Symposium on Operating Systems Principles (SOSP), 1991. LFS again.
No, you can never get enough LFS.
[S86] “The Sun Network File System: Design, Implementation and Experience” by Russel
Sandberg. USENIX Summer 1986. The original NFS paper; though a bit of a challenging read,
it is worthwhile to see the source of these wonderful ideas.
[Sun89] “NFS: Network File System Protocol Specification” by Sun Microsystems, Inc. Request
for Comments: 1094, March 1989. Available: http://www.ietf.org/rfc/rfc1094.txt.
The dreaded specification; read it if you must, i.e., you are getting paid to read it. Hopefully, paid a lot.
Cash money!
[V72] “La Begueule” by Francois-Marie Arouet a.k.a. Voltaire. Published in 1772. Voltaire said
a number of clever things, this being but one example. For example, Voltaire also said “If you have two
religions in your land, the two will cut each other's throats; but if you have thirty religions, they will
dwell in peace.” What do you say to that, Democrats and Republicans?
Homework (Measurement)
In this homework, you’ll do a little bit of NFS trace analysis using real
traces. The source of these traces is Ellard and Seltzer’s effort [ES03].
Make sure to read the related README and download the relevant tar-
ball from the OSTEP homework page (as usual) before starting.
Questions
1. A first question for your trace analysis: using the timestamps found
in the first column, determine the period of time the traces were
taken from. How long is the period? What day/week/month/year
was it? (does this match the hint given in the file name?) Hint: Use
the tools head -1 and tail -1 to extract the first and last lines of
the file, and do the calculation.
2. Now, let’s do some operation counts. How many of each type of op-
eration occur in the trace? Sort these by frequency; which operation
is most frequent? Does NFS live up to its reputation?
3. Now let’s look at some particular operations in more detail. For
example, the GETATTR request returns a lot of information about
files, including which user ID the request is being performed for,
the size of the file, and so forth. Make a distribution of file sizes
accessed within the trace; what is the average file size? Also, how
many different users access files in the trace? Do a few users dom-
inate traffic, or is it more spread out? What other interesting infor-
mation is found within GETATTR replies?
4. You can also look at requests to a given file and determine how
files are being accessed. For example, is a given file being read or
written sequentially? Or randomly? Look at the details of READ
and WRITE requests/replies to compute the answer.
5. Traffic comes from many machines and goes to one server (in this
trace). Compute a traffic matrix, which shows how many different
clients there are in the trace, and how many requests/replies go to
each. Do a few machines dominate, or is it more evenly balanced?
6. The timing information, and the per-request/reply unique ID, should
allow you to compute the latency for a given request. Compute the
latencies of all request/reply pairs, and plot them as a distribution.
What is the average? Maximum? Minimum?
7. Sometimes requests are retried, as the request or its reply could be
lost or dropped. Can you find any evidence of such retrying in the
trace sample?
8. There are many other questions you could answer through more
analysis. What questions do you think are important? Suggest
them to us, and perhaps we’ll add them here!