Technical Report: Using The Linux NFS Client With Network Appliance™ Storage
Technical Report: Using the Linux NFS Client with Network Appliance Storage Getting the Best from Linux and Network Appliance Technologies
Updated by Bikash R. Choudhury, Network Appliance, Inc. April 2007 | TR-3183
Abstract This report helps you get the best from your Linux NFS clients when used in an environment that includes Network Appliance storage. You will learn what level of performance to expect from your Linux systems. You will learn how to tune your Linux clients and diagnose performance and reliability problems. Finally, you will learn where to look for more information when faced with problems that are difficult to diagnose. Storage tuning information specific to your application may be available in other Network Appliance technical reports.
This document is appropriate for customers, systems engineers, technical marketing engineers, and customer support personnel who install and configure Linux systems that use NFS to access Network Appliance storage and network caches.
TECHNICAL REPORT
Table of Contents
Abstract
1) Typographic Conventions
2) Introduction
3) Which Linux NFS Client Is Right for Me?
3.1) Identifying Kernel Releases
3.2) Today's Linux distributions
3.3) The NFS Client in the 2.4 Kernel
3.4) The NFS Client in the 2.6 Kernel
4) Foolproof Mount Options for Linux NFS Clients
4.1) Choosing a Network Transport Protocol
4.2) Capping the Size of Read and Write Operations
4.3) Special Mount Options
4.4) Tuning NFS client cache behavior
4.5) Mounting with NFS version 4
4.6) Unmounting NFS File Systems
4.7) Mount Option Examples
5) Performance
5.1) Linux NFS client performance
5.2) Diagnosing Performance Problems with the Linux NFS Client
5.3) Error Messages in the Kernel Log
5.4) Oops
5.5) Getting Help
6) Other Sundries
6.1) Telling Time
6.2) Security
6.3) Network Lock Manager
6.4) Using the Linux Automounter
Net booting your Linux NFS clients
Executive Summary
Appendix
Related Material
Special network settings
Controlling File Read-Ahead in Linux
How to Enable Trace Messages
How to Enable Uncached I/O on RHEL AS 2.1
1) Typographic Conventions Linux and appliance commands and file names appear in Courier New. Summary information appears in red italicized type at the end of each section, and an executive summary appears at the end of the document.
2) Introduction More and more Network Appliance customers recognize the value of Linux in their enterprises. Historically, the Linux NFS client has trailed the rest of Linux in providing the level of stability, performance, and scalability that is appropriate for enterprise workloads. In recent times, however, the NFS client has improved considerably and continues to improve in performance and ability to work under degraded network conditions. This document addresses several areas that concern those who are planning a new Linux deployment or are administering an existing environment that contains Linux NFS clients accessing Network Appliance appliances. These areas include:
- The level of performance and stability to expect from Linux NFS clients
- How to tune Linux NFS clients to perform well with appliances
- How to diagnose client and network problems that involve Linux NFS clients
- Which network interfaces and drivers work best
- How to configure other services required to provide advanced NFS features
- Where to find more tuning and diagnostic information on NOW (NetApp on the Web) and the Internet
Except for clients running Oracle databases, Network Appliance does not recommend specific Linux kernel releases or distributions. First, there are too many distributions and releases to qualify them all. There are more than a half-dozen distributions and thousands of different kernel releases. Add to that the many different versions of user-level helper applications (such as the mount command). Because all of these are open source, you can modify or replace any part of your client. Second, many hardware and application vendors specify a small set of releases or a single release and distribution that are certified and supported. It would be confusing for us to recommend one particular kernel or distribution when your hardware vendor recommends another, and your application vendor certifies yet a third. Finally, some applications are more sensitive to NFS client behavior than others. Recommending a particular Linux NFS client depends on the applications you want to run and what your performance and reliability requirements are. Therefore, instead of recommending one or two releases that work well, we provide some guidelines to help you decide among the many Linux distributions and releases, and we provide advice on how to make your Linux NFS clients work their best.
3) Which Linux NFS Client Is Right for Me? Before we begin our focus on technical issues, we cover some basic technical support challenges specific to Linux. The Linux NFS client is part of the Linux kernel. Because Linux is open source, you might think that it is easy to provide Linux kernel patches to upgrade the NFS client. In fact, providing a patch that fixes a problem or provides a new feature can be complicated by several facts of life in the Linux and open source worlds. There are many different parts to Linux, but the two we are concerned about are the distribution and the kernel. The distribution is the set of base operating system files that are included when customers install a Red Hat or SUSE Linux distribution on their hardware. This includes commands, applications, and configuration files. A kernel comes with a distribution, but customers can replace it, usually without affecting other files provided in the distribution. 3.1) Identifying Kernel Releases The version number of a Linux distribution and the release number of a Linux kernel use different naming schemes. While planning a distribution, each distributor chooses a particular kernel release (for example, 2.6.5) and adds some modifications of its own before placing the kernel into a distribution. To reduce the amount of variation they encounter in their support contracts, distributors support only a small set of kernels, most of which are carefully designed for a specific distribution. Because the NFS client is part of the kernel, updates to the NFS client require that you replace the kernel. Technically, it is easy to replace a kernel after a distribution is installed, but Linux customers risk losing distributor support for their installation if they install a kernel that was not built by the distributor. For this reason, Network Appliance does not recommend specific patches or kernel versions. 
Often support contracts constrain customers so they cannot install a particular patch until their chosen distributor provides a supported kernel that includes that patch. In the past, kernels were released in two branches, known as the stable branch and the development branch. The stable branch is ostensibly the branch that is hardened, reliable, and has unchanging program interfaces, while the development branch can be (and often is) unstable and sometimes even unbootable. Stable branches have even minor release numbers, such as 2.4, while development branches have odd minor release numbers, such as 2.5. However, upstream no longer maintains separate stable and development branches. There is now a single branch in which all new development is integrated, and it is intended to remain stable at all times. Some new projects and submodules are marked as experimental when they are not ready for regular enterprise consumption. Under the old model, the last stable branch was 2.4; under the new model, 2.6 serves as both the stable and the development branch. Linux kernels are not published on a time-based schedule. Kernel revisions are released when the branch maintainer decides they are ready. New features and API changes are allowed in development kernels, but there is no schedule for when such a kernel will become a stable release. Development branches have historically taken 24 to 30 months to become stable branches. For this reason, there is often significant pressure to add new features to stable releases instead of working them into development releases. To expedite the addition of new features, kernel developers recently changed the way stable and development branches are treated. As of this writing, the 2.6 kernel is being used as a development branch while distributors treat their own kernels as the stable branches.
3.2) Today's Linux distributions As mentioned above, distributions are numbered differently than kernels. Each distributor chooses its own numbering scheme. When describing your Linux environment to anyone, be sure to list both the distribution release number and the kernel release number. Distributors usually append another number to the end of their kernel versions to indicate which revision of that kernel is in use. For instance, Red Hat shipped kernel 2.4.18-3 with its 7.3 distribution, but made several other errata kernels available over time to fix problems in its kernel: 2.4.18-10, 2.4.18-17, 2.4.18-27, and so on. An important trend in commercial Linux distributions is the existence of enterprise Linux distributions. Enterprise distributions are quality-assured releases that come with special support contracts. Not all customers need this level of support, however. Red Hat recently changed its product line, dropping the professional series of distributions in favor of an openly maintained distribution called Fedora. Fedora is intended for developers and customers who can tolerate some instability on their Linux systems. SUSE continues to sell an enterprise distribution as well as a less expensive desktop product line. Distributions such as Mandrake and Debian are ideal for customers looking for a low-cost general-purpose Linux distribution but who have enough expertise to support themselves. Network Appliance recommends that its Linux customers always use the latest actively maintained distributions available. Customers running older unsupported distributions no longer get the benefit of security fixes and quick bug fixes on their Linux systems. Most Linux distributors will not address bugs in older distributions at all. Especially if your clients and storage are not protected by a firewall, it is important to stay current with the latest errata patches available for your distribution.
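When gathering this information, the distribution release can usually be read from a distributor-specific release file. The sketch below checks a few common locations; the file names vary by distributor, and not all of them exist on every system:

```shell
# Print the distribution release from the first known release file found
# (paths are distributor-specific examples, not an exhaustive list)
for f in /etc/redhat-release /etc/SuSE-release /etc/os-release; do
    if [ -f "$f" ]; then
        echo "== $f"
        cat "$f"
        break
    fi
done
```

Include this output, along with the kernel release, whenever you open a support case.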
To find out which kernel your clients run, you can use this command: % uname -r 2.4.21-15.0.4.ELsmp % Kernels built from community source code usually have only three or four dot-separated numbers, like 2.6.8.1. Distributors generally add a hyphen and more version numbers (in this case, -15.0.4), which indicate that additional patches over the community source base have been applied. The keyword on the end, such as hugemem or smp, shows additional hardware capabilities for which this kernel was built. 3.3) The NFS Client in the 2.4 Kernel The NFS client in this kernel has many improvements over the older 2.2 client, most of which address performance and stability problems. The NFS client in kernels later than 2.4.16 has significant changes to help improve performance and stability. Customers who use Linux on hardware with more than 896MB of memory should know that a special kernel compile option, known as CONFIG_HIGHMEM, is required for the system to access and use physical memory above 896MB. The Linux NFS client has a known problem in these configurations in which an application or the whole client system can hang at random. This issue has been addressed in the Linux community's 2.4.20 kernel, but still haunts kernels contained in distributions from Red Hat and SUSE that are based on earlier kernels. Early releases of Red Hat 7.3 contained a kernel that demonstrated very poor NFS client performance when mounting with UDP. Recent errata kernels that fix some of the performance problems are
available from Red Hat. Network Appliance recently published an article describing this problem in detail. See document ntapcs6648 on NOW (the URL is in the appendix). Earlier kernels in the 2.4 series may have some problems when using NFS over TCP. The latest 2.4-based distributions (SUSE SLES 8 SP3, Fedora Core 1, and RHEL 3.0) use kernels that have a robust NFS over TCP implementation. Kernels older than 2.4.20 can suffer from problems with NFS over TCP that result from lossy networks and overloaded NFS servers. The problem, documented in BURT 96021, can cause mount points to become unusable until the client is rebooted. No matter which 2.4 kernel your distribution uses, you should always start with NFS over TCP first, as TCP has a number of important benefits over UDP. 3.4) The NFS Client in the 2.6 Kernel During 2004, distributions based on the 2.6 kernel appeared and became stable enough to be deployed in production environments. SUSE SLES 9 was the first enterprise Linux distribution to use 2.6; later Red Hat also introduced its own 2.6-based enterprise version, Red Hat Enterprise Linux 4. A new feature in the 2.6 kernel is support for the latest version of the NFS protocol, version 4. Developers are still in the process of retrofitting the Linux NFS client and server implementations for the new protocol version. Certain features, like read and write delegations, are available today in the 2.6.18 (RHEL5) kernel, but others, such as replication and migration support, are still under development and are not yet available. Support for NFS version 4 is available now in Fedora Core, RHEL 4, and RHEL 5, and it is regularly tested against the NFS version 4 implementation in Data ONTAP as support for new features, such as delegations, is added. Full file system replication and migration support is still not available in the mainline kernel. There is support for NFSv4 "referrals" in the 2.6.20 kernel.
Referrals use the replication/migration protocol to tell a client that a given subdirectory exists on another server (rather like an automounter), but the current code still can't handle the case in which the server says, "I just moved the file you have open over to another server." Currently Data ONTAP does not support referrals or migration/replication. The 2.6 kernel also brings support for advanced authentication mechanisms such as Kerberos 5. Support for Kerberos works with NFS versions 2 and 3 as well as with NFS version 4. Kerberos authentication increases security by reducing the likelihood that user identities in NFS requests can be forged. It also provides optional facilities to ensure the integrity or privacy of communication between an NFS client and server. As with NFS version 4, developers are still refining support for NFS with Kerberos in Linux, so there are still some missing features. In RHEL5 (the 2.6.18 kernel), Kerberos authentication and integrity are safe to use; at the time of writing, a bug in the DES library prevented the privacy service from working correctly. Within the next year we expect NFS with Kerberos in Linux to acquire robust support for Kerberos authentication, integrity, and privacy over both TCP and UDP. Because the Linux implementation of NFS with Kerberos is based on Internet standards, and because it is regularly tested with the leading proprietary Kerberos implementations, Linux NFS with Kerberos will interoperate seamlessly with NetApp systems once the implementation is complete. The NFS client in the 2.6 kernel has demonstrated superior performance and stability over older Linux NFS clients, but as usual customers should be cautious about moving their production workloads onto new releases of the Linux kernel. Kernels 2.4.21-32.EL and 2.4.21-37.EL are based on Fedora Core 3; Red Hat packages those as RHEL3 Update 5 and RHEL3 Update 6, respectively. Kernel 2.6.9-5, that is, RHEL4, is based on Fedora Core 4.
The latest kernel from the RHEL4 branch is 2.6.9-42.EL. RHEL5 is based on Fedora Core 5, and its kernel version is 2.6.18-8.el5.
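On a 2.6 kernel with NFS version 4 enabled, version 4 mounts use the separate nfs4 file system type rather than a vers= option. The /etc/fstab sketch below is illustrative only; the appliance name filer and the paths are hypothetical, and NFSv4 must be licensed and enabled on the appliance:

```
# /etc/fstab sketch: NFSv4 mount on a 2.6 kernel
# ("filer" and the paths below are hypothetical examples)
filer:/vol/vol0   /mnt/filer   nfs4   hard,intr   0 0
```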
Highlights of the 2.6 kernel: RHEL4 and RHEL5 are based on the 2.6 kernel. RHEL4 started as the 2.6.9 kernel and RHEL5 started as the 2.6.18 kernel.
i) New hardware support The 2.6 kernel supports new architectures, such as the 64-bit PowerPC, the 64-bit AMD Opteron, and embedded processors. The x86 4G/4G kernel/user memory split provides increased address spaces. The new kernel supports up to 64GB of main memory and ~4GB of process virtual address space.
ii) Hyper-threading Hyper-threading is a major hardware enhancement supported by the 2.6 kernel. Basically, hyper-threading can create multiple virtual processors from a single physical processor using simultaneous multithreading (SMT) technology; multiple application threads can run simultaneously on one processor. To take full advantage of it, applications need to be multithreaded.
iii) Threading improvements The 2.6 kernel adopts the new thread library, the Native POSIX Thread Library (NPTL). This library is based on a 1:1 threading model and provides full POSIX compliance. NPTL gives the kernel a major performance boost for multithreaded applications in an SMP environment. Another threading improvement in the 2.6 kernel is that the number of PIDs that can be allocated has increased from 32,000 to 1 billion. This change improves application-starting performance on heavily loaded systems.
iv) I/O improvements Block I/O layer: The block I/O layer in the 2.6 kernel has been rewritten to improve kernel scalability and performance. The global I/O request lock in 2.4 has been removed. The new block I/O structure (bio) in the 2.6 kernel allows I/O requests larger than PAGE_SIZE. Most of the problems seen in 2.4 were caused by the use of buffer heads and kiobufs and are addressed in the new layer. The I/O scheduler was completely rewritten, and major improvements have been made to SCSI support. Asynchronous I/O: Asynchronous I/O is new in the 2.6 kernel.
It provides a way for enterprise applications such as Web servers and databases to scale up without resorting to complex internal pooling mechanisms for network connections.
v) NUMA (Non-Uniform Memory Access) NUMA is another major feature added to the Linux 2.6 kernel to improve system performance. In the traditional model for multiprocessor support (symmetric multiprocessing, or SMP), each processor has equal access to memory and I/O. The high contention rate on the processor bus becomes a performance bottleneck. The NUMA architecture can increase processor speed without increasing the load on the processor bus. In NUMA systems, each processor is close to some parts of memory and farther from others. Processors are arranged in smaller groups called nodes. Each node has its own processors and memory, and the nodes can talk to each other. It is quicker for processors to
gain access to memory in a local node than in different nodes. Minimizing internode communication can improve system performance.
vi) Reverse mappings Reverse mapping (RMAP) was implemented in the 2.6 kernel. Page tables keep track of the physical pages of memory that are used by a process, and they map the virtual pages to the physical pages. Some of these pages might not be used for long periods of time, making them good candidates for swapping out. However, before they can be swapped out, every process mapping that page must be found so that the page-table entry for the page in that process can be updated. In the Linux 2.4 kernel, this can be a challenging task, as the page tables for every process must be traversed in order to determine whether or not the page is mapped by that process. As the number of processes running on the system grows, so does the work involved in swapping out one of these pages. In spite of some drawbacks, RMAP reduces a serious bottleneck, locating the processes that map a page, to a simple operation. It also helps the system continue to perform and scale well when large applications place huge memory demands on the kernel and multiple processes share memory.
vii) Networking improvements The 2.6 kernel improves the Network File System (NFS) by including version 4. Kernels 2.6.9-42.EL (RHEL4.4) and 2.6.18-8.el5 (RHEL5), or any of the latest kernels available from the RHEL4 and RHEL5 code lines, are recommended for NFSv4. Apart from NFSv4, the 2.6 kernel provides support for several new file systems, including JFS, XFS, Autofs v5, and the Andrew File System (AFS). The ext3 file system size can grow to 8TB (x86/AMD64/EM64T) and 16TB (Itanium2/POWER). The 2.6 kernel also supports zero-copy networking.
This means that during message transmission there is no data copying among memory segments; message data moves directly between user application space and the network through the network interfaces. The 2.6 kernel also introduced IPsec, which provides users with security services for traffic at the Internet Protocol (IP) layer. IP payload compression is another addition to the kernel; it reduces the size of IP datagrams and improves the performance of communication between two endpoints, provided both machines have sufficient computational power and the communication takes place over congested and/or slow links. There are two new members of the TCP/IP protocol family, the Stream Control Transmission Protocol (SCTP) and Internet Protocol version 6 (IPv6), which were introduced in the 2.6 kernel. SCTP provides new features beyond TCP, such as multistreaming and multihoming, that are critical to certain workloads, such as telephony signaling over IP networks. The newer 2.6 kernels feature improved security options with IPv6. Besides extending IPsec, IPComp, and tunneling support to work over IPv6, the 2.6 kernel provides IPv6 privacy extensions too. The 2.6.18 kernel, on which RHEL5 is based, has some additional enhancements over 2.6.9 (RHEL4).
All the changes from 2.6.9 up to 2.6.18 are included in RHEL5. There are performance enhancements to the Big Kernel Lock (BKL) in multiprocessor environments; they exhibit more parallelism. Other performance improvements include on-the-fly changes to the I/O scheduler and the addition of another level in the page tables to increase the usable memory up to 128TB. The enhanced memory manager in the 2.6.18 kernel provides more stability. Some changes, like highmem page-table entries (PTEs) and large pages, work to reduce the overhead caused by memory management, thereby providing better performance and more efficiency. RHEL5-AP (Advanced Platform) supports Xen virtualization. Not all of the enhancements provided by the 2.6 kernel apply to every enterprise application. Some of them have specific hardware and software requirements. However, most of the enhancements listed here are general kernel improvements that will help Linux break the enterprise barrier.
In summary: You should use the latest distribution and kernel available from your distributor when installing a new deployment, and attempt to keep your existing Linux clients running the latest updates from your distributor. Always check with your hardware and application vendors to be certain they support the distribution and kernel you choose to run. Contact us if you have any special requirements.
4) Foolproof Mount Options for Linux NFS Clients If you have never set up mount options on an NFS client before, review the nfs man page on Linux to see how these terms are defined. You can type man nfs at a shell prompt to display the page. In addition, O'Reilly's Managing NFS & NIS covers many of these topics (see the appendix for URL and ISBN information). You can look in /etc/fstab on the client to see what options the client attempts to set when mounting a particular file system. Check your automounter configuration to see what defaults it uses when mounting. Running the mount command at a shell prompt tells you what options are actually in effect. Clients negotiate some options, for example, the rsize option, with servers. Look in the client's /proc/mounts file to determine exactly what mount options are in effect for an existing mount. The default NFS protocol version (2, 3, or 4) used when mounting an NFS server can change depending on what protocols the server exports, which version of the Linux kernel is running on the client, and what version of the mount utilities package you use. Version 2 of NFS is the default on earlier versions of 2.4 kernels, but most modern distributions make version 3 the default. To ensure that your client uses the NFSv3 protocol, you should specify vers=3 when mounting a file system. Be sure that the NFSv3 protocol is enabled on your appliance before trying to mount using vers=3 by using the options nfs command on your appliance's console. The hard mount option is the default on Linux and is mandatory if you want data integrity. Using the soft option reduces the likelihood of client instability during server and network outages, but it exposes your applications to silent data corruption, even if you mount file systems read-only. If a soft timeout interrupts a read operation, the client's cached copy of the file is probably corrupt.
To purge a corrupted file requires that some application lock and unlock the file, that the whole file system be unmounted and remounted, or that another client modify the file's size or mtime. If a soft timeout interrupts a write operation, there is no guarantee that the file on the server is correct, nor is there any guarantee that the client's cached version of the file matches what is on the server. A client can indicate that a soft timeout has occurred in various ways. Usually system calls return EIO when such a timeout occurs. You may also see messages in the kernel log suggesting that the client had trouble maintaining contact with the server and has given up. If you see a message that says the client is still trying, then the hard mount option is in effect. As an alternative to soft mounting, consider using the intr option, which allows users and applications to interrupt the NFS client when it gets stuck waiting for server or network recovery. On Linux, interrupting applications or mount commands does not always work, so sometimes rebooting your client is necessary to recover a mount point that has become stuck because the server is not available. When running applications such as databases that depend on end-to-end data integrity, you should use hard,nointr. Oracle has verified that using intr instead of nointr can expose your database to the risk of corruption when a database instance is signaled (for example, during a shutdown abort sequence). The soft option is useful only in a small number of cases. If you expect significant server or network instability, try using the soft option with TCP to help reduce the impact of temporary problems. When using the soft option, especially with UDP, specify a relatively large number of retries. This reduces the likelihood that very brief outages or a few dropped packets will cause an application failure or data corruption.
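For a database workload, an /etc/fstab entry following the hard,nointr recommendation might look like the sketch below; the appliance name filer and the paths are hypothetical examples:

```
# /etc/fstab sketch: hard,nointr for end-to-end data integrity
# ("filer" and the paths are hypothetical; vers=3 and tcp follow the
# recommendations elsewhere in this report)
filer:/vol/oradata   /mnt/oradata   nfs   vers=3,tcp,hard,nointr   0 0
```

After mounting, compare the entry against /proc/mounts to confirm which options are actually in effect.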
4.1) Choosing a Network Transport Protocol Although UDP is a simple transport protocol that has less CPU and network overhead than TCP, NFS over UDP has deficiencies that are exposed on congested networks, such as routed multispeed networks, DSL links, and slow WANs, and should never be used in those environments. TCP is practically always a safe bet, especially on versions of Linux more recent than 2.4.19. Future versions of the NFS protocol may not support UDP at all, so it is worth planning a transition to TCP now if you still use primarily NFS over UDP. Note that older releases of Data ONTAP do not enable NFS over TCP by default. From Data ONTAP release 6.2 onward, TCP connections are enabled automatically on new system installations. For all workloads, including database environments, we recommend using TCP. NFS over TCP can handle multispeed networks (networks in which the links connecting the server and the client use different speeds), higher levels of packet loss and congestion, fair bandwidth sharing, and widely varying network and server latency, but it can cause long delays during server recovery. Although TCP has slightly greater network and CPU overhead on both the client and the server, you will find that NFS performance on TCP remains stable across a variety of network conditions and workloads. If you find UDP suits your needs better and you run kernels released prior to 2.4.22, be sure to enlarge your clients' default socket buffer size by following the instructions listed in the appendix of this guide. Bugs in the IP fragmentation logic in these kernels can cause a client to flood the network with unusable packets, preventing other clients from accessing the system. The Linux NFS client is especially sensitive to IP fragmentation problems that can result from congested networks or undersized switch port buffers.
If you think IP fragmentation is an issue for your clients using NFS over UDP, the netstat -s command on the client and on the appliance will show continuous increases in the number of IP fragmentation errors. Be certain to apply this change to all Linux NFS clients on your network in order for the change to be completely effective. Note that RHEL 3.0's kernels, even though based on 2.4.21-pre1, already set transport socket buffer sizes correctly. In Linux kernels previous to 2.4.21, a remote TCP disconnect (for example, during a cluster failover) occasionally may cause a deadlock on the client that makes a whole mount point unusable until the client is rebooted. There is no workaround other than upgrading to RHEL 4.4 or a later version of the Linux kernel, where this issue is addressed. For databases, we require NFS over TCP. There are rare cases where NFS over UDP on noisy or very busy networks can result in silent data corruption. In addition, Oracle9i and 10g RAC are certified on NFS over TCP. You can control RPC retransmission timeouts with the timeo option. Retransmission is the mechanism by which clients ensure that a server receives and processes an RPC request. If the client does not receive a reply for an RPC within a certain interval, for any reason, it retransmits the request until it receives a reply from the server. After each retransmission, the client doubles the retransmit timeout, up to 60 seconds, to keep network load to a minimum. By default, the client retransmits an unanswered UDP RPC request after 0.7 seconds. In general, it is not necessary to change the retransmission timeout for UDP, but in some cases a shorter retransmission timeout for NFS over UDP may shorten latencies due to packet losses. As of kernel 2.4.20, an estimation algorithm that adjusts the timeout for optimal performance governs the UDP retransmission timeout for some types of RPC requests. Early versions of this estimator allowed
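On Linux clients, netstat -s reads these statistics from /proc/net/snmp. The sketch below extracts just the fragmentation and reassembly counters from that file; steadily increasing ReasmFails or FragFails values between samples suggest fragmentation trouble:

```shell
# Pair the "Ip:" header line with the "Ip:" value line from /proc/net/snmp
# and print only the fragmentation/reassembly counters (Linux-specific)
awk '/^Ip:/ { lines[++n] = $0 }
END {
    nk = split(lines[1], keys)
    split(lines[2], vals)
    for (i = 2; i <= nk; i++)
        if (keys[i] ~ /Frag|Reasm/)
            printf "%-14s %s\n", keys[i], vals[i]
}' /proc/net/snmp
```

Run it twice, a few minutes apart, and compare the counters; the appliance-side counters come from its own netstat -s output.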
extremely short retransmit timeouts, resulting in low performance and unnecessary error messages in the client's kernel log. The latest updates of RHEL 3 and SLES 8 contain fixes to address this problem. The Linux NFS client quietly retransmits RPC requests several times before reporting in the kernel log that it has lost contact with an NFS server. You can control how many times the client retransmits the same request before reporting the loss using the retrans mount option. Remember that whenever the hard mount option is in effect, an NFS client never gives up retransmitting an RPC until it gets a reply. Be careful not to use the similar-sounding retry mount option, which controls how long the mount command retries a backgrounded mount request before it gives up. Retransmission for NFS over TCP works somewhat differently. The TCP network protocol contains its own timeout and retransmission mechanism that ensures packets arrive at the receiving end reliably and in order. The RPC client depends on this mechanism for recovering from the loss of RPC requests and thus uses a much longer timeout setting for NFS over TCP by default. Due to bug 151097 in the mount command, the default retransmission timeout value on Linux for NFS over TCP is quite small, unlike in other NFS client implementations. This bug is fixed in RHEL3; RHEL4.4 and later releases exhibit normal behavior. To obtain standard behavior, we strongly recommend using timeo=600,retrans=2 explicitly when mounting via TCP. Unlike with NFS over UDP, using a short retransmission timeout with NFS over TCP does not have performance benefits and may increase the risk of data corruption. In summary, we strongly recommend using TCP as the transport of choice for NFS on modern Linux distributions. To avoid IP fragmentation issues on both the client and the appliance, consider disabling NFS over UDP entirely on the appliance and explicitly specifying tcp on all your NFS mounts.
In addition, we strongly recommend the explicit use of the timeo=600 mount option on Linux to work around bugs in the mount command that shorten the retransmit timeout. If you must use UDP for NFS, ensure that you have properly sized the transport socket buffers, and explicitly set a large number of retransmissions with the retrans= mount option.

4.2) Capping the Size of Read and Write Operations

NFS clients break application read and write requests into smaller chunks when communicating with NFS servers. The maximum size, in bytes, that a client uses for NFS read requests is called the rsize, and the maximum size a client uses for NFS write requests is called the wsize. Together, these two are often referred to as the transfer size, since there are few cases where the two need to have different values. On Linux, the maximum transfer size is 32KB, but the transfer size can be anywhere between 1KB and 32KB. The client rounds these values down to the nearest power of two.

You can specify the transfer sizes explicitly when mounting an NFS server with the rsize and wsize mount options, or you can allow the client to choose a default value. The client and server take these values and negotiate the maximum allowed size based on what both support. In fact, you can tell your appliance to limit the maximum transfer size using Data ONTAP's options nfs command to prevent clients from using a large transfer size. Be careful not to change the maximum transfer size while there is outstanding read or write activity, as some older Linux clients can't recover gracefully from a change in the middle of a read or write operation.

The network transport protocol (TCP or UDP) you choose interacts in complicated ways with the transfer size. When you encounter poor performance because of network problems, using NFS over TCP is a better way to achieve good performance than using small read or write sizes over UDP.
The NFS client and server fragment large UDP datagrams, such as single read or write operations larger
than the network's MTU, into individual IP packets. RPC over UDP retransmits a whole RPC request if any part of it is lost on the network, whereas RPC over TCP efficiently recovers a few lost packets and reassembles the complete request at the receiving end. Thus with NFS over TCP, a 32KB read and write size usually provides good performance by allowing a single RPC to transmit or receive a large amount of data. With NFS over UDP, a 32KB read and write size may provide good performance, but it often results in terrible performance if the network becomes congested.

For Linux, a good compromise when using NFS over UDP is a small multiple of the network's MTU. For example, typical Ethernet uses an MTU of 1,500 bytes, so a transfer size of 4,096 or 8,192 is probably going to give the best performance as network conditions vary over time. If you find that even that does not work well, and you cannot improve network conditions, we recommend switching to NFS over TCP if possible.

By default, NFS client implementations choose the largest transfer size a server supports. However, if you do not explicitly set rsize or wsize when you mount an NFS file system on a 2.4-based Red Hat NFS client, the default value for both is a modest 4,096 bytes. Red Hat chose this default because the default transport is UDP, and the small default transfer size allows the Linux NFS client to work without adjustment in most environments. Usually, on clean high-performance networks or with NFS over TCP, you can improve NFS performance by explicitly increasing these values. This has changed in 2.6-based kernels: the default rsize and wsize are 32768 unless otherwise specified, and the default transport is TCP, which is highly recommended for all kinds of workloads.

In Linux, the rsize and wsize mount options have additional semantics compared with the same options as implemented in other operating systems.
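The rounding and clamping rules described in this section can be sketched as follows. This is a simplified model of the behavior described above, not the kernel's implementation; the 1KB-32KB limits and the MTU-multiple guidance for UDP are the ones stated in the text:

```python
def effective_transfer_size(requested, lo=1024, hi=32768):
    """Round a requested rsize/wsize down to the nearest power of two
    and clamp it to the client's supported range (1KB-32KB)."""
    clamped = max(lo, min(requested, hi))
    power = 1
    while power * 2 <= clamped:
        power *= 2
    return power

def udp_suggestion(mtu=1500):
    """Suggest a UDP transfer size that is a small multiple of the MTU,
    per the guidance above (for example, 4,096 for 1,500-byte Ethernet)."""
    return effective_transfer_size(4 * mtu)  # about 4 MTUs, rounded down

print(effective_transfer_size(30000))  # rounds down to 16384
print(udp_suggestion())                # 4096 for standard Ethernet
```

For example, a request of rsize=30000 silently becomes 16,384 on the wire, which is worth remembering when comparing benchmark results across mount options.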
Normally, the Linux client caches application write requests, issuing NFS WRITE operations when it has at least wsize bytes to write. The NFS client often returns control to a writing application before it has issued any NFS WRITE operations. It also issues NFS READ operations in parallel before waiting for the server to reply to any of them.

If rsize is set below the system's page size (4KB on x86 hardware), the NFS client in 2.4 kernels issues individual read operations one at a time and waits for each operation to complete before issuing the next. If wsize is set below the system's page size, the NFS client issues synchronous writes, regardless of the sync or async mount options. As with reads, synchronous writes cause applications to wait until the NFS server completes each individual write operation before issuing the next operation or letting the application continue with other processing. When performing synchronous writes, the client waits until the server has written its data to stable storage before allowing an application to continue.

Some hardware architectures allow a choice of different page sizes. Intel Itanium systems, for instance, support pages up to 64KB. On a system with 64KB pages, the rsize and wsize limitations described above still apply; thus all NFS I/O is synchronous on these systems, significantly slowing read and write throughput. This limitation has been removed in 2.6 kernels, so that all read and write traffic is asynchronous whenever possible, independent of the transfer size settings. When running on hardware that supports different page sizes, choose a combination of page size and rsize/wsize that allows the NFS client to do asynchronous I/O if possible. Usually, distributors choose a single large page size, such as 16KB, when they build kernels for hardware architectures that support multiple page sizes.

In the 2.6-based kernel, the memory manager deals with memory in 4KB pages on x86 systems.
The actual page size is architecture dependent. For most uses, pages of this size are the most efficient way for the memory manager to deal with memory. Some applications, however, make use of extremely
large amounts of memory; large databases are a common example. For every page mapped by each process, page-table entries must also be created to map the virtual address to the physical address. If a process maps 1GB of memory with 4KB pages, it takes 262,144 page-table entries to keep track of those pages. If each page-table entry consumes 8 bytes, that is 2MB of overhead for every 1GB of memory mapped.

This is quite a bit of overhead by itself, but the problem becomes even worse if multiple processes share that memory. In such a situation, every process mapping that same 1GB of memory consumes its own 2MB worth of page-table entries. With enough processes, the memory wasted on overhead might exceed the amount of memory the application requested for use.

One way to alleviate this situation is to use a larger page size. Most modern processors support at least a small and a large page size, and some support even more than that. On x86, the size of a large page is 4MB, or 2MB on systems with physical address extension (PAE) enabled. Assuming a large page size of 4MB in the example above, that same 1GB of memory could be mapped with only 256 page-table entries instead of 262,144. This translates to only 2,048 bytes of overhead instead of 2MB.

The use of large pages can also improve performance by reducing the number of translation lookaside buffer (TLB) misses. The TLB is a cache for the page tables that allows virtual-to-physical address translation to be performed more quickly for pages that are listed in it. Of course, the TLB can hold only a limited number of translations. Large pages accommodate more memory in fewer actual pages, so as more large pages are used, more memory can be referenced through the TLB than with smaller page sizes.
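The page-table arithmetic above can be verified directly. The sketch assumes, as the text does, one page-table entry per mapped page and 8 bytes per entry:

```python
def pte_overhead(mapping_bytes, page_size, pte_size=8):
    """Return (number of page-table entries, overhead in bytes) for a
    mapping of the given size, assuming one PTE per page."""
    entries = mapping_bytes // page_size
    return entries, entries * pte_size

GiB = 1 << 30

# 1GB mapped with 4KB pages: 262,144 entries, 2MB of page tables.
print(pte_overhead(GiB, 4 * 1024))

# The same 1GB mapped with 4MB large pages: 256 entries, 2,048 bytes.
print(pte_overhead(GiB, 4 * 1024 * 1024))
```

Note how the overhead scales linearly with the number of processes mapping the region, since each process carries its own page tables.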
It is very important to note that the capabilities of the Linux NFS server are different from those of the Linux NFS client. As of the 2.6.9 and later kernel releases, the Linux NFS server supports NFS over TCP with an rsize and a wsize of 32KB. The Linux NFS client supports NFS over both UDP and TCP, with rsize and wsize up to 32KB. Some online documentation is confusing when it refers to features that Linux NFS supports; usually such documentation refers to the Linux NFS server, not the client. Check with your Linux distributor to determine whether their kernels support serving files via NFS over TCP.

4.3) Special Mount Options

Consider using the bg option if your client system needs to be available even when it cannot mount some servers. This option causes mount requests to put themselves in the background automatically if a mount cannot complete immediately. By default, when a client starts up and a server is not available, the client waits for the server to become available. The default behavior, which you can adjust with the retry mount option, results in waiting almost a week before giving up.

The fg option is useful when you need to serialize your mount requests during system initialization. For example, you probably want the system to wait for /usr to become available before proceeding with multiuser boot. If you mount /usr or other critical file systems from an NFS server, you should consider using fg for these mounts. The retry mount option has no effect on foreground mounts: a foreground mount request fails immediately, without any retransmission, if any problem occurs. This was bugzilla 152599, which was fixed in 2.6.9-42 (RHEL 4.4).

For security, you can also use the nosuid mount option. This causes the client to disable the set-user-ID and set-group-ID bits on files and directories. The Linux man page for the mount command recommends also disabling or removing the suidperl command when using this option. Note that the storage also has a nosuid
export option, which does roughly the same thing for all clients accessing an export. Interestingly, the storage's nosuid export option also disables the creation of special devices. If you notice programs that use special sockets and devices (such as screen) behaving strangely, check for the nosuid export option on your storage.

To enable Kerberos authentication on your NFS mounts, you can specify the sec=krb5 mount option. In addition to Kerberos authentication, you can also choose to enable authentication with request integrity checking (sec=krb5i) or authentication with privacy (sec=krb5p). Note that as of this writing, most Linux distributions do not yet support krb5p or other advanced security flavors such as SPKM3 or LIPKEY. Section 6.2 below describes how to configure your NFS client to use NFS with Kerberos.

4.4) Tuning NFS client cache behavior

Other mount options allow you to tailor the client's attribute caching and retry behavior. It is not necessary to adjust these behaviors under most circumstances. However, sometimes you must adjust NFS client behavior to make NFS appear to your applications more like a local file system, or to improve performance for metadata-intensive workloads.

There are a few indirect ways to tune client-side caching. First, the most effective way to improve client-side caching is to add more RAM to your clients. Linux will make appropriate use of the new memory automatically. To determine how much RAM you need to add, determine how large your active file set is and increase RAM to fit. This greatly reduces the cache turnover rate. You should see fewer read requests and faster client response time as a result.

Some working sets will never fit in a client's RAM cache. Your clients may have 128MB or 4GB of RAM, for example, but you may still see significant client cache turnover. In this case, reducing cache miss latency is the best approach.
You can do this by improving your network infrastructure and tuning your server to improve its performance. Because a client-side cache is not effective in these cases, you may find that keeping the client's cache small is beneficial.

Normally, for each file in a file system that has been accessed recently, the client caches file attribute information, such as a file's last modification time and size. To detect file changes quickly yet efficiently, the NFS protocol uses close-to-open cache semantics. When a client opens a file, it uses a GETATTR operation to check that the file still exists and that any cached data it has is still up to date. A client checks back with the server only after a timeout indicates that the file's attributes may be stale. During such a check, if the server's version of the attributes has changed, the client purges its cache. A client can delay writes to a file indefinitely; when a client closes a file, however, it flushes all pending modifications to the server. This allows a client to provide good performance in most cases, but it means it might take some time before an application running on one client sees changes made by applications on other clients.

Clients check back with the server every so often to be sure cached attribute information is still valid. However, adding RAM on the client will not affect the rate at which the client tries to revalidate parts of the directory structure it has already cached. No matter how much of the directory structure is cached on the client, it must still validate what it knows when files are opened or when attribute cache information expires. You can lengthen the attribute cache timeout with the actimeo mount option to reduce the rate at which the client tries to revalidate its attribute cache. With the 2.4.19 kernel and later releases, you can also use the nocto mount option to reduce the revalidation rate even further, at the expense of cache coherency among multiple clients.
The nocto mount option is appropriate for read-only mount points where files change infrequently, such as a lib, include, or bin directory; static HTML files; or image libraries. In combination with judicious settings of actimeo, you can significantly reduce the number of on-the-wire operations generated by your NFS clients. Be careful to test this setting with your application to be sure that it will tolerate the delay before the NFS client notices file changes made by other clients and fetches the new versions from the server.

Certain distributions also support an option that disables Access Control List (ACL) support on NFS version 3 mounts. After update 3, RHEL 3 allows you to disable ACL support, eliminating all ACCESS operations on the wire for a mount point. ACCESS operations are synchronous, slowing applications unnecessarily if they do not use or support Access Control Lists. Using the noacl mount option is safe when your files are stored on appliances that are accessed only by NFS version 2 or 3 clients. Do not use the noacl mount option if your files reside on a volume that is shared with other types of clients that have native ACL support, such as CIFS or NFS version 4 clients.

The Linux NFS client delays application writes to combine them into larger, more efficiently processed requests. You can guarantee that a client immediately pushes every write system call an application makes to servers by using the sync mount option. This is useful when an application needs the guarantee that data is safe on disk before it continues. Frequently, such applications already use the O_SYNC open flag or invoke the fsync system call when needed; thus the sync mount option is often not necessary. Delayed writes and the client's attribute cache timeout can delay detection of changes on the server by many seconds while a file is open.

The noac mount option prevents the client from caching file attributes.
This means that every file operation on the client that requires file attribute information results in a GETATTR operation to retrieve a file's attribute information from the server. Note that noac also causes a client to process all writes to that file system synchronously, just as the sync mount option does. Disabling attribute caching is only one part of noac; it also guarantees that data modifications are visible on the server so that other clients using noac can detect them immediately. Thus noac is shorthand for actimeo=0,sync.

When the noac option is in effect, clients still cache file data as long as they detect that a file has not changed on the server. The noac mount option allows a client to keep very close track of files on a server so it can discover changes made by other clients quickly. Normally you will not use this option, but it is important when an application that depends on single-system behavior is deployed across several clients.

Using the noac mount option causes a 40% performance degradation on typical workloads, and some common workloads, such as sequential writes, can be impacted by up to 70%. Database workloads that consist of random reads and writes are generally less affected by noac. The noac option generates a very large number of GETATTR operations and sends write operations synchronously; both add significant protocol overhead. The noac mount option trades off single-client performance for client cache coherency. Only applications that need tight cache coherency among multiple clients require that file systems be mounted with the noac mount option.

Some applications require direct, uncached access to data on a server. Using the noac mount option is sometimes not good enough, because even with this option, the Linux NFS client still caches reads. To ensure your application sees the server's version of a file's data and not potentially stale data cached by the client, your application can lock and unlock the file.
This pushes all pending write operations
back to the server and purges any remaining cached data, so the next read operation will go back to the server rather than reading from a local cache.

Alternatively, the Linux NFS client in the RHEL and SLES kernels supports direct I/O to NFS files when an application opens a file with the O_DIRECT flag. Direct I/O is a feature designed to benefit database applications that manage their own data cache. When this feature is enabled, an application's read and write system calls are translated directly into NFS read and write operations. The Linux kernel never caches the results of any read or write when a file is opened with this flag, so applications always get exactly what's on the server. Because of I/O alignment restrictions in some versions of the Linux O_DIRECT implementation, applications must be modified to support direct I/O properly. See the appendix for more information on this feature and its equivalent in RHEL AS 2.1, uncached I/O.

The 2.6-based kernel provides finer control over direct I/O, since it offers this functionality at the file level; no special mount-level setting is required to enable direct I/O on specific files. To enable direct I/O, open the file with the O_DIRECT flag. The advantage is that within a volume, some files can be opened with O_SYNC and some with O_DIRECT, as needed. The init.ora parameter filesystemio_options can help decide whether Oracle opens a file with O_SYNC, O_DIRECT, or something else.

O_DIRECT tries to minimize the cache effects of I/O to and from the file. In general this degrades performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to and from user-space buffers. The I/O is synchronous; that is, at the completion of a read or write, data is guaranteed to have been transferred. Under Linux 2.4, transfer sizes and the alignment of the user buffer and file offset must all be multiples of the logical block size of the file system.
Under Linux 2.6, alignment to 512-byte boundaries suffices.

For some servers or applications, it is necessary to prevent the Linux NFS client from sending Network Lock Manager requests. You can use the nolock mount option to prevent the Linux NFS client from notifying the server's lock manager when an application locks a file. Note, however, that the client still flushes its data cache and uses more restrictive writeback semantics when a file lock is in effect. The client always flushes all pending writes whenever an application locks or unlocks a file. For detailed information on configuring and tuning storage in an Oracle environment, see the Network Appliance technical reports at www.netapp.com.

4.5) Mounting with NFS version 4

RHEL 4.0 and 2.6-based Fedora Core distributions introduce support for the latest version of NFS, version 4. In this section we'll discuss how to use NFS version 4 on your Linux NFS client and what mount options your client supports. You should ensure that your storage is running Data ONTAP 7.2.2 or newer before trying NFS version 4. We highly recommend using NFSv4 in production environments with Data ONTAP 7.3 and later.

The Linux NFS client now recognizes two different file system types. The nfs file system type uses the vers= mount option to determine whether to use NFS version 2 or NFS version 3 when communicating with a server. The nfs4 file system type supports only NFS version 4 and does not recognize the vers= mount option. If you have scripts that specify particular file system types to act on NFS file systems, you will need to modify them to work with both nfs and nfs4 file systems.

The NFS version 4 wire protocol represents users and groups as strings instead of UIDs and GIDs. Before attempting to mount with NFS version 4, you must ensure that the client's ID mapper is configured and running; otherwise, the client will map all UIDs and GIDs to the user nobody. The mapper's configuration file is /etc/idmapd.conf.
Typically the only change needed is to specify the
real domain name of the client so that it can map local UIDs and GIDs on the client to network names. When this is done, start the ID mapper daemon with this command:

/etc/init.d/rpcidmapd start

To mount a server that supports NFS version 4, you might use the following command:

mount -t nfs4 filer:/vol/vol0 /mnt/nfs

Some mount options you may be accustomed to using with NFS versions 2 and 3 are no longer supported with the nfs4 file system type. As mentioned, vers= is not supported. The udp and tcp mount options are no longer supported; instead, use proto= if you would like to choose a transport other than TCP. NetApp appliances follow the NFS version 4 specification and do not support NFS version 4 over UDP, so proto=tcp is the only transport you can use with NFS version 4 when your clients communicate with NetApp appliances. Other mount options that do not work with nfs4 file systems include noacl, nocto, and nolock.

Some NFS version 4 features are not yet included in common Linux distributions, or still have major known problems. The Linux NFS version 4 client found in RHEL 4 and Fedora Core as of this writing does not yet support NFS version 4 ACLs or named attributes. There are still some issues with using ACLs over NFSv4. The following two options describe setting ACLs from a Linux client against a version 4 server and having the server enforce access by Linux or Solaris 10 clients.

Option #1: POSIX to NFSv4 ACL mapping

The Linux ACL utilities use and expect POSIX-draft semantics, in which you have user/group/other semantics for a number of individually listed users. (The POSIX draft was never finalized into a standard.) NFSv4 ACLs are much richer; therefore you'll need to patch your Linux ACL utilities and library to map a POSIX-draft ACL into an over-the-wire NFSv4 ACL. The mapping is not perfect. If you don't apply these patches and try to run getfacl or setfacl on an NFS-mounted file, you'll get an "Operation not supported" error.
You can find the set of required patches, acl-2.2.41-CITI_NFS4_ALL, at: http://www.citi.umich.edu/projects/nfsv4/linux/

The easiest approach is to get the source tree using git and build the RPMs yourself. Use "git clone" and "git pull" to get the source, then build and apply the RPMs under the build directory. A bug is open with RHEL 4 and RHEL 5 to request that these patches be integrated by default into the distribution. You can find the details at:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=231118
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=231231

Once you've applied the RPMs, you'll be able to set and get ACLs over NFS. You can set ACLs on Linux and they will also be understood by Solaris 10, and vice versa. If you're interested in the details of how the mapping is done and some of its limitations, see: http://tools.ietf.org/html/draft-ietf-nfsv4-acl-mapping-05

As an example of potential problems during the mapping of POSIX to NFSv4 ACLs, consider the following. A Linux client accessing a file on a NetApp appliance (7.2.1) is not able to truncate the file even when a user has an ACE (Access Control Entry) that allows modifying the file. The Linux client wants to update "mtime" during truncation, but because the POSIX ACE specifies only that the user can modify data, and not attributes (such as owner and time), the truncation is denied
and therefore you can't modify an NFSv4 file with "vi" when you are not the owner but have an ACE that would allow you to modify the file. The fix may be in the ACL command/library, in the Linux client, or in relaxing the requirements of the NetApp appliance. The Linux server and the Solaris server are less strict in their enforcement of attribute modification, so it is not a problem for them. You will not have this problem with a Solaris client against a NetApp server either, since it uses a different mechanism.

Option #2: Native NFSv4 ACL utilities

There is an alternative that uses NFSv4 ACLs natively. You need to install a new package from the following link: http://www.citi.umich.edu/projects/nfsv4/linux/nfs4-acl-tools/nfs4-acl-tools0.3.1.tar.gz

This gives you a new set of NFSv4 ACL utilities, which set and read ACLs using the NFSv4 model instead of the POSIX model. NFSv4 ACLs present a richer model, and you have greater granularity of control when all your clients speak NFSv4 natively. This works well if you're in an all-Linux environment or when you set and read the ACLs from a single architecture. Since the ACL is enforced by the server, it doesn't matter if a client cannot understand how to display the value of the ACL; the enforcement is still done by the server. When using the new package, you will be able to define and read ACLs on the Linux client and verify that the server correctly enforces access.

It is also recommended that you use this package with the 2.6.18 (RHEL 5) kernel instead of the 2.6.9-42 (RHEL 4.4) kernel; our testing has indicated that this package is more compatible with RHEL 5. This package has recently been added to the Fedora 7 distribution, and we expect it to be included in RHEL 5 within the year. It also has some problems with delegation, so you should disable read and write delegation on your storage using the options nfs command if you plan to access the storage via NFS version 4 from a Linux NFS client.
The Linux NFS version 4 client also does not yet support file migration and replication via FS_LOCATIONS, nor does it fully support crossing security boundaries (going from, say, a volume exported with sec=sys to a volume exported with sec=krb5) automatically. Support for referrals went into the 2.6.20 kernel, and most of the code is in RHEL 5; it just needs a couple of bug fixes. SECINFO is still a work in progress; the code is under testing right now and might go into Linux 2.6.23.

4.6) Unmounting NFS File Systems

This section discusses the unique subtleties of unmounting NFS file systems on Linux. Like other *NIX operating systems, the umount command detaches one or more file systems from a client's file system hierarchy (like several other common features of *NIX, the name of this command is missing a letter). Normally, you use umount with no options, specifying only the path to the root of the file system you want to unmount. Sometimes you might want more than a standard unmount operation, for example, when programs appear to be waiting indefinitely for access to files in a file system or when you want to clear the client's data cache. If you want to unmount all currently mounted NFS file systems, use:
umount -a -t nfs

Note that a separate command is necessary to unmount NFS version 4 file systems:

umount -a -t nfs4

Sometimes unmounting a file system becomes stuck. For this, there are two options:

umount -f /path/to/filesystem/root

This forces an unmount operation to occur if the mounted NFS server is not reachable, as long as there are no RPC requests on the client waiting for a reply. After kernel 2.4.11,

umount -l /path/to/filesystem/root

usually causes the kernel to detach a file system from the client's file system hierarchy immediately, but allows it to clean up RPC requests and open files in the background.

As mentioned above, unmounting an NFS file system does not interrupt RPC requests that are awaiting a server reply. If a umount command fails because processes are waiting for network operations to finish, you must interrupt each waiting process using ^C or an appropriate kill command. Stopping stuck processes usually happens automatically during system shutdown so that NFS file systems can be safely unmounted. The NFS client allows these methods to interrupt pending RPC requests only if the intr option is set for that file system. To identify processes waiting for NFS operations to complete, use the lsof command. The umount command accepts file system-specific options via the -o flag; the NFS client does not have any special options.

4.7) Mount Option Examples

We provide the following examples as a basis for beginning your experimentation. Start with an example that closely matches your scenario, then thoroughly test the performance and reliability of your application while refining the mount options you have selected.

On older Linux systems, if you do not specify any mount options, the Linux mount command (or the automounter) automatically chooses these defaults:

mount -o rw,fg,vers=2,udp,rsize=4096,wsize=4096,hard,intr,timeo=7,retrans=5

These default settings are designed to make NFS work right out of the box in most environments.
Almost every NFS server supports NFS version 2 over UDP. The rsize and wsize values are relatively small because some network environments fragment large UDP packets, which can hurt performance if there is a chance that fragments can be lost. The RPC retransmit timeout is set to 0.7 seconds by default to accommodate slow servers and networks.

On clean single-speed networks, these settings are unnecessarily conservative. Through some firewalls, UDP packets larger than 1,536 bytes are not supported, so these settings do not work. On congested networks, UDP may have difficulty recovering from large numbers of dropped packets. NFS version 2 write performance is usually slower than NFS version 3. As you can see, there are many opportunities to do better than the default mount options.
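To see why congested networks hurt UDP mounts with large transfer sizes, consider the probability that every IP fragment of a single RPC arrives intact. The sketch below is an idealized model, not a measurement: it assumes independent per-fragment loss, a 1,500-byte MTU, and roughly 1,480 bytes of payload per fragment:

```python
def rpc_delivery_probability(transfer_size, frag_payload=1480, loss_rate=0.01):
    """Probability an entire UDP RPC survives, given that losing any one
    IP fragment forces the client to retransmit the whole request."""
    fragments = -(-transfer_size // frag_payload)  # ceiling division
    return (1 - loss_rate) ** fragments

# With 1% packet loss, a 32KB request (23 fragments) survives far less
# often than a 4KB request (3 fragments).
print(round(rpc_delivery_probability(32768), 3))  # about 0.794
print(round(rpc_delivery_probability(4096), 3))   # about 0.970
```

Under this model, roughly one in five 32KB requests must be entirely retransmitted at 1% loss, which is why large transfer sizes over UDP collapse on congested networks while TCP, recovering individual segments, does not.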
Here is an example of mount options that are reasonable defaults; in fact, on many newer Linux distributions, these are the default mount options:

mount -o rw,bg,vers=3,tcp,timeo=600,rsize=32768,wsize=32768,hard,intr

Using the bg option means our client will be able to finish booting without waiting for appliances that may be unavailable because of network outages. The hard option minimizes the likelihood of data loss during network and server instability, while intr allows users to interrupt applications that may be waiting for a response from an unavailable server. The tcp option works well on many typical LANs with a 32KB read and write size. Using timeo=600 is a good default for TCP mounts, but for UDP mounts, timeo=4 might be more appropriate.

Here is an example of a poor combination of mount options:

mount -o rw,soft,udp,rsize=1024,wsize=1024

Using soft mounts with UDP is a recipe for silent data corruption, and on Linux 2.4 these mount options make it worse. The wsize=1024 mount option on Linux 2.4 mandates synchronous writes: writes go to the server one at a time rather than in groups, and the client requires the server to push data onto disk before responding to the client's write request. If a server gathers writes to improve disk bandwidth, it delays its response to each write request while waiting for more write requests, which can trigger the client to retry write requests. A server that delays responding to a write for as long as several hundred milliseconds will probably cause the client to drop requests unnecessarily after several retries (note that appliances usually do not delay writes, because they can cache these requests in nonvolatile RAM to fulfill NFSv2's requirement for stable writes). To address these issues, always use the hard option, and use read and write sizes larger than your client's page size.
When mounting a group of home directories over a WAN, you might try:

mount -o rw,bg,vers=3,nosuid,tcp,timeo=600,retrans=2,rsize=2048,wsize=2048,soft,intr

This example uses NFS over TCP because NFS clients often reside on slower, less capable networks than servers. In this case, the TCP protocol can provide fast recovery from packet losses caused by network speed transitions and noisy phone lines. Using the nosuid mount option means users cannot create or use suid programs that reside in their home directories, providing a certain degree of safety from Trojan horses. Limiting the maximum size of read and write operations gives interactive sessions on slow network links an advantage by keeping very large packets off the wire. On fast networks, large rsize and wsize values, such as 32768, are more appropriate. The soft option helps to recover quickly from server or network outages with a minimal risk of possible data loss, and the timeo=600 option allows the TCP protocol a long time to attempt recovery before the RPC client interferes.

When mounting an appliance from an anonymous FTP or HTTP server, you could use:

mount -o ro,fg,vers=3,tcp,timeo=600,retrans=2,rsize=32768,wsize=32768,hard,nointr,nocto,actimeo=600

Here we use the fg option to ensure that NFS files are available before the FTP or HTTP server is started. The ro option anticipates that the FTP or HTTP server will never write data into files. The nocto option helps reduce the number of GETATTR and LOOKUP operations at the expense of tight cache coherency with other clients. The FTP server will see changes to files on the server after its attribute
cache times out (usually after about one minute). Lengthening the attribute cache timeout also reduces the attribute cache revalidation rate.

When mounting an appliance for use with a single-instance Oracle database system over a clean Gigabit Ethernet, you might try:

mount -o rw,fg,vers=3,tcp,timeo=600,retrans=2,hard,nointr

Again, the fg option ensures that NFS file systems are available before the database instance starts up. We use TCP here because, even though the physical network is fast and clean, TCP adds extra data integrity guarantees. The hard option ensures data integrity in the event of network problems or a cluster failover event. The nointr option prevents signals from interrupting NFS client operations. Such interruptions may occur during a shutdown abort, for instance, and are known to cause database corruption. File locking should be enabled when running databases in production as a degree of protection against corruption caused by improper backup procedures (for example, another instance of the same database running at a disaster recovery site against the same files as your normal production instance). See the appendix for more information on how to set up databases on Linux, including details on how to adjust the Linux read-ahead algorithm.

In summary: Use of NFS version 3 over TCP is strongly recommended. If NFS over UDP is slow or hangs, this is a sign of network problems, so try TCP instead. Avoid using the soft mount option. Try the special mount options if you need an extra boost in performance.
5) Performance

This section covers aspects of Linux client performance, with a special focus on networking.

5.1) Linux NFS client performance

The Linux NFS client runs in many different environments, from light desktop usage to a database with a dedicated private SAN. In general, the Linux NFS client can perform as well as most other NFS clients, and better than some, in these environments. However, you must plan your mount options and observe network behavior carefully to ensure the Linux NFS client performs at its best.

On low-speed networks (10Mb/sec or 100Mb/sec), the Linux NFS client can read and write as fast as the network allows. This means Linux NFS clients running anything faster than a 400 MHz processor can saturate a 100Mb link. Slower network interfaces (for example, 16-bit PCMCIA cards on laptops) noticeably reduce client-side NFS bandwidth. Be sure your clients have enough CPU to drive the network while concurrently handling your application workload.

If your clients use high-performance networking (gigabit or faster), you should plan to provide enough CPU and memory bandwidth on your clients to handle the interrupt and data rate. The NFS client software and the gigabit driver cut into resources available to user-level applications, so make sure there are enough resources to go around. Most gigabit cards that support 64-bit PCI or better should provide good performance.

For most purposes, Gigabit Ethernet over copper works about as well as Gigabit Ethernet over fiber for short-distance links. Category 5E or Category 6 cables are necessary for reliable performance on copper links. Fiber adds long-haul capabilities and even better reliability, but at a significant cost. Some find that copper terminations and cabling are more rugged and reliable than fiber. As always, please refer to your card manufacturer's website for updated driver information. When using Gigabit Ethernet, ensure both ends of every link have enabled full flow control.
The command to check and enable flow control on Linux depends on the distribution you use; some use mii-tool, others prefer ethtool. Some switches, particularly mid-range Cisco switches, do not support flow control in both directions. Discuss support for full flow control with your switch vendor to ensure that your gigabit NIC and routers support it properly.

Further, if you use Linux NFS clients and storage together on an unrouted network, consider using jumbo frames to improve the performance of your application. Be sure to consult your switch's command reference to make sure it is capable of handling jumbo frames in your environment. There are some known problems in Linux drivers and the networking layer when using the maximum frame size (9,000 bytes). If you experience unexpected performance slowdowns when using jumbo frames, try reducing the MTU to, say, 8,960 bytes. When using jumbo frames on more complex networks, ensure that every link in the network between your client and server supports them and has jumbo frame support enabled. If NFS over TCP is working with jumbo frames but NFS over UDP is not, that may be a sign that some part of your network does not support jumbo frames.

The Linux NFS client and network layer are sensitive to network performance and reliability. After you have set your mount options as we recommend, you should get reasonable performance. If you do not, and your workload is not already CPU-bound, you should begin looking at network conditions between your clients and servers. For example, on a clean Gigabit Ethernet network, a single Linux client can write to an F880 appliance as fast as the appliance can put data on disk. If there is other network traffic or packet loss, write
performance from a Linux client on NFS over UDP can drop rapidly, though on NFS over TCP, performance should remain reasonable. Read performance depends on the size and speed of the client's and the storage's memory.

The most popular choices for 100BaseT network interfaces are the Intel EEPro/100 series and the 3Com 3C905 family. These are often built into server mainboards and can support network booting via PXE, described in more detail later in this document. A recent choice among mainboard integrators is the RTL chipset. All of these implementations perform well in general, but some versions of their driver software have known bugs. Check your distributor's errata database for more information. Recent versions should work fairly well. When choosing a 100BaseT card for your Linux systems, look for a card that has a mature driver that is already integrated into the Linux kernel. Features such as checksum offloading are beneficial to performance. You can use Ethernet bonding, or trunking, to improve the reliability or performance of your client.

Most network interface cards use autonegotiation to obtain the fastest settings allowed by the card and the switch port to which they attach. Sometimes, PHY chipset incompatibilities may result in constant renegotiation, or in negotiating half duplex or a slow speed. This is especially common when a 100BaseT host adapter is attached to a gigabit switch port. When diagnosing a network problem, be sure your Ethernet settings are as you expect before looking for other problems. To solve an autonegotiation problem, you may be inclined to hard-code the settings you want, but avoid this because it only masks a deeper problem. Work with your switch and card vendors to resolve these problems.

In summary: Whether you use TCP or UDP, be sure you have a clean network. Ensure your network cards always negotiate the fastest settings, and that your NIC drivers are up to date.
Increasing the size of the client's socket buffer (see appendix) is always a wise thing to do when using UDP.

5.2) Diagnosing Performance Problems with the Linux NFS Client

The client works best when the network does not drop any packets. The NFS and RPC clients also compete with applications for available CPU resources. These are the two main categories of client performance problems you may encounter.

Checking for network packet loss is the first thing to do when looking for problems. With NFS over UDP, a high retransmission count can indicate packet loss due to network or server problems. With NFS over TCP, the network layer on the client handles network packet loss, but server problems still show up as retransmissions. On some 2.4 kernels, TCP retransmissions are also a sign of large application writes on the client that have filled the RPC transport socket's output buffer. To see retransmissions, you can use nfsstat -c at a shell prompt. At the top of the output, you will see the total number of RPCs the client has sent and the number of times the client had to retransmit an RPC. The retransmit rate is determined by dividing the number of retransmissions by the total number of RPCs. If the rate exceeds a few tenths of a percent, network losses may be a problem for your performance.

NFS over TCP does not show up network problems as clearly as UDP and performs better in the face of packet loss. If your TCP mounts run faster than your UDP mounts, that's a sure sign that the network between your clients and your storage is dropping packets or is otherwise bandwidth-limited. Normally UDP is as fast as or slightly faster than TCP. The client keeps network statistics that you can view with netstat -s at a shell prompt. Look for high error counts in the IP, UDP, and TCP sections of this
command's output. The same command also works on an appliance's console. Here, look for nonzero counts in the "fragments dropped after timeout" and "fragments dropped (dup or out of space)" fields in the IP section.

There are a few basic sources of packet loss.

1. If the end-to-end connection between your clients and servers contains links of different speeds (for instance, the server is connected via Gigabit Ethernet, but the clients are all connected to the network with 100Base-TX), packet loss occurs at the point where the two speeds meet. If a gigabit-connected server sends a constant gigabit stream to a 100Mb client, only one packet in 10 can get to the client. UDP does not have any flow control built in to slow the server's transmission rate, but TCP does; thus, it provides reasonable performance through a link speed change.

2. Another source of packet loss is small packet buffers on switches. If either the client or server bursts a large number of packets, the switch may buffer them before sending them on. If the switch buffer overflows, the packets are lost. It is also possible that a switch can overrun a client's NIC in a similar fashion. This becomes a greater possibility for large UDP datagrams, because switches and NICs tend to burst an entire IP packet (all of its fragments) at once.

3. The client's IP layer will drop UDP datagrams if the client's send or receive socket buffer runs out of space for any reason. The client's RPC layer allocates a socket for each mount. By default, these sockets use 64KB input and output buffers, which are too small on systems that use a large rsize or wsize or generate a large number of NFS operations in a short period. To increase the size of these buffers, follow the instructions in the appendix to this document. There is no harm in doing this on all your Linux clients unless they have less than 16MB of RAM.
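The socket buffer limits mentioned in item 3 can be inspected through /proc and, as root, raised there as well. This is only a sketch: the 256KB (262144-byte) figure below is an illustrative value, not the recommendation from this report's appendix.

```shell
# Current default and maximum socket buffer sizes, in bytes.
# These files are readable without special privilege.
cat /proc/sys/net/core/rmem_default /proc/sys/net/core/rmem_max
cat /proc/sys/net/core/wmem_default /proc/sys/net/core/wmem_max

# As root, you could raise the limits like this (illustrative value only):
# echo 262144 > /proc/sys/net/core/rmem_max
# echo 262144 > /proc/sys/net/core/wmem_max
```

Settings made this way do not survive a reboot; see the appendix for how to make them permanent.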
If you have resolved these issues and still have poor performance, you can attempt end-to-end performance testing between one of your clients and a similar system on the server's LAN using a tool such as ttcp or iPerf. This exposes problems that occur in the network outside of the NFS protocol. When running tests such as iPerf, select UDP tests, as these directly expose network problems. If your network is full duplex, run iPerf tests in both directions concurrently to ensure your network is capable of handling a full load of traffic in both directions simultaneously. If you find some client operations work normally while others cause the client to stop responding, try reducing your rsize and wsize to 1,024 to see if there are fragmentation problems preventing large operations from succeeding.

One more piece of network advice: become familiar with network snooping tools such as tcpdump, tcpslice, and ethereal. (In RHEL 5, ethereal and tethereal are replaced by wireshark and tshark.) On the storage, you can run pktt, which generates trace files in tcpdump format that you can analyze later on a client. These tools provide the last word in what is really happening on your network between your clients and storage. You may need to run both tcpdump on a client and pktt on your storage at the same time and compare the traces to determine where the problem lies.

You must explicitly specify several options to collect clean network traces with tcpdump. Be sure the snaplen option (-s) is set large enough to capture all the interesting bytes in each packet, but small enough that tcpdump is not overwhelmed with incoming traffic. If tcpdump is overwhelmed, it drops incoming packets, making the network trace incomplete. The default value is 96 bytes, which is too
short to capture all the RPC and NFS headers in each packet. Usually a value of 256 bytes is a good compromise for UDP, but you can set it to zero if you need to see all the data in each packet. Snooping TCP packets requires a zero snaplen, because TCP can place several RPC requests in a single network packet. If snaplen is short, the trace will miss RPCs that are contained near the end of long packets.

In addition, always use filtering to capture just the traffic between the client and the server. Again, this reduces the likelihood that tcpdump or your local file system will be overwhelmed by incoming traffic and makes later analysis easier. You can collect traffic to or from your client using the hostname filter. Several other tcpdump options allow you to collect traffic destined for one or more hosts at a time; read the manual to find out more. An automounter can cause a lot of network chatter, so it is best to disable the automounter on your client and set up your mounts by hand before taking a network trace.

To find out if your application is competing for CPU resources with the NFS client, RPC client, or network layer on your client system, you can use the top program. If you see the rpciod process at the top of the listing, you know that NFS operations are dominating the CPU on your system. In addition, if you see the system CPU percentage increase significantly when your application accesses NFS data, this also can indicate a CPU shortage. In many cases, adding more CPUs or faster CPUs helps. Switching to UDP may also be a choice for reducing CPU load if an uncongested high-speed network connects your clients and storage. As the Linux NFS client and networking layer improve over time, they will become more CPU-efficient and include features such as TCP offload that reduce the amount of CPU required to handle large amounts of data.

There are certain cases in which processes accessing NFS file systems may hang.
This is most often due to a network partition or server outage. Today's client implementation is robust enough to recover in most cases. Occasionally a client fails because of high load or some other problem. Unfortunately, little can be done in these cases other than rebooting the client and reporting the problem.

5.3) Error Messages in the Kernel Log

There are two messages that you may encounter frequently in the kernel log (located in /var/log/messages on Linux systems). The first is "server not responding." This message occurs after the client retransmits several times without any response from a server. If you know the server is up, this can indicate the server is sluggish or there are network problems. If you know the server is down, this indicates the client is waiting for outstanding operations to complete on that server, and it is likely there are programs waiting for the server to respond.

The second, perhaps more frustrating, message is "can't get request slot." This message indicates that the RPC client is queuing messages and cannot send them. This is usually due to network problems such as a bad cable, incorrectly set duplex or flow control options, or an overloaded switch. It may appear as if your client is stuck at this point, but you should always wait at least 15 minutes for network and RPC client timeouts to recover before trying harsher remedies such as rebooting your client or storage.

5.4) The Oops

If a program encounters an unrecoverable situation in user space, it stops running and dumps core. If the same process encounters an unrecoverable situation within the Linux kernel, the kernel attempts to isolate the process, stop it, and report the event in the kernel log. This is called an oops. Many times
a failing process holds system resources that are not released during an oops. The kernel can run for a few more moments, but generally other processes deadlock or terminate with an oops when they attempt to allocate resources that were held by the original failing process. If you are lucky, the kernel log contains useful information after the system recovers. Red Hat kernels automatically translate the kernel log output into symbolic information that means something to kernel developers. If the oops record contains only hexadecimal addresses, try using the ksymoops tool in the kernel source tree to decode the addresses.

Sometimes a serial console setup can help capture output that a failing kernel cannot write into its log. Use a crossover cable (also sometimes called a file transfer or null modem cable) to connect your client to another system's serial console via its COM1 or COM2 port. On the receiving system, use a tool such as the minicom program to access the serial port. Serial console support is built into kernels distributed by Red Hat, but be sure to enable the serial console option in kernels you build yourself. Finally, you should set appropriate boot command line options (instructions are provided in the Documentation directory of the Linux kernel source tree) to finish enabling the serial console. Optionally, you can also update /etc/inittab to start a mingetty process on your serial console if you'd like to log in there.

5.5) Getting Help

Most Linux NFS client performance problems are due to a lack of CPU or memory on the client, incorrect mount options, or packet losses on the network between the client and servers. If you have set up your client correctly and your network is clean, but you still suffer from performance or reliability problems, you should contact experts to help you proceed further. Currently, there is no professionally maintained knowledge base that tracks Linux NFS client issues.
However, expert help is available on the Web at nfs.sourceforge.net, where you can find a Linux NFS Frequently Asked Questions list, as well as several how-to documents. There is also a mailing list specifically for helping administrators get the best from Linux NFS clients and servers. Network Appliance customers can also search the NOW database for Linux-related issues. Network Appliance also maintains some of the more salient Linux issues within its BURT database. See the appendix in this report for more information.

If you find there are missing features or performance or reliability problems, we encourage you to participate in the community development process. Unlike proprietary operating systems, new features appear in Linux only when users implement them. Problems are fixed when users are diligent about reporting them and following up to see that they are really fixed. If you have ever complained about the Linux NFS client, here is your opportunity to do something about it. When you have found a problem with the Linux NFS client, you can report it to your Linux distributor. Red Hat, for instance, supports an online bug database based on bugzilla. You can access Red Hat's bugzilla instance at http://bugzilla.redhat.com/.

When filing a BURT that relates to Linux client misbehavior with an appliance, be sure you report:

The Linux distribution and the Linux kernel release (e.g., Red Hat 7.2 with kernel 2.4.7-10).

The client's kernel configuration (/usr/src/linux/.config is the usual location) if you built the kernel yourself.
Any error messages that appear in the kernel log, such as oops output or reports of network or server problems.

All mount options in effect (use cat /proc/mounts to display them, and do not assume they are the same as the options you specified on your mount commands).

Details about the network topology between the client and the storage, such as how busy the network is, how many switches and routers there are, what the link speeds are, and so on. You can report network statistics on the client with nfsstat -c and netstat -s.

Client hardware details, such as SMP or UP, which NIC, and how much memory. You can use the lspci -v command and cat /proc/cpuinfo, cat /proc/meminfo on most distributions to collect most of this.

Include a network trace and/or a dump of debugging messages (see the appendix).
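A small shell script can collect most of these details in one pass. This is only a sketch; some of the commands may be missing or live in different paths on your distribution, and the report file name is arbitrary:

```shell
# Gather client-side diagnostic details into a single report file.
{
  echo "== kernel =="; uname -a
  echo "== mount options in effect =="; cat /proc/mounts
  echo "== NFS client statistics =="; nfsstat -c 2>/dev/null || echo "nfsstat not installed"
  echo "== network statistics =="; netstat -s 2>/dev/null || echo "netstat not installed"
  echo "== hardware =="; lspci -v 2>/dev/null || echo "lspci not installed"
  cat /proc/cpuinfo /proc/meminfo
} > nfs-client-report.txt
echo "wrote nfs-client-report.txt"
```

Attach the resulting file to your report along with any network traces.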
Most importantly, you should carefully describe the symptoms on the client. "A client hang" is generally not specific enough: it could mean the whole client system has deadlocked, or that an application on the client has stopped running. Always be as specific as you can.

In summary: If you cannot find what you need in this paper or from other resources, contact your Linux distributor or Network Appliance to ask for help.
6) Other Sundries

This section covers auxiliary services you may need to support advanced NFS features.

6.1) Telling Time

The clock on your Linux clients must remain synchronized with your storage to avoid problems such as authentication failures or incomplete software builds. Usually you set up a network time service such as NTP and configure your storage and clients to update their time using this service. After you have properly configured a network time service, you can find more information on enabling NTP on your storage in the Data ONTAP System Administrator's Guide.

Linux distributions usually come with a prebuilt network time protocol (NTP) daemon, though there is little documentation available for it. If your distribution does not have an NTP daemon, you can build and install one yourself by downloading the latest ntpd package from the Internet (see the appendix). To enable NTP on your clients, be sure the ntpd startup script runs when your client boots (look in /etc/rc.d or /etc/init.d; the exact location varies depending on your distribution; on Red Hat systems, you can use chkconfig --level 35 ntpd on). You must add the network time server's IP address to /etc/ntp/step-tickers and /etc/ntp.conf.

If you find that the time protocol daemon is having difficulty maintaining synchronization with your time servers, you may need to create a new drift file. Make sure your client's /etc/ntp directory and its contents are owned by the ntp user and group so that the daemon can update the drift file, and disable authentication and restriction commands in /etc/ntp.conf until you are sure everything is working correctly. As root, shut down the time daemon and delete the drift file (usually /etc/ntp/drift). Now restart the time daemon again. After about 90 minutes, it will write a new drift file into /etc/ntp/drift. Your client system should keep better time after that.
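On a Red Hat-style system, the steps above might look like the following sketch. The address 192.168.1.5 is a placeholder for your site's time server, and all of these commands require root:

```shell
# Point the client at the time server (placeholder address).
echo "192.168.1.5" >> /etc/ntp/step-tickers
echo "server 192.168.1.5" >> /etc/ntp.conf

# Make sure ntpd starts at the usual run levels, then restart it.
chkconfig --level 35 ntpd on
service ntpd restart

# To regenerate the drift file: stop ntpd, remove /etc/ntp/drift,
# restart ntpd, and wait about 90 minutes for a new drift file to appear.
```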
Always keep the date, time, and time zone on your appliance and clients synchronized. Not only will you ensure that any time-based caching on your clients works correctly, but it will also make debugging easier by aligning the time stamps in client logs and client network traces with the appliance's message log and pktt traces.

6.2) Security

Today, most versions of the Linux NFS client support only two types of authentication: AUTH_NULL and AUTH_UNIX. Linux distributions based on 2.6 kernels support Kerberos 5, just as Solaris does today, via RPCSEC GSS. Later versions of the NFS protocol (e.g., NFSv4) support a wide variety of authentication and security models, including Kerberos and a form of public key authentication called SPKM. In this section, we'll cover some security basics and then describe how to configure the Linux NFS client to use Kerberos. NFSv4 is also firewall-friendly, since it communicates over a single port (2049), with the caveat that callbacks require a separate port to be opened on the client. NFSv4.1 removes this restriction with the introduction of sessions.
To maintain the overall security of your Linux clients, be sure to check for and install the latest security updates from your distributor. You can find specific information on Linux NFS client security in Chapter 6 of the NFS how-to located at nfs.sourceforge.net (see appendix).
When a host wants to send UDP packets that are larger than the network's maximum transfer unit (MTU), it must fragment them. Linux divides large UDP packets into MTU-sized IP fragments and sends them to the receiving host in reverse order; that is, the fragment containing the bytes with the highest offset is sent first. Because Linux sends IP fragments in reverse order, its NFS client may not interoperate with some firewalls. Certain modern firewalls examine the first fragment in a packet to determine whether to pass the rest of the fragments. If the fragments arrive in reverse order, the firewall discards the whole packet. A possible workaround is to use only NFS over TCP when crossing such firewalls. If you find some client operations work normally while others cause the client to stop responding, try reducing your rsize and wsize to 1,024 to see if there are fragmentation problems preventing large RPC requests from succeeding.

Firewall configuration can also block auxiliary ports that the NFS protocol requires to operate. For example, traffic may be able to pass on the main NFS port numbers, but if a firewall blocks the mount protocol, lock manager, or portmapper ports, NFS cannot work. This applies to standalone router/firewall systems as well as local firewall applications such as tcpwrapper, ipchains, or iptables that might run on the client system itself. Be sure to check whether there are any rules in /etc/hosts.deny that might prevent communication between your client and server.

The Linux NFS client in 2.6 kernels supports NFS with Kerberos. The implementation is not yet mature, so there are some limitations you should be aware of as you consider deploying it. First, NFS with Kerberos is supported for all three versions of NFS. However, because NFS version 4 has combined all the auxiliary protocols (NLM, NSM, mountd, and ACL) into one protocol, it is not exposed to traditional attacks on Kerberized NFS.
Finally, NFS with Kerberos is supported only with Kerberos 5; you cannot use NFS with Kerberos in a Kerberos 4 realm. However, NFS with Kerberos works with either Active Directory Kerberos or MIT Kerberos hosted on a UNIX system. The Linux NFS client has some trouble with NFS over UDP with Kerberos; we recommend that if you want to use NFS with Kerberos, you use NFS over TCP. As mentioned previously, the NFS client in RHEL 4.0 and 2.6-based Fedora Core distributions supports only Kerberos authentication and Kerberos authentication with request integrity checking. Also, support for SECINFO and WRONGSEC is still missing from the 2.6 NFS client, so there is still no support for crossing NFS version 4 mounts that use different security flavors.

The first step in configuring your client is to ensure the local UID mapper is configured and running. Instructions for this are found above in section 4.5. An overview describing how to configure Kerberos on your Linux client for use with NFS is available at http://www.citi.umich.edu/projects/nfsv4/. To begin configuring Kerberos itself, make sure your /etc/krb5.conf reflects the proper configuration of your local Kerberos realm. You can use authconfig on your Red Hat clients to provide information about your local Kerberos realm. If you are not sure what this means, contact your local Kerberos administrator.

The next step is to acquire the host key for your client, stored in /etc/krb5.keytab. This file (by convention) stores Kerberos keys for use on your client. A normal user types in a password, which is transformed into a key; for a service or daemon, the key is stored in a file so that no password needs to be entered. A keytab entry allows a service to verify a client's identification. It can also be used by a daemon to authenticate to another Kerberized service. Your local Kerberos administrator can help you with this, as Linux uses the same standard Kerberos commands found on most UNIX systems.
The form of the key is 'nfs/<client hostname>@<KERBEROS REALM>'.
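On an MIT Kerberos KDC, creating and extracting such a key might look like the sketch below. "client1.example.com" and "EXAMPLE.COM" are placeholders for your client's hostname and realm, and your Kerberos administrator may use different tools:

```shell
# Create a random key for the client's NFS service principal, then
# extract it into the client's keytab (requires Kerberos admin rights).
kadmin -q "addprinc -randkey nfs/client1.example.com@EXAMPLE.COM"
kadmin -q "ktadd -k /etc/krb5.keytab nfs/client1.example.com@EXAMPLE.COM"

# Verify that the key is now present in the keytab.
klist -k /etc/krb5.keytab
```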
Your storage will also need a keytab and similar configuration, and you need to add an appropriate set of sec= export rules in your appliance's /etc/exports file. Instructions describing how to do this can be found in recent releases of Data ONTAP documentation or in TR-3481.

Now you are ready to start the client's GSS daemon. On some distributions, you may need to add the string SECURE_NFS=YES to the file /etc/sysconfig/nfs. Then, start the daemon with this command:

/etc/init.d/rpcgssd start

Some distributions use the chkconfig command to cause these scripts to run automatically when your client boots. To mount using NFS with Kerberos, simply add the sec=krb5 mount option to your mount command line or in /etc/fstab. Kerberos integrity checking is enabled by using sec=krb5i instead of sec=krb5. This will require users to acquire Kerberos credentials before they can access files in NFS file systems mounted with Kerberos. There are two typical ways in which users acquire these credentials: users can run the kinit command after logging in, or you can configure your client to use the Kerberos PAM library to acquire the credentials automatically when users log in. Consult your distribution's documentation for more information on how to set up Kerberos and PAM security on your Linux client.

6.3) Network Lock Manager

The NFS version 2 and 3 protocols use separate side-band protocols to manage file locking. On Linux 2.4 kernels, the lockd daemon manages file locks using the NLM (Network Lock Manager) protocol, and the rpc.statd program manages lock recovery using the NSM (Network Status Monitor) protocol to report server and client reboots. The lockd daemon runs in the kernel and is started automatically when the kernel starts up at boot time. The rpc.statd program is a user-level process that is started during system initialization from an init script.
If rpc.statd is not able to contact servers when the client starts up, stale locks will remain on the servers and can interfere with the normal operation of applications. The rpcinfo command on Linux can help determine whether these services have started and are available. If rpc.statd is not running, use the chkconfig program to check that its init script (which is usually /etc/init.d/nfslock) is enabled to run during system bootup. If the client host's network stack is not fully initialized when rpc.statd runs during system startup, rpc.statd may not send a reboot notification to all servers. Some reasons network stack initialization can be delayed are slow NIC devices, slow DHCP service, or CPU-intensive programs running during system startup. Network problems external to the client host may also cause these symptoms. Because status monitoring requires bidirectional communication between server and client, some firewall configurations can prevent lock recovery from working. Firewalls may also significantly restrict communication between a client's lock manager and a server. Network traces captured on the client and server at the same time usually reveal a networking or firewall misconfiguration. Read the section on using Linux NFS with firewalls carefully if you suspect a firewall is preventing lock management from working. Your client's nodename determines how an appliance recognizes file lock owners. You can easily find out what your client's nodename is using the uname -n or hostname command. (A system's
nodename is set on Red Hat systems during boot using the HOSTNAME value set in /etc/sysconfig/network.) The rpc.statd daemon determines which name to use by calling gethostbyname(3), or you can specify it explicitly when starting rpc.statd using the -n option. Check that netfs is running at the proper init levels:
/sbin/chkconfig --list netfs
netfs should be running at init levels 3 and 5. If netfs is not running at the proper init levels, set it so it will run at the proper levels:
/sbin/chkconfig --levels 35 netfs on
Portmap should be running at init levels 3 and 5. If portmap is not running at the proper init levels, set it so it will run at the proper levels:
/sbin/chkconfig --levels 35 portmap on
portmap should be running and owned by the user rpc. If portmap is not running, start it:
/etc/init.d/portmap start
nfslock should be running at init levels 3 and 5. If nfslock is not running at the proper init levels, set it so it will run at the proper levels:
/sbin/chkconfig --levels 35 nfslock on
The rpc.statd daemon should be running and owned by the user rpcuser. Failures caused by nfslock (rpc.statd) not running have been encountered many times on 2.4 kernels.
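A quick way to check this on a running client is a sketch like the following; it only inspects the process table with standard tools, and the daemon and user names are the conventional ones, which may differ on your distribution.

```shell
# Report whether rpc.statd is running and which user owns it.
# On a correctly configured client this should print "rpcuser".
statd_owner=$(ps -eo user,comm | awk '$2 == "rpc.statd" { print $1; exit }')
if [ -n "$statd_owner" ]; then
    echo "rpc.statd running as: $statd_owner"
else
    echo "rpc.statd is not running"
fi
```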
rpc.statd uses gethostbyname() to determine the client's name, but lockd (in the Linux kernel) uses the nodename reported by uname -n. If HOSTNAME= in /etc/sysconfig/network is set to the fully qualified domain name, lockd will use an FQDN when contacting the storage. If both lnx_node1.iop.eng.netapp.com and lnx_node5.ppe.iop.eng.netapp.com contact the same NetApp storage, the storage will be able to correctly distinguish the locks owned by each client. Therefore, we recommend using the fully qualified name in /etc/sysconfig/network. In addition, running sm_mon -l or lock break on the storage will clear the locks on the storage, which will fix the lock recovery problem. If the client's nodename is fully qualified (that is, it contains the hostname and the domain name spelled out), then rpc.statd must also use a fully qualified name. Likewise, if the nodename is unqualified, then rpc.statd must use an unqualified name. If the two values do not match, lock recovery will not work. Be sure the result of gethostbyname(3) matches the output of uname -n by adjusting your client's nodename in /etc/hosts, DNS, or your NIS databases. Similarly, you should account for client hostname clashes in different subdomains by ensuring that you always use a fully qualified domain name when setting up a client's nodename during installation. With multihomed hosts and aliased hostnames, you can use rpc.statd's -n option to set unique hostnames for each interface. The easiest approach is to use each client's fully qualified domain name as its nodename. When working in high-availability database environments, test all worst-case scenarios (such as server crash, client crash, application crash, network partition, and so on) to ensure lock recovery is functioning correctly before you deploy your database in a production environment. Ideally, you should examine network traces and the kernel log before, during, and after the locking/disaster/lock recovery events.
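The nodename agreement described above can be spot-checked with a short shell sketch. This is illustrative only: getent stands in here for the gethostbyname(3) lookup that rpc.statd performs, and the output format is our own.

```shell
# Compare the kernel nodename (used by lockd) with the canonical name
# the resolver returns (used by rpc.statd via gethostbyname).
nodename=$(uname -n)
resolved=$(getent hosts "$nodename" | awk '{ print $2; exit }')
if [ "$nodename" = "$resolved" ]; then
    echo "OK: nodename and resolved name agree ($nodename)"
else
    echo "MISMATCH: uname -n gives '$nodename', resolver gives '$resolved'"
fi
```

A MISMATCH result usually means /etc/hosts, DNS, or NIS returns a qualified name while the nodename is unqualified (or vice versa), which breaks lock recovery as described above.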
The file system containing /var/lib/nfs must be persistent across client reboots. This directory is where the rpc.statd program stores information about servers that are holding locks for the local NFS client. A tmpfs file system, for instance, is not sufficient; if the client fails to shut down cleanly, the server will not be notified that it must release any POSIX locks it thinks your client is holding. That can cause a deadlock the next time you try to access a file that was locked before the client restarted. Locking files in NFS can affect the performance of your application. The NFS client assumes that if an application locks and unlocks a file, it wishes to share that file's data among cooperating applications running on multiple clients. When an application locks a file, the NFS client purges any data it has already cached for the file, forcing any read operation after the lock to go back to the server. When an application unlocks a file, the NFS client flushes any writes that may have occurred while the file was locked. In this way, the client greatly increases the probability that locking applications can see all previous changes to the file. However, this increased data cache coherency comes at the cost of decreased performance. In some cases, all of the processes that share a file reside on the same client; in that case aggressive cache purging and flushing unnecessarily hamper the performance of the application. Solaris allows administrators to disable the extra cache purging and flushing that occur when applications lock and unlock files with the
llock mount option. Note well that this is not the same as the nolock mount option in Linux. The nolock mount option disables NLM calls by the client, but the client continues to use aggressive cache purging and flushing; essentially this is the opposite of what Solaris does when llock is in effect.
6.4) Using the Linux Automounter
For an introduction to configuring and using NFS automounters, consult Chapter 9 of O'Reilly's Managing NFS and NIS, 2nd Edition (see the appendix for URL and ISBN information). Because Linux minor device numbers have only eight bits, a single client cannot mount more than 250 or so NFS file systems. The major number for NFS mounts is the same as for other file systems that do not associate a local disk with a mount point; these are known as anonymous file systems. Because the NFS client shares the minor number range with other anonymous file systems, the maximum number of mounted NFS file systems can be even less than 250. In later releases of Linux, more anonymous device numbers are available, so the limit is somewhat higher. The preferred mechanism to work around this problem is to use an automounter. This also helps with the performance problems that occur when mounting very large root-level directories. There are two Linux automounters available: amd and automount. The autofs file system is required by both and comes built into modern Linux distributions. More information on Linux automounters is available on the Web; see the appendix for the specific URL. A known problem with the automounter in Linux is that it polls NFS servers on every port before actually completing each mount, to be sure the mount request won't hang. This can result in significant delays before an automounted file system becomes available. If your applications hang briefly when they transition into an automounted file system, make sure your network is clean and that the automounter is not using TCP to probe the appliance's portmapper.
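For reference, a minimal autofs configuration for automounting user directories from an appliance might look like the following sketch. The file locations follow common conventions, and "filer" and the volume path are placeholders, not values from this report.

```shell
# Hypothetical /etc/auto.master entry: let autofs manage /home,
# unmounting idle file systems after 600 seconds.
#   /home  /etc/auto.home  --timeout=600

# Hypothetical /etc/auto.home indirect map. The "&" substitutes the
# lookup key, so /home/alice maps to filer:/vol/users/alice.
#   *  -rw,hard,intr,proto=tcp  filer:/vol/users/&
```

These are configuration fragments, not commands; after editing the maps, restart the autofs service so they take effect.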
Ensure your infrastructure services, such as DNS, respond quickly and with consistent results. Also consider upgrading your appliance to the latest release of Data ONTAP. An automounter can cause a lot of network chatter, so it is best to disable the automounter on your client and set up static mounts before taking a network trace. Automounters depend on the availability of several network infrastructure services; if any of these services is unreliable or performs poorly, it can adversely affect the performance and availability of your NFS clients. When diagnosing an NFS client problem, triple-check your automounter configuration first. It is often wise to disable the automounter before drilling into client problem diagnosis. The Linux automounter is a single process and handles a single mount request at a time. If one such request becomes stuck, the automounter will no longer respond, causing applications to hang while waiting to enter a file system that has yet to be mounted. If you find hanging applications on a client that is managed with the automounter, be sure to check that the automounter is alive and responding to requests. Some versions of Data ONTAP do not allow mount operations to occur during the small window when a fresh version of a SnapMirror replica is brought online. If the automounter attempts to mount a volume during this brief window, it will fail, but a mount request moments later will succeed. This problem cannot be worked around by adjusting your automounter configuration, but upgrading to the latest version of Data ONTAP should resolve the issue. Using an automounter is not recommended for production servers that may need immediate access to files after long periods of inactivity. Oracle, for example, may need immediate access to its archive files
every so often, but an automounter may unmount the archive file system due to inactivity. However, if you do use an automounter in a production environment, autofs v5 is highly recommended. Refer to the Autofs TR-xxxx for best practices information. 6.5) Net Booting Your Linux NFS Clients Intel systems can support network booting using DHCP, BOOTP, and TFTP. The Intel standard for supporting network booting is called the Preboot Execution Environment, or PXE. Usually this requires network interface hardware that contains a special PROM module that controls the network boot process. Network booting is especially helpful for managing clusters, blade servers, or a large number of workstations that are similarly configured. Generally, Linux is loaded via a secondary loader such as grub, lilo, or syslinux. The secondary loader of choice for network booting is pxelinux, which comes in the syslinux distribution. See the appendix for information on how to obtain and install the syslinux distribution. Data ONTAP releases 6.3.2 and 6.4 support pxelinux, allowing Linux to boot over a network. Earlier versions of Data ONTAP support network booting but do not support pxelinux, because certain TFTP options were missing from the appliance's TFTP server. To enable TFTP access to your storage, see the Data ONTAP System Administrator's Guide. You must ensure that your client hardware supports network booting. Both the client's mainboard and network interface card must have support for network booting built in. Usually you can tell whether network booting is supported by reviewing the BIOS settings on your client or by consulting the mainboard manual for information on how to set up network booting on your client hardware; the specific settings vary from manufacturer to manufacturer. You must configure a DHCP server on the same LAN as your clients. A DHCP server provides unique network configuration information for each host on a commonly administered network.
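As a sketch, the relevant portion of an ISC dhcpd configuration for PXE booting might look like this. The subnet, addresses, and file name below are placeholders; consult the pxelinux documentation for authoritative details.

```shell
# Hypothetical /etc/dhcpd.conf fragment for PXE network booting.
# All addresses are placeholders for your own network.
subnet 192.168.1.0 netmask 255.255.255.0 {
    range 192.168.1.100 192.168.1.200;
    next-server 192.168.1.10;      # TFTP server holding boot images
    filename "pxelinux.0";         # pxelinux boot loader image
}
```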
For network booting, the DHCP server also instructs each client where to find that client's boot image. You can find instructions for configuring your DHCP server to support network booting included with the pxelinux distribution; the specific instructions for setting up a DHCP server vary from vendor to vendor. If you intend to share a common root file system among multiple Linux clients, you must create the root file system on an appliance using NFSv2. This is because of problems with how appliances interpret major and minor device numbers and because of differences between how NFSv2 and NFSv3 Linux clients transmit these numbers. Linux clients, unless told otherwise, attempt to mount their root file systems with NFSv2. If the file system was created using NFSv3, the major and minor numbers will appear incorrect when mounted with NFSv2. Kernels use these numbers to match the correct device driver to device special files (files that represent character and block devices). If the numbers are wrong, the Linux kernel will not be able to find its console or root file system, and thus it cannot boot. When setting up multiple clients with NFS root file systems, common practice is to maintain a separate root file system for each client and mount each with ro,nolock; sharing a root file system among clients is not recommended. Note that /var/lib/nfs must be persistent across reboots and unique for each client. A tmpfs file system per client, for instance, is not sufficient; if a client fails to shut down cleanly, the server will not be notified that it must release any POSIX locks it thinks that client is holding. That can cause a deadlock the next time you try to access a file locked before the client restarted. If each client mounts a
private /var/lib/nfs directory via NFS, it must be mounted using the nolock mount option, and it must be mounted before rpc.statd and lockd start on the client. For more information on Linux cluster computing, search the Internet for Beowulf or OpenMosix, or see the Linux high-availability site listed in the appendix of this document. In summary:
Check with your Linux distributor for errata on a regular basis to maintain a secure system.
Be sure your storage and clients agree on what time it is.
If your client has trouble seeing the storage, look for packet filtering on your client or switches.
Make each client's nodename its fully qualified domain name.
7) Executive Summary When setting up a Linux NFS client, you should try to get the latest kernel supported by your Linux distributor or hardware vendor. Mount with NFS over TCP and NFS version 3 where possible, and use the largest rsize and wsize that still provide good performance. Start with the hard and intr mount options. Be sure the network between your clients and your storage drops as few packets as possible. If you must use NFS over UDP, make sure to follow the instructions in the appendix for enlarging the transport socket buffers on all your UDP mounts. If you have special needs, review this document carefully. Always look for the latest errata and bug fixes from your Linux distributor and watch for new Network Appliance technical reports.
8) Appendix
8.1) Related Material
Network Appliance NOW Web site: type "Linux" in the NOW PowerSearch text box
SysKonnect's Gigabit Ethernet performance study: www.syskonnect.com/syskonnect/performance/gig-over-copper.htm
Red Hat 7.3 performance alert details: http://now.netapp.com/Knowledgebase/solutionarea.asp?id=ntapcs6648
O'Reilly's Managing NFS & NIS, Second Edition (ISBN 1-56592-510-6): www.oreilly.com/catalog/nfs2
Linux tcpdump manual page: man tcpdump
Tcpdump home page: www.tcpdump.org
Linux ethereal manual page: man ethereal
Ethereal home page: www.ethereal.com
Data ONTAP packet tracer: type "pktt list" at your appliance's console
Linux NFS manual page: man nfs
Linux NFS FAQ and how-to: http://nfs.sourceforge.net
Linux NFS mailing list: nfs@list.sourceforge.net
Linux network boot loader information: http://syslinux.zytor.com
Linux manual page with automounter information: man autofs
Linux automounter information: www.spack.org/index.cgi/AutomountHelp?action=show&redirect=LinuxAutomounter
Linux NFSv4 and NFS with Kerberos development: www.citi.umich.edu/projects/nfsv4/
Network Time Protocol daemon home pages: www.cis.udel.edu/~ntp and www.ntp.org
8.2) Special Network Settings
Enlarging the transport socket buffers your client uses for NFS traffic helps reduce resource contention on the client, reduces performance variance, and improves maximum data and operation throughput. In Linux kernels after 2.4.20, the following procedure is not necessary, as the client automatically chooses an optimal socket buffer size.
1. Become root on your client.
2. cd into /proc/sys/net/core.
3. echo 262143 > rmem_max
4. echo 262143 > wmem_max
5. echo 262143 > rmem_default
6. echo 262143 > wmem_default
7. Remount your NFS file systems on the client.
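The manual steps above can be collected into a startup-script fragment, sketched below; it assumes it runs as root before any NFS file systems are mounted.

```shell
# Startup-script fragment (run as root, before NFS mounts): enlarge the
# default and maximum socket buffer sizes used for new sockets.
for f in rmem_max wmem_max rmem_default wmem_default; do
    echo 262143 > /proc/sys/net/core/$f
done
```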
This is especially useful for NFS over UDP and when using Gigabit Ethernet. You should consider adding this to a system startup script that runs before the system mounts NFS file systems. The size we recommend is the largest safe socket buffer size we've tested. On clients with less than 16MB of memory, you should leave the default socket buffer size setting to conserve memory. Most modern Linux distributions contain a file called /etc/sysctl.conf where you can add changes such as this so they will be applied after every system reboot. Add these lines to the /etc/sysctl.conf file on your client systems:
net.core.rmem_max = 262143
net.core.wmem_max = 262143
net.core.rmem_default = 262143
net.core.wmem_default = 262143
All Linux kernels later than 2.0 support large TCP windows (RFC 1323) by default; no modification is needed to enable them, and window scaling is enabled by default. Some customers have found the following settings to help performance in WAN and high-performance LAN network environments. Use these settings only after thorough testing in your own environment over TCP.
> # NetApp storage:
> options nfs.tcp.recvwindowsize 2097152
> options nfs.ifc.xmt.high 64
> options nfs.ifc.xmt.low 8
> # Linux NFS client:
> net.core.rmem_default = 65536
> net.core.wmem_default = 65536
> net.core.rmem_max = 8388608
> net.core.wmem_max = 8388608
> net.ipv4.tcp_rmem = 4096 87380 4194304
> net.ipv4.tcp_wmem = 4096 16384 4194304
> # The following is in pages, not bytes:
> net.ipv4.tcp_mem = 4096 4096 4096
In the RHEL 5 2.6.18 kernel, the default values for tcp_rmem and tcp_wmem are 4096 87380 4194304 and 4096 16384 4194304, respectively. These values may be reasonable for systems with 1GB of
memory. For systems with large memory and very high NFS activity, the following changes may be made gradually to the TCP buffer sizes until performance begins to degrade:
echo 256960 > /proc/sys/net/core/rmem_default
echo 16777216 > /proc/sys/net/core/rmem_max
echo 256960 > /proc/sys/net/core/wmem_default
echo 16777216 > /proc/sys/net/core/wmem_max
Note: The Linux kernel supports RFC 1323 and RFC 2018 and dynamically adjusts the TCP send and receive window sizes by default. The kernel sets the actual memory limit to twice the requested value (effectively doubling rmem_max and wmem_max) to provide sufficient memory overhead. You do not need to adjust these unless you are planning to use some form of application tuning. However, TCP should be preferred on WAN connections and in other high-loss networks. If you use TCP in an environment that has high packet loss, you can adjust the net.ipv4.tcp_syn_retries parameter, which specifies the maximum number of SYN packets to send when trying to establish a TCP connection. The default is 5; the maximum is 255. The default value corresponds to a connection timeout of approximately 180 seconds. The following settings are usually the default for common GbE hardware:
ifconfig eth<N> txqueuelen 1000
net.core.netdev_max_backlog = 3000
If you are using a 10GbE interface, set this parameter to 30000:
net.core.netdev_max_backlog = 30000
Linux 2.6 kernels support advanced TCP congestion control algorithms that may help with WAN performance. Enabling tcp_bic or tcp_westwood can have some beneficial effects on WAN performance and on the overall fairness of sharing network bandwidth among multiple connections:
net.ipv4.tcp_bic = 1
net.ipv4.tcp_westwood = 1
The following parameter prevents caching ssthresh from the previous connection:
net.ipv4.tcp_no_metrics_save = 1
For very long fast paths, it may be worth trying HTCP or BIC-TCP if Reno is not performing as desired.
To set this, do the following:
sysctl -w net.ipv4.tcp_congestion_control=bic
Linux 2.4 kernels cache the slow start threshold in a single variable for all connections going to the same remote host. So, packet loss on one RPC transport socket will affect the slow start threshold on all sockets connecting to that server. The cached value remains for ten minutes. For example, the value of ssthresh for a given path is cached in the routing table. This means that if a connection has a retransmission and reduces its window, then all connections to that host for the next 10 minutes will
use a reduced window size and not even try to increase their windows. The only way to disable this behavior is to flush the cache before making new connections. To flush the cache, run (as root):
sysctl -w net.ipv4.route.flush=1
This might be necessary when network conditions caused problems that have since cleared; otherwise, cached ssthresh values will prevent good performance for ten minutes on new connections made after the network problems have cleared up. If you experience ARP storms, this could be the result of client or storage ARP caches that are too small. A reasonable workaround is to use routers to reduce the size of your physical networks. In the 2.6 kernel, however, there is a setting that disables this ssthresh caching behavior:
net.ipv4.tcp_no_metrics_save = 1
These tuning parameters are documented in the kernel source tree in Documentation/networking/ip-sysctl.txt.
8.3) Controlling File Read-Ahead in Linux
Read-ahead occurs when Linux predicts that an application may soon require file data it has not already requested. Such a prediction is not always accurate, so tuning read-ahead behavior can have some benefit. Certain workloads benefit from more aggressive read-ahead, while other workloads perform better with little or no read-ahead. By default, Linux 2.4 kernels attempt to read ahead by at least three pages, and up to 31 pages, when they detect sequential read requests from an application. Some file systems use their own private read-ahead values, but the NFS client uses the system defaults. To control the amount of read-ahead performed by Linux, you can tune the system default read-ahead parameters using the sysctl command. To see your system's current default read-ahead parameters, try:
sysctl vm.min-readahead vm.max-readahead
The min-readahead parameter sets the least amount of read-ahead the client will attempt, and the max-readahead parameter sets the most read-ahead the client may attempt, in pages. Linux determines dynamically how many pages to read ahead based on the sequentiality of your application's read requests. Note that these settings affect all reads on all NFS file systems on your client system. You can increase read-ahead if you know your workload is mostly sequential data or you are trying to improve WAN performance.
1. Become root.
2. sysctl -w vm.max-readahead=255
3. sysctl -w vm.min-readahead=15
This will set your system's read-ahead minimum and maximum to relatively high values, allowing Linux's read-ahead algorithm to read ahead as many as 255 pages. These values take effect immediately. Usually the best setting for min-readahead is the number of pages in rsize, minus one. For example, if your client typically uses rsize=32768 when mounting NFS servers, you should set min-readahead to
7. You can add this to the /etc/sysctl.conf file on your client if it supports this; see the section above for details. The 2.6 Linux kernel does not support adjusting read-ahead behavior via a sysctl parameter. However, its read-ahead logic is more adaptive and automatically detects the size of each I/O request. This change eliminates the need to treat large random I/O as sequential, along with all of the averaging code that existed just to support that. Tests have indicated that multithreaded sequential reads using the new read-ahead code in the 2.6.9 kernel and later are consistently 20-30% faster. For random reads, the new code equals the old in all cases in which the request size is less than or equal to the max_readahead size. In RHEL 5 (2.6.18), we noticed that for request sizes larger than max_readahead, the new code is as much as 50% faster. On client systems that support a database workload, try setting the minimum and maximum read-ahead values to one or zero. This optimizes Linux's read-ahead algorithm for a random-access workload and prevents the read-ahead algorithm from polluting the client's data cache with unneeded data. As always, test your workload with these new settings before making changes to your production systems.
8.4) How to Enable Trace Messages
Sometimes it is useful to enable trace messages in the NFS or RPC client to see what it does when handling (or mishandling) an application workload. Normally you should use this only when asked by an expert for more information about a problem. You can do this by issuing the following commands:
1. Become root on your client.
2. sysctl -w sunrpc.nfs_debug=1
3. sysctl -w sunrpc.rpc_debug=1
The sysrq key is one of the best (and sometimes the only) ways to determine what a machine is really doing. It is useful when a system appears to be hung or for diagnosing elusive, transient, kernel-related problems.
To enable sysrq and capture a task trace:
sysctl -w kernel.sysrq=1
echo t > /proc/sysrq-trigger
(or set kernel.sysrq = 1 in /etc/sysctl.conf)
To turn this off after the problem occurs:
sysctl -w kernel.sysrq=0
Trace messages appear in your system log, usually /var/log/messages. This can generate an enormous amount of system log traffic, so it can slow down the client and cause timing-sensitive problems to disappear or change in behavior. You should use this only when you have a simple, narrow test case that reproduces the symptom you are trying to resolve. To disable debugging, simply echo a zero into the same files. To help the syslogger keep up with the log traffic, you can disable synchronous logging by editing /etc/syslog.conf and prepending a hyphen to /var/log/messages; restart the syslog daemon to pick up the updated configuration.
8.5) How to Enable Uncached I/O on RHEL AS 2.1
Red Hat Enterprise Linux Advanced Server 2.1 Update 3 introduces a new feature that is designed to assist database workloads by disabling data caching in the operating system. This new feature is called uncached NFS I/O and is similar to the NFS O_DIRECT feature found in Enterprise Linux 3.0 and SUSE's SLES 8 Service Pack 3. When this feature is enabled, an application's read and write system
calls are translated directly into NFS read and write operations. The Linux kernel never caches the results of any read or write, so applications always get exactly what's on the server. Uncached I/O affects an entire mount point at once, unlike NFS O_DIRECT, which affects only a single file at a time. System administrators can combine mount points that are uncached (say, for shared data files) and mount points that cache data normally (say, for program executables or home directories) on the same client. Also unlike NFS O_DIRECT, uncached I/O is compatible with normal I/O. There are no alignment restrictions, so any application can use uncached I/O without modification. When uncached I/O is in effect, it changes the semantics of the noac mount option. Normally the noac mount option means that attribute caching is disabled; when the uncached I/O feature is in effect, noac means that data caching is disabled. When uncached I/O is not in effect, noac mount points behave as before. Uncached I/O is turned off by default. To enable uncached I/O, follow this procedure:
1. Become root.
2. Start your favorite editor on /etc/modules.conf.
3. Add this line anywhere:
options nfs nfs_uncached_io=1
Uncached I/O will take effect after you reboot your client. Only mount points that use the noac mount option will be affected by this change. For more information on how to use this feature with Oracle9i RAC, see NetApp TR 3189.
2007 Network Appliance, Inc. All rights reserved. Specifications subject to change without notice. NetApp, the Network Appliance logo, Data ONTAP, and SnapMirror are registered trademarks and Network Appliance and NOW are trademarks of Network Appliance, Inc. in the U.S. and other countries. Oracle is a registered trademark and Oracle9i and Oracle10g are trademarks of Oracle Corporation. Intel is a registered trademark of Intel Corporation. Linux is a registered trademark of Linus Torvalds. UNIX is a registered trademark of The Open Group. Solaris is a trademark of Sun Microsystems. All other brands or products are trademarks or registered trademarks of their respective holders and should be treated as such.