

Tips and Tricks for Diagnosing Lustre Problems on Cray Systems

Cory Spitz, Cray Inc. and Ann Koehler, Cray Inc.

ABSTRACT: As a distributed parallel file system, Lustre is prone to many different failure modes. The manner in which Lustre fails can make diagnosis and serviceability difficult. Cray deploys Lustre file systems at extreme scales, which further compounds the difficulties. This paper discusses tips and tricks for diagnosing and correcting Lustre problems for both CLE and esFS installations. It covers common failure scenarios including node crashes, deadlocks, hardware faults, communication failures, scaling problems, performance issues, and routing problems. Lustre issues specific to Cray Gemini networks are addressed as well.

KEYWORDS: Lustre, debugging, performance, file systems, esFS

1. Introduction

Due to the distributed nature and large scale of Cray-deployed Lustre file systems, administrators may find it difficult to get a handle on operational problems. However, Lustre is a critical system resource and it is important to quickly understand problems as they develop. It is also important to be able to gather the necessary debug information to determine the root cause of problems without lengthy downtime or many attempts to reproduce the problem.

This paper will touch on a broad range of common Lustre issues and failures. It will offer tips on what information to acquire from the Lustre /proc interfaces on the clients, routers, and servers to aid diagnosis and problem reporting. Some of the more common and cryptic Lustre error messages will be discussed as well. Both the traditional Cray Lustre product, which consists of direct-attached systems where Lustre servers are embedded on I/O nodes (SIO or XIO) within the Cray mainframe, and external services File Systems (esFS) offered by the Cray Custom Engineering Data Management Practice (CE DMP) will be covered.

The intended audience for this paper is system administrators, operators, and users who are familiar with Lustre terms and components. Please refer to the Lustre 1.8 Operations Manual (1) included in your CLE release documentation, or to http://www.lustre.org, for definitions. The paper does not address common cases discussed in the Lustre Operations Manual, especially the troubleshooting or debugging chapters.

2. Working with Console logs

One of the first places to look when you suspect a Lustre problem is the console log. On Cray direct-attached systems, most of the console output from Lustre clients and servers is funnelled into the console log on the SMW. Some of the minor printk-level Lustre, LNET, and LND messages are recorded in the syslog messages file on the SDB node as well. On systems with esFS, the logs could be spread out over the servers or funnelled up to the external services Maintenance Server (esMS), but the client and router messages are still routed up to the mainframe's console log on the SMW.

Lustre log messages can be overwhelming as Lustre is extremely chatty, especially at scale. For instance, if an OSS fails, each client and the MDS will become quite vocal as requests fail and are subsequently retried. In systems with thousands of clients, Lustre can easily generate hundreds of thousands of lines of console output for a single problem. printk rate limiting can only go so far, and administrators need to employ tricks to make sense of it all.



2.1 How to read a console log

The first thing you should do when investigating console logs is to separate the server logs from the clients and from one another. When the Lustre logs are so voluminous that sequential messages from a single host can span pages, they can be very difficult to follow; separating each server's logs makes it much easier to understand what is going on with a given server during a given timeframe. Cray has written a script, lustrelogs.sh [Appendix A], that pulls the server identities from the logs and writes out per-server logs. Since the tool does not require the configuration from the Lustre MGS or the <filesystem>.fs_defs file, it can be used even if failover occurred.

After the logs are separated, it is much easier to see what is happening and to identify which clients are affected. Those clients can then be inspected separately. Since Lustre is a client-server architecture, understanding the interaction is imperative to determining the root cause of failure.

Lustre identifies endpoints based on their LNET names, and you will need to understand this to identify which nodes are specified in the logs. On Cray systems, you can determine which nodes messages refer to by the strings <#>@ptl or <#>@gni. The number is the nid number and the second half of the name is the LNET network. For example, 10@gni identifies nid00010. Unfortunately, Cray console log messages are prefixed with the Cray cname, so it is beneficial to cross-reference the nid name with the output from xtprocadmin or the /etc/hosts file. Endpoints on the esFS Infiniband network look like <IP-address>@o2ib. The IPoIB address is used as a name even though IP is not used for o2iblnd LNET traffic.

Lustre error messages often include strings of cryptic data with an embedded error type or return code, typically rc, that clarifies the error once the code is deciphered. These return codes are simply Linux/POSIX errno values. Keeping errno-base.h and errno.h from /usr/include/asm-generic handy will make scanning the logs much more efficient. Then when a node complains that 'the ost_write operation failed with -30', you will know that it was because the file system was mounted (or re-mounted as) read-only, as -30 equates to EROFS.
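For example, a return code pulled from a console message can be decoded directly from the errno headers named above; a minimal sketch, using the -30 value from the example:

  # Look up errno 30 (Lustre prints it negated as -30)
  grep -wh 30 /usr/include/asm-generic/errno-base.h /usr/include/asm-generic/errno.h
  # expected to print something like: #define EROFS 30 /* Read-only file system */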
2.2 What to look for in the console logs

The next step is to identify any major system faults. Look for the strings 'LBUG', 'ASSERT', 'oops', and 'Call Trace'. Cray systems enable panic_on_lbug, so the first three will result in node panics. 'Call Trace' messages may be the result of watchdog timers being triggered, but typically, if you find any one of these messages, the node most likely hit a software bug. You will need to dump the node and warmboot.

Next, verify that Lustre started correctly. Due to the resilient nature of Lustre, problems that occur while starting the file system may not always be obvious. The console log contains a record of the start process. The section below entitled "Lustre Startup Messages" describes what to look for.

Once you have ruled out a system panic or Lustre not starting, examine the log for other common failures. These are commonly signalled by the strings 'evict', 'suspect', and 'admindown'. Messages containing these strings may be the result of a bug or some transient condition. You will need to keep looking to further diagnose the issue.

2.2.1 Lustre Startup Messages

Each server issues an informational message when it successfully mounts an MDT or OST. In Lustre 1.8.4, the message is of the form 'Now serving <object> on <device>'. The string 'Now serving' uniquely identifies these messages. In earlier Lustre versions, the message is 'Server <object> on <device> has started'; search the log for the string 'Lustre: Server'. There may not be a message for each OST due to message throttling, but there should be at least one such message from each server. These messages tend to be clustered together in the log. When you find one, examine the surrounding messages for indications of mount failures. For example, if quotas are enabled, the file system can start but fail to enable quotas for OSTs. In this case, you will see the message 'abort quota recovery'.

Once the devices are mounted, the MDS and OSSs attempt to connect with one another. Connection failure messages are normal at this stage since the servers start at different rates. Eventually, all OSSs will report 'received MDS connection from <#>@<network>'. If they do not, look for networking errors or signs that a server is down.

If all the servers start successfully, then each client will report mounting Lustre successfully with the message 'Client <fsname>-client has started'.
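A quick way to check startup from the SMW is to count these messages in the console log; this is only a sketch, and the log path is a placeholder for your system:

  # Targets that reported starting (Lustre 1.8.4 and later message form)
  grep -c "Now serving" /path/to/console.log
  # Older releases use the 'Server ... has started' form instead
  grep -c "Lustre: Server" /path/to/console.log
  # Every OSS should eventually report its MDS connection
  grep "received MDS connection from" /path/to/console.log
  # Each client should report a successful mount
  grep -c "client has started" /path/to/console.log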



You can find more information about events during start up in the output from the `/etc/init.d/lustre start` or `lustre_control.sh <filesystem>.fs_defs start` command. A copy of the output is logged to a temporary file on the boot node named /tmp/lustre_control.id. If you suspect a problem with the file system configuration, try running the `lustre_control.sh <filesystem>.fs_defs verify_config` command for clues to what may be wrong. The verify_config option checks that the device paths for targets match what is specified in the <filesystem>.fs_defs file. You may also want to try logging onto the server and issuing the `mount -t lustre` command directly. Alternatively, you can sometimes gain insight by mounting the device as an ldiskfs file system; just use `mount -t ldiskfs` instead. In that case, mounting the device read-only is preferable. The journal will be replayed upon mount. If errors are received here, it is an indication that major corruption has occurred and that the device needs repair.
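As a minimal sketch of that last step, assuming the target's block device is /dev/sdb and /mnt/ost is an empty mount point (both hypothetical):

  # Try the normal Lustre mount directly on the server
  mount -t lustre /dev/sdb /mnt/ost
  # Or inspect the backing file system read-only
  mount -t ldiskfs -o ro /dev/sdb /mnt/ost
  ls /mnt/ost          # quick look at the target's objects and configuration files
  umount /mnt/ost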
2.2.2 Client Eviction

Client eviction is a common Lustre failure scenario and it can occur for a multitude of reasons. In general, however, a server evicts a client when the client fails to respond to a request in a timely manner or fails to ping within two ping intervals, where a ping interval is defined as one-quarter of the obd_timeout value. For example, a client is evicted if it does not acknowledge a glimpse, completion, or blocking lock callback within the ldlm_timeout. On Cray systems, the ldlm_timeout defaults to 70 seconds.

A client can be evicted by any or all of the servers. If a client is evicted by all or many servers, there is a good chance that there is something truly wrong with that client. If, however, a client is only evicted by a single server, it could be a hint that the problem is not on the client or along its communication path, and might instead indicate that there is something wrong with that server. In addition, for these cases, interesting behavior can occur if a client is evicted from the MDS but not the OSSs: it might be able to write existing open files, but not perform a simple directory listing.

A client does not recognize that it has been evicted until it can successfully reconnect to a server. Since the server is unable to communicate with the client, there is little reason to attempt to inform it, since that message would likely fail as well. Eventually the client will notice that it is no longer connected. You will see 'an error occurred while communicating' along with the error code -107, which is ENOTCONN. The client will then attempt to reconnect when the next ping or I/O request is generated. Once it reconnects, the server informs the client that it was evicted. Eviction means that all outstanding I/O from a client is lost and un-submitted changes must be discarded from the buffer cache. The string 'evicting client' will denote when the server drops the client and the string 'evicted' will pinpoint when the client discovered this fact. The client message is, 'This client was evicted by service; in progress operations using this service will fail'.

During this time, the client's import state changes to EVICTED. Use `lctl get_param *.*.state` or `lctl get_param *.*.import` to get detailed information about the history and status of all connections. More information about the import interface is given in section 3.2.

Attempted I/O during this window will receive error codes such as -108, ESHUTDOWN. Most common would be -5, EIO. Typically, applications do not handle these failures well and exit. In addition, EIO means that I/O was lost. Users might portray this as corruption, especially in a distributed environment with shared files. Application writers need to be careful if they do not follow POSIX semantics for syncs and flushes. Otherwise, it is possible that they completed a write to the buffer cache that was not yet committed to stable storage when the client was evicted.

There are twists on this common failure scenario. One of them is when a client is evicted because it is out of memory, the so-called OOM condition. OOM conditions are particularly hard on Cray systems due to the lack of swap space. When the Linux kernel attempts to write out pages to Lustre in order to free up memory, Lustre may need to allocate memory to set up RDMA actions. Under OOM, this can become slow or block completely. This behavior can result in frequent connection and reconnection cycles and does not necessarily include evictions. It will generate frequent console messages as the kernel can move in fits and starts on its way out of OOM.

Router failures could be another cause of client evictions. In routed configurations for esFS, both clients and servers will round-robin messages through all available routers. Servers never resend lock callbacks, so clients could be evicted if a router completely fails or drops the callback request. Clients will eventually retry RPC transmissions, so they are not as prone to suffer secondary faults after router failure. The point here is to examine all routers between the client and server for failure or errors if clients are unexpectedly evicted.
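To see both sides of an eviction quickly, the strings above can be pulled straight out of the console log; the log path is a placeholder:

  # Server side: which clients were dropped, and when
  grep "evicting client" /path/to/console.log
  # Client side: when each client learned that it had been evicted
  grep "This client was evicted by" /path/to/console.log
  # Communication errors around the eviction window
  grep -E -- '-107|-108' /path/to/console.log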
2.2.3 Watchdog timers

In the case where server threads stall and the Lustre watchdog timer expires, a stack trace of the hung thread is emitted along with the message, 'Service thread pid <pid> was inactive for <seconds>s'. These timers monitor Lustre thread progress and are not the Linux "soft lockup" timers. Therefore, the time spent inactive is not necessarily blocking other interrupts or threads. The messages generally indicate that the service thread has encountered a condition that caused it to stall. It could be that the thread was starved for resources, deadlocked, or blocked on an RPC transmission. The bug or condition can be node or system wide, so multiple service threads could pop their watchdog timers at around the same time. Cray configures servers with 512 threads, so this can be a chatty affair.



2.2.4 Lost communication

Another class of problem that is easy to identify from console messages is connection errors such as -107 and -108, ENOTCONN and ESHUTDOWN respectively. Connection errors indicate problems with the Lustre client/server communications. The root cause of connection failures can lie anywhere in the network stack or at the Lustre level. No matter where the root problem lies, there will be information from the LNET Network Driver, or LND, since it is the interface to the transport mechanism. On Cray SeaStar systems, the LND is ptllnd and on Gemini systems it is named gnilnd. The Infiniband OFED driver, which is used for the external fabric of esFS installations, is called o2iblnd. Look for the LND name, i.e., ptllnd, gnilnd, or o2iblnd, to help pinpoint the problem. Those lines by themselves may not be that useful, so be sure to look at the preceding and following messages for additional context. For example, on SeaStar systems you might find 'PTL_NAL_FAILED' and 'beer' (Basic End-to-End Reliability) messages surrounding ptllnd messages, which would indicate that Portals failed underneath the LND. Alternatively, on a Gemini system you might find 'No gnilnd traffic received from <nid>', which could suggest a potentially failed peer. The gnilnd keeps persistent connections with keep-alives, and the local side will close a connection if it does not see receive (rx) or keep-alive traffic from the remote side within the gnilnd timeout. The default gnilnd timeout is 60 seconds. The gnilnd will close those connections with the message above when there is no traffic and later re-establish them if necessary. This will result in extra connect cycles in the logs.

If messages with the LND name do not pinpoint the problem, look for a downed node, or SeaStar or Gemini HW errors, to explain the lost connection. Generally, in the absence of any of these messages the problem is in the higher level Lustre connection. Connection problems can be complicated, especially in routed configurations, and you may have to collect additional data to diagnose the problem. We will cover data collection for gnilnd and routers in later sections.

Very high load and, as discussed earlier, OOM conditions may trigger frequent dropped connections and reconnect cycles. The messages containing the strings 'was lost' and 'Connection restored' bound the interval.
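A hedged first pass over a console log for communication problems might look like the following; adjust the log path and LND name for the system at hand:

  # Pull out LND-level messages for context around a lost connection
  grep -E "gnilnd|ptllnd|o2iblnd" /path/to/console.log | less
  # Bound a reconnect cycle: when the connection dropped and when it was restored
  grep -E "was lost|Connection restored" /path/to/console.log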
2.2.5 Node Health Checker

The Cray Node Health Checker (NHC) executes system integrity checks after an abnormal application exit. Usually, a Lustre file system check is included. The NHC Lustre test checks that the compute node can perform both metadata and I/O by executing a statfs() and creating, opening, unlinking, then writing a single file. That file is created with no explicit file stripe settings, so the test does not necessarily check every OST in the file system.

If the test passes, it does so silently and you can be assured that most of Lustre is working well. If the test fails, the node will be marked as suspect, as seen by xtprocadmin. Then the test is repeated, by default, every 60 seconds for 35 minutes. Oftentimes, if there is a server load issue or a transient network problem, a node can be marked as suspect, later pass the test, and return to the up state.

If the test fails all retry attempts, 'FAILURES: (Admindown) Filesystem_Test' will appear in the console logs and the node is marked admindown. NHC will also stop testing the node. If there is a failure, look at the LustreError messages that appear between the time the node is set suspect and the time it is set admindown, as those messages may offer stronger clues to what has gone wrong.
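To see which nodes NHC has flagged, the node states can be listed with xtprocadmin; a minimal sketch (the exact state strings may vary by release):

  # List node states and pick out anything NHC has flagged
  xtprocadmin | grep -Ei "suspect|admindown"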
2.2.6 Cray Failover and Imperative Recovery

Lustre failover on Cray direct-attached systems leverages both Lustre health status and Cray's RCA heartbeat mechanism to determine when to begin failover (2). Failover for esFS is built around esfsmon (3). For either solution, the backup server reports that it 'will be in recovery for at least <time> or until <#> clients reconnect'.

If imperative recovery is enabled, which is only available for Cray direct-attached systems, the message 'xtlusfoevntsndr: Sent ec_rca_host_cmd:' indicates that the imperative directive to clients was sent. 'Executed client switch' indicates that the client-side imperative recovery agent, xtlusfoclntswtch, made the import switch. 'Connection switch to nid=<nid> failed' indicates failure.

2.2.7 Hardware RAID errors

Linux, and in turn Lustre, do not tolerate HW I/O errors well. Therefore, Lustre is sensitive to HW RAID errors. These errors are not included in the console log and may not be included in the messages.sdb log either. Typically, errors stay resident on the RAID controller, but SCSI errors will be seen in the console log if, for example, the device reports a fatal error or the SCSI timeout for a command is exceeded. When this happens, the kernel forces the block device to be mounted read-only. At that point, Lustre will encounter the errno -30, or EROFS, on the next attempt to write to that target. Be sure to match that up with 'end_request: I/O error' to ensure it was a HW error, and not a Lustre error, that caused the target to be remounted read-only.



2.2.8 Gemini HW errors

Hardware errors reported about the HSN are recorded in special locations such as the hwerrlog.<timestamp>, netwatch.<timestamp>, or consumer.<timestamp> event logs, but the errors will be evident in the console log as well. These errors are not necessarily fatal, however.

Cray XE systems have the ability to reset the HSN on the fly in order to ride through critical HW errors that would traditionally have resulted in kernel panics, a wedged HSN, or both. The `xthwerrlog -c crit -f <file>` command will show the critical HW errors. When the Gemini suffers such a critical error, the gnilnd must perform a so-called stack reset, as all outstanding transmissions have been lost with the HW reset.

When a stack reset occurs, there will be lots of console activity, but you will see the string 'resetting all resources'. You will also see the error code -131, ENOTRECOVERABLE. Remote nodes communicating with the node that underwent the stack reset should receive the error code -14, EFAULT, indicating that the RDMA action failed. Thus, if that error is emitted, the remote peer should be checked for a stack reset condition. Additional gnilnd error codes and meanings are explained in Appendix C.

The goal of the stack reset is to keep the node from crashing. The Gemini NIC must be reset to clear the errors, but Lustre can often survive the HW error because the gnilnd pauses all transfers and re-establishes connections after the reset completes. However, a stack reset can be tricky, and the "old" memory used by the driver for RDMA cannot be reused until it is verifiably safe from remote tampering. The n_mdd_held field in the /proc/kgnilnd/stats interface shows how many memory descriptors are under "purgatory" hold.

Cray XE systems also have the capability to quiesce the HSN and re-route upon link or Gemini failure. To accommodate this feature, Cray configures both the minimum Adaptive Timeout, at_min, and the ldlm_timeout to 70 seconds. The long timeouts allow Lustre to "ride through" the re-route, but this is typically an extremely chatty process as many errors are emitted before the system can be automatically repaired.

On the console, the string 'All threads paused!' will be emitted when the quiesce event completes. Then, 'All threads awake!' will indicate that operations have resumed. When the quiesce event occurs, the gnilnd pushes out all timers so that none will expire during the quiescent period. Moreover, no LND threads are run; new requests are queued and are processed after the LND threads resume. This minimizes the number of failed transmissions.

If an LNET router was affected by either a stack reset or a lost link that was repaired with a quiesce and re-route, then it is more likely that a client could be evicted. This is because although clients can suffer RPC failures and resend, the servers do not resend blocking callbacks.
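A quick sketch for checking whether a node rode through a Gemini error; <file> is whatever event log you are examining, and the exact formatting of /proc/kgnilnd/stats may differ between releases:

  # Critical HSN errors recorded in the event logs
  xthwerrlog -c crit -f <file>
  # Did the gnilnd perform a stack reset?
  grep "resetting all resources" /path/to/console.log
  # Memory descriptors still held in "purgatory" after a reset
  grep n_mdd_held /proc/kgnilnd/stats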
2.2.9 RPC Debug messages

When you see a message like 'Lustre: 10763:0:(service.c:1393:ptlrpc_server_handle_request()) @@@ Request x1367071000625897 took longer than estimated (888+12s); client may timeout. req@ffff880068217400 x1367071000625897/t133143988007 o101->316a078c-99d7-fda8-5d6a-e357a4eba5a9@NET_0x40000000000c7_UUID:0/0 lens 680/680 e 2 to 0 dl 1303746736 ref 1 fl Complete:/0/0 rc 301/301', you would gather that the server took an extraordinary amount of time to handle a particular request, but you might throw your hands up at trying to understand the rest. The message is kept concise so the logs stay readable, but it is very terse. Messages of this type deserve explanation because they are common and will appear even at the default debug level.

The information in the second half of the message, starting at req@, is pulled from the ptlrpc_request structure used for an RPC by the DEBUG_REQ macro. There are over two hundred locations in the Lustre source that use this macro.

The data is clearly useful for developers, but what can casual users take from the message? You will quickly learn the pertinent details, but the following explains the entire macro. After the request memory address denoted by req@, the XID and transaction number (transno) are printed. These parameters are described in Section 19.2, Metadata Replay, of the Lustre Operations Manual. Next is the opcode. You will need to reference the source, but you quickly learn that o400 is the obd_ping request and o101 is the LDLM enqueue request, as these turn up often. Next comes the export or import target UUID and the request and reply portals. lens refers to the request and reply buffer lengths. e refers to the number of early replies sent under adaptive timeouts. to refers to timeout and is a logical zero or one depending on whether the request timed out. dl is the deadline time. ref is the reference count. Next, fl refers to "flags" and will indicate whether the request was resent, interrupted, complete, high priority, etc. Finally, we have the request/reply flags and the request/reply status. The status is typically an errno, but higher numbers refer to Lustre-specific uses. In the case above, 301 refers to "lock aborted".



The transno, opcode, and reply status are the most useful entries to parse while examining the logs. They can be found easily because the DEBUG_REQ macro uses the eye catcher '@@@'. Therefore, whenever you see that in the logs, you will know that the message is of the DEBUG_REQ format.
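As a worked example, the fields of the sample message above break down as follows, using the field descriptions given in this section (the UUID is abbreviated):

  req@ffff880068217400    request memory address
  x1367071000625897       XID
  t133143988007           transaction number (transno)
  o101                    opcode (LDLM enqueue; o400 would be obd_ping)
  316a078c-...@NET_0x40000000000c7_UUID
                          export/import target UUID
  :0/0                    request and reply portals
  lens 680/680            request and reply buffer lengths
  e 2                     two early replies sent under adaptive timeouts
  to 0                    the request did not time out
  dl 1303746736           deadline (UNIX time)
  ref 1                   reference count
  fl Complete:/0/0        flags, then the request/reply flags
  rc 301/301              request/reply status; 301 means "lock aborted"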
2.2.10 LDLM Debug messages

Another useful error message type is the LDLM_ERROR macro message. This macro is used whenever a server evicts a client, so it is quite common. It uses the eye catcher '###' and can therefore be easily found as well.

An example client eviction looks like, 'LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 603s: evicting client at 415@ptl ns: mds-test-MDT0000_UUID lock: ffff88007018b800/0x6491052209158906 lrc: 3/0,0 mode: CR/CR res: 4348859/3527105419 bits 0x3 rrc: 5 type: IBT flags: 0x4000020 remote: 0x6ca282feb4c7392 expref: 13 pid: 11168 timeout: 4296831002'. However, the message differs slightly depending upon whether the lock type was extent, ibits, or flock. The type field will read EXT, IBT, or FLK respectively.

For any lock type, ns refers to the namespace, which is essentially the lock domain for the storage target. The two mode fields refer to the granted and requested modes. The modes are exclusive (EX), protective write (PW), protective read (PR), concurrent write (CW), concurrent read (CR), or null (NL). The res field can be particularly handy as it refers to the inode and generation numbers for the resource on the ldiskfs backing store. Finally, for extent locks, the extent ranges for the granted and requested areas are listed respectively after the lock type. Do not fret over what appears to be an extremely large extent size, as extent locks are typically granted for a full file, which could support the maximum file size. The typical range is 0->18446744073709551615, which is simply 0xffffffffffffffff or -1.
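Because res carries the inode and generation numbers on the ldiskfs backing store, it can sometimes be mapped back to a path with standard e2fsprogs tools. This is only a sketch, assuming the target's device is /dev/sdX (hypothetical); debugfs opens the device read-only by default:

  # res: 4348859/3527105419 -> inode 4348859 on the backing ldiskfs file system
  debugfs -c -R "ncheck 4348859" /dev/sdX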
3. Collecting additional debug data

It is often necessary to gather additional debug data beyond the logs. There is a wealth of Lustre information spread across servers, routers, and clients that should be extracted. Some of the information is human readable from the /proc interface on the specific node. First, we will cover some tools that you can use to gather the data, and then we will point out some especially useful interfaces.

3.1 Lustre debug kernel traces

Lustre uses a debug facility commonly referred to as the dk log, short for debug kernel. It is documented in Chapter 24, Lustre Debugging, of the Lustre Operations Manual and the lctl man page. However, there are some additional quick tips that are useful, especially when recreating problems to collect data for bug reporting.

Lustre routines use a debug mask to determine whether to make a dk log entry. The default debug mask is a trade-off between usefulness and performance. We could choose to log more, but then we would suffer reduced performance. When debugging problems, it is useful to unmask other debug statements in critical sections. To enable all logging, execute `lctl set_param debug=-1; lctl set_param subsystem_debug=-1`.

The dk log is a ring buffer, which can quickly overflow during heavy logging. Therefore, when enhancing the debug mask you should also grow the buffer to accommodate a larger debug history. The maximum size is roughly 400 MiB, which can be set with `lctl set_param debug_mb=400`.

The current debug mask can be read with `lctl get_param debug` and can easily be updated using "+/-" notation. For example, to add RPC tracing, simply run `lctl set_param debug="+rpctrace"`. The desired traces will depend upon the problem, but "rpctrace" and "dlmtrace" are generally the most useful trace flags.

The debug log with a full debug mask will trace entry and exit of many functions, lock information, RPC info, VFS info, and more. The dk log will also include all items inserted into the console log. These details are invaluable to developers and support staff, but because so much information is gathered, it can be difficult to correlate the logs with external events such as the start of a test case. Therefore, the logs are typically cleared with `lctl clear` when beginning data collection and annotated with `lctl mark <annotation>` as the test progresses.

The log is dumped with `lctl dk <filename>`. This method will automatically convert the output into a human readable format. However, this processing on a busy node may interfere with debug progress. You can also dump the log in a binary format by appending a '1', as in `lctl dk <filename> 1`. This saves a lot of time for large logs on the local node and ensures timely data collection. You can post-process the binary dk log and turn it into a human readable format later with `lctl df <binary_dklog> <output_filename>`.

Lustre dk logs can be configured to dump upon timeouts or evictions with the tunables dump_on_timeout and dump_on_eviction respectively. The dk logs can be dumped for other reasons in addition to timeouts and evictions as well. It will be evident in the logs that a dump has occurred because 'LustreError: dumping log to <path>' will be added to the console log. The path is configurable via /proc/sys/lnet/debug_path and defaults to /tmp/lustre-log.



The dumps should be collected if possible. Since the path is well known, there is no reason to first extract the file names from the error messages. When one of these events occurs, the logs are dumped in binary format, so they will need to be converted with `lctl df` after they are collected. In addition, the log will contain entries that may be out of order in time. Cray has written sort_lctl.sh, included in Appendix B, that will reorder the entries chronologically. Another handy utility included in Appendix B is lctl_daytime.sh, which converts the UNIX time to time of day.
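Putting the pieces of this section together, a typical data-collection pass on a suspect node might look like the following sketch; the trace flags, buffer size, and file names are illustrative only:

  # Widen the debug mask and grow the ring buffer before reproducing the problem
  lctl set_param debug="+rpctrace"
  lctl set_param debug="+dlmtrace"
  lctl set_param debug_mb=400
  lctl clear
  lctl mark "=== test case starting ==="
  #   ... reproduce the problem ...
  lctl mark "=== test case finished ==="
  # Dump in binary (fast), then convert and sort offline
  lctl dk /tmp/dklog.bin 1
  lctl df /tmp/dklog.bin /tmp/dklog.txt
  sort_lctl.sh /tmp/dklog.txt     # produces /tmp/dklog.txt.sort (Appendix B)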
3.2 State and stats

In addition to the console and dk logs, there are some special files in the /proc interfaces that can be useful. Snapshots of these files prove useful during investigations or when reproducing a problem for a bug report. The llstat tool can be used to clear the stats and display them on an interval.

The client import state /proc interface contains a wealth of data about the client's server connections. There is an 'import' file on the client for each metadata client (mdc) and object storage client (osc). All of the files can be retrieved with `lctl get_param *.*.import`. The import interface shows connection status and RPC state counts. This file can be monitored to get a quick read on the current connection status and should be gathered when debugging communication problems.

The import file will also include the average wait time for all RPCs and the service estimates, although they are brought up to the adaptive timeout minimum (at_min) floor, which again by default on Cray systems is 70 seconds. The timeouts file includes real estimates of network latency. For stats on a per-operation basis, inspect `lctl get_param *.*.stats` to see service counts, min (fastest) service time in µsecs, max (slowest) service time in µsecs, and sum and sum-squared statistics.

Failover and recovery status can be acquired with `lctl get_param obdfilter.*.recovery_status` for an OSS or `lctl get_param mds.*.recovery_status` for the MDS. It is useful to periodically display the progress with /usr/bin/watch. It is also useful to monitor a select client's import connection as well. This information could be useful if the recovery does not complete successfully.

The nis, peers, and, if appropriate for esFS, buffers, routes, and routers files should be gathered from /proc/sys/lnet when investigating LNET and LND problems. They provide the state of the LNET resources and whether peers and routers are up or down. These interfaces are detailed later in section 4.3.
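A hedged example of how these interfaces might be watched and snapshot during an investigation; the intervals and choice of nodes are illustrative:

  # On a client: watch the import state of every mdc and osc connection
  watch -n 10 'lctl get_param *.*.state'
  # On an OSS: watch recovery progress after a failover
  watch -n 10 'lctl get_param obdfilter.*.recovery_status'
  # Snapshot the LNET state on clients, routers, and servers
  for f in nis peers buffers routes routers; do
      cat /proc/sys/lnet/$f > /tmp/lnet.$f.$(date +%s) 2>/dev/null
  done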
4. Performance

This section will not necessarily tell you how to tune your Lustre file system for performance; instead, it details the causes of performance problems and common-sense approaches to finding and mitigating them.

4.1 Metadata performance

One of the biggest complaints about Lustre is slow metadata performance. This complaint is most often voiced as the result of user experiences with interactive usage rather than metadata performance for their applications. Why is that?

Lustre clients are limited to one concurrent modifying metadata operation in flight to the MDS, which is terrible for single-client metadata performance. A modifying operation would be an open or create. Although close is not a modifying operation, it is treated as one for recovery reasons. Examples of non-modifying operations are getattr and lookup.

With enough clients, aggregate metadata rates for a whole file system may be just fine. In fact, across hundreds of clients the metadata performance can scale very nicely in cases like file-per-process style application I/O. But when there are many users on a single node, then you have a problem. This is exactly the situation one finds with the model used on Cray systems with login nodes. Fortunately, any reasonable number of login nodes is supported. Because the nodes cannot use swap, additional login nodes are added to the system as interactive load and memory usage increase. However, if users are dissatisfied with the interactive Lustre performance, it would also make sense to add login nodes to support more simultaneous modifying metadata operations.

The client's mdc max_rpcs_in_flight parameter can be tuned up to do more non-modifying operations in parallel. The value defaults to 8, which may be fine for compute nodes, but this is insufficient for decent performance on login nodes, which typically use metadata more heavily.
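For example, raising the parameter on a login node might look like the sketch below; the value 32 is only illustrative, and the setting does not persist across a remount:

  # Check the current setting for each metadata client
  lctl get_param mdc.*.max_rpcs_in_flight
  # Allow more concurrent non-modifying metadata RPCs from this node
  lctl set_param mdc.*.max_rpcs_in_flight=32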



Lustre includes a feature to 'stat'-ahead metadata information when certain access heuristics are met, like `ls -l` or `rm -rf`, similar to how data is read ahead upon read access. Unfortunately, statahead is buggy and Cray has had to disable the feature [1].

In addition to poor single-client metadata performance, users often make the problem worse by issuing commands to retrieve information about the file system, which further clogs the MDS and the pipe from each client. Typically, users just want to know if the file system is healthy, but the commands that they issue give them much more information than they might need and thus are more expensive. Instead of /bin/df, which issues expensive stat() or statfs() system calls, a simple `lfs check servers` [2] will report the health of all of the servers. Also, `lctl dl` (device list) will cheaply (with no RPC transmission) show the Lustre component status and can be used on clients to see whether OSTs are UP or IN (inactive).

[1] Lustre bug 15962 tracks a deficiency in statahead.
[2] Lustre bug 21665 documents a recent regression with `lfs check servers` that resulted in EPERM errors for non-root users. This regression has been fixed in Lustre 1.8.4, included in CLE 3.1 UP03.
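A cheap health check from a login node, using the commands just mentioned:

  # Ping every server without issuing a stat()/statfs() against each mount
  lfs check servers
  # Locally cached device status; look for OSTs reported as IN (inactive)
  lctl dl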
Another way that users can further reduce the metadata load is to stop using `ls -l` where a simple `ls` would suffice. Also be advised that `ls --color` is expensive too, and that Cray systems alias ls to `ls --color=tty`. The reason it is expensive is that if file size or mode is needed, then the client must generate extra RPCs for the stat() or file glimpse operation for each object on an OST. Moreover, the requests cannot be batched up into a single RPC, so each file listed will generate multiple RPCs (4). This penalty can be very large when files are widely striped. For instance, if the file striping is set to '-1', then up to 160 RPCs will be generated for that single file (160 is the maximum stripe count).

Due to the way that the Lustre Distributed Lock Manager (LDLM) handles parallel modifying operations in a single directory, threads can become blocked on a single resource. Moreover, threads must hold a resource until clients acknowledge the operation. Even though Cray systems configure 512 service threads, they can all be quickly consumed due to the blocking. If most of the service threads become serialized, then all other metadata services, including those for unrelated processes, will degrade. This will occur even if there are extra login nodes to spread out the metadata load because the bottleneck is on the server side. Long delays are thus inserted, and it can take many minutes to clear out the backlog of requests on large systems. This behavior typically occurs when large file-per-process applications are started that create large numbers of files in a single, shared directory.

There is no good way to identify this condition, but it is useful to inspect metadata service times for file system clients. This can be done quickly by monitoring the mdc import and stats files as described in section 3.2.

4.2 Bulk read/write performance

Lustre provides a variety of ways to measure and monitor bulk read/write performance in real time. In addition, other Linux tools such as iostat and vmstat are useful, but will not be covered here.
On the client side, `lctl get_param osc.*.rpc_stats` will show counts for in-flight I/O. DIRECT_IO is broken out separately from buffered I/O. This and other useful utilities for monitoring client-side I/O are covered in Section 21.2, Lustre I/O Tunables, of the Lustre Operations Manual (1). On the server side, the obdfilter brw_stats file contains much useful data and is covered in the same section of the manual.

Use the brw_stats data to monitor the disk I/O sizes. Lustre tries very hard to write aligned 1 MiB chunks over the network and through to disk. Typical HW RAID devices work faster that way. Depending on the RAID type, expensive read-modify-write operations or cache mirroring operations may occur when the I/O size or alignment is suboptimal. There are a number of causes of fragmented I/O, and brw_stats will not indicate why I/O was not optimal, but it will indicate that something needs investigation.

The brw_stats file also gives a histogram of I/O completion times. If you are seeing a large percentage of your I/O complete in seconds or even tens of seconds, it is an indication that something is likely wrong beyond heavy load. Oftentimes disk subsystems suffer poor performance without error, or a RAID rebuild is going on. That activity is not visible to Lustre unless the degradation becomes extreme.

The main brw_stats file contains all the data for a particular OST. However, per-client stats are broken out into obdfilter.*.exports.*.brw_stats. This can be used to isolate the I/O stats from a particular client.
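A sketch of pulling these counters during a slow-I/O investigation; run the first command on a client and the others on the OSSs:

  # Client side: distribution of RPCs in flight, with DIRECT_IO broken out
  lctl get_param osc.*.rpc_stats
  # Server side: per-OST bulk I/O size and completion-time histograms
  lctl get_param obdfilter.*.brw_stats
  # Narrow to a single client's traffic against each OST
  lctl get_param obdfilter.*.exports.*.brw_stats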



As part of the Lustre install, the sd_iostats patch is applied to the kernel, which provides an interface in /proc/scsi/sd_iostats/*. This file can be used to corroborate the brw_stats. It is useful because it includes all I/O to the block device, including metadata and journal updates for the backing ldiskfs file system. As the name implies, brw_stats only tracks the bulk read and write stats.

Because Lustre performance can degrade over time, it is useful to always keep a watchful eye on performance. Use llobdstat to get a quick read on a particular OST (see Section 21.3.1.2 in the Lustre Operations Manual). The Lustre Monitoring Tool is useful for both real-time (5) and post-mortem performance analysis (6) and is a good best practice to employ. In addition, other best practices such as constant functionality testing and monitoring thereof (7) can be used on a regular basis to spot performance regressions.

The OSS read cache is built on top of the Linux buffer cache, so it follows the same semantics. Thus, increased memory pressure will cause the read cache to be discarded. It also works the other way: increased cache usage can cause other caches to be flushed. Linux does not know which caches are most important and can at times flush much more important file system metadata, such as the ldiskfs buddy maps. Cray Lustre contains an optimization for O_DIRECT reads and writes that causes them to always bypass the OSS cache. This extends the POSIX semantics of O_DIRECT, which say "do not cache this data", to the OSS. However, buffered reads and writes can still exert considerable memory pressure on the OSS, so it can be valuable to tune the maximum file size that the OSS will cache. By default the size is unlimited, but it is of little value to cache very large files, and we can save the cache and memory space. The read cache can be tuned by issuing, for example, `lctl set_param obdfilter.*.readcache_max_filesize=32M`. It can also be set permanently for all OSSs in a file system from the MGS via `lctl conf_param <fsname>.obdfilter.readcache_max_filesize=32M`.

4.3 LNET performance

LNET performance is critical to overall Lustre performance. LNET uses a credit-based implementation to avoid consuming too many HW resources or spending too many resources for communication to a specific host. This is done for fairness. Understanding that credits are a scarce resource will allow for better tuning of LNET. Each LND is different, but the ptllnd, gnilnd, and o2iblnd all have a concept of interface credits and peer credits, the latter being a resource that can only be consumed for a specific peer. There are four kinds of credits relevant to tuning performance: network interface (NI) transmit (tx) credits, peer tx credits, router buffer credits, and peer router buffer credits.

The interface credit count is the maximum number of concurrent sends that can occur on an LNET network. The peer credit count is the number of concurrent sends allowed to a single peer. LNET limits concurrent sends to a single peer so that no peer can occupy all of the interface credits.

Sized router buffers exist on the routers to receive transmissions from remote peers, and the router buffer credits are a count of the available buffer slots. The peer router buffer credits exist for the same reason that LNET peer tx credits do: so that a single peer cannot monopolize the buffers.

4.3.1 Monitoring LNET credits

The credits are resources like semaphores. Both an interface credit and a peer credit must be acquired (decremented) to send to a remote peer. If either interface or peer credits are unavailable, then the operation will be queued.

/proc/sys/lnet/nis lists the maximum number of NI tx credits and peer credits along with the currently available NI tx credits per interface. When there are insufficient credits, operations queue and the credit count will become negative; the absolute value is the number of transmits queued. /proc/sys/lnet/nis also records the low water mark for interface credits, which is marked as "min". If this number becomes negative, then more credits may be needed.

LNET and the LNDs also keep track of per-peer resources and make them visible in /proc/sys/lnet/peers. Most importantly for this view, the two "min" columns track the low water marks for peer router buffer credits and peer tx credits.

These LNET interfaces are documented in Section 21.1.4 of the Lustre Operations Manual.
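A quick check for credit exhaustion on a client, server, or router; negative values in the "min" columns are what to look for:

  # Interface (NI) credits and their low water marks
  cat /proc/sys/lnet/nis
  # Per-peer credits; the two "min" columns are the low water marks
  cat /proc/sys/lnet/peers
  # On a router, the global router buffer pools
  cat /proc/sys/lnet/buffers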
4.3.2 LNET router performance

The primary consideration for LNET routers is having enough bandwidth to take full advantage of the back-end bandwidth to disk. However, due to the nature of credit-based resource allocation, it is possible for LNET routers to choke aggregate bandwidth. For communication to routers, not only must an NI tx credit and a peer tx credit be consumed, but a global router buffer and a peer router buffer credit are needed as well.

The LNET kernel module parameters tiny_router_buffers, small_router_buffers, and large_router_buffers account for the global router buffer credits and are visible in the /proc/sys/lnet/buffers file. The global router credits pertain to memory pools of size less than one page, one page, and 1 MiB for the tiny, small, and large buffer tunables respectively. Again, negative numbers in the "min" column indicate that the buffers have been oversubscribed. If the load seems reasonable, you can increase the number of router buffers of a particular size to avoid stalling under the same load in the future.

The number of peer router buffer credits defaults to the LND peer tx max credit count. Therefore, the LNET module parameter peer_buffer_credits should be tuned on the routers to allow the global router buffers to be fully consumed.
5. Conclusion

Lustre is a complex distributed file system and, as such, it can be quite difficult to diagnose and service. In addition, since Lustre is a critical system resource, it is important to investigate and fix issues quickly to minimize downtime. The tips and techniques presented here cover a broad range of knowledge of Cray Lustre systems and are a primer on how to investigate Lustre problems in order to achieve quick diagnosis of issues.



6. References

1. Oracle. Lustre Operations Manual S-6540-1815.
CrayDoc. [Online] March 2011.
http://docs.cray.com/books/S-6540-1815.
2. Automated Lustre Failover on the Cray XT. Nicholas
Henke, Wally Wang, and Ann Koehler. Atlanta :
Proceedings of the Cray User Group, 2009.
3. Cray Inc. esFS FailOver 2.0. 2011.
4. Feiyi Wang, Sarp Oral, Galen Shipman, Oleg
Drokin, Tom Wang, Isaac Huang. Understanding
Lustre Filesystem Internals. http://www.lustre.org/lid.
[Online] 2009.
http://wiki.lustre.org/lid/ulfi/complete/ulfi_complete.html
#_why_em_ls_em_is_expensive_on_lustre.
5. Chris Morrone, LLNL. Lustre Monitoring Tool
(LMT) . Lustre Users Group 2011. [Online] April 2011.
http://www.olcf.ornl.gov/wp-content/events/lug2011/4-
13-2011/400-430_Chris_Morrone_LMT_v2.pdf.
6. Andrew Uselton, NERSC. The Statistical Properties
of Lustre Server-side I/O. Lustre Users Group 2011.
[Online] April 2011. http://www.olcf.ornl.gov/wp-
content/events/lug2011/4-12-2011/1130-
1200_Andrew_Uselton_LUG_2011-04-11.pdf.
7. Nick Cardo, NERSC. Detecting Hidden File System
Problems. [Online] 2011. http://www.olcf.ornl.gov/wp-
content/events/lug2011/4-13-2011/230-
300_Nick_Cardo.pptx.

7. Acknowledgments
Many thanks to the Cray benchmarking and SPS staff, including the field support staff, for always providing the needed data, insights, and operational support; the authors based this paper on their experience.
Also, thank you to the CUG 2010 attendees for
requesting this type of contribution from the Cray Lustre
team.
Finally, thank you to Nic Henke for providing insight
into gnilnd internals. Questions not covered in this
paper pertaining to the gnilnd internals can be directed
to nic@cray.com.
This material is based upon work supported by the
Defense Advanced Research Projects Agency under its
Agreement No. HR0011-07-9-0001. Any opinions,
findings and conclusions or recommendations expressed
in this material are those of the author(s) and do not
necessarily reflect the views of the Defense Advanced
Research Projects Agency.

8. About the Authors


Cory is the team lead for Lustre integration at Cray. His email address is spitzcor@cray.com. Ann is a file systems engineer at Cray. Ann can be reached at amk@cray.com. Cory and Ann can both be reached at 380 Jackson Street, St. Paul, MN, 55101.



Appendix A
lustrelogs.sh

#!/bin/bash
#
# Extract Lustre server messages from console log into separate files
# Copyright 2011 Cray Inc. All Rights Reserved.
#
# Usage: <script> <console log> [<hosts>]

usage () {
echo ""
echo "*** Usage: $(basename $0) [-h] <console_log>"
echo ""
echo "*** Extracts MDS and OSS messages from the specified console"
echo "*** log and places them in separate files based on node id."
echo ""
echo "*** File names identify server type, cname, nid, and objects"
echo "*** on the server. The OST list in file name is not guaranteed"
echo "*** to be complete but the gaps in the numbers usually makes "
echo "*** this obvious."
echo ""
echo "*** Options:"
echo "*** -h Prints this message."
echo ""
}

while getopts "h" OPTION; do
    case $OPTION in
        h) usage
           exit 0
           ;;
        *) usage
           exit 1
    esac
done
shift $((OPTIND - 1))

if [ "$1" == "" ]; then


usage
exit 1
fi
CONSOLE_LOG=$1

# Parses cname and Lustre object from console log messages of the form:
#
# [2010-10-12 04:16:36][c0-0c0s4n3]Lustre: Server garnid15-OST0005 on device /dev/sdb has started
#
# Finds cname/nid pairings for server nodes. Record format is:
# 2010-10-12 04:13:22][c0-0c0s0n3] HOSTNAME: nid00003
#
# Builds filenames: <oss | mds>.<cname>.<nid>.<target list>
# Extracts records for cname from console file and writes to <filename>

# Lustre Version 1.6.5 and 1.8.2


srch[1]="Lustre: Server"
objfld[1]=4

# Version 1.8.4 and later


srch[2]="Lustre: .*: Now serving"
objfld[2]=3

# Produces: mds:c#-#c#s#n:.MDT0000.MGS or
# oss#:c#-#c#s#n:.OST####.OST####...



find_servernodes () {
local obj_field=$1
local srch_string=$2

SERVERS=$( \
grep "${srch_string}" $CONSOLE_LOG | sort -k ${obj_field} -u | \
awk -v fld=$obj_field \
'{match($2, /c[0-9]+-[0-9]+c[0-9]+s[0-9]+n[0-9]+/, cn);
obj=$(fld)
sub(/^.*-/, "", obj);
nodes[cn[0]] = sprintf("%s.%s", nodes[cn[0]], obj);
}
END {
ndx=0
for (cname in nodes) {
if (match(nodes[cname], /OST/)) {
printf "oss%d:%s:%s ", ndx, cname, nodes[cname];
ndx++;
}
else
printf "mds:%s:%s ", cname, nodes[cname];
}
}'
)
}

# Main

SERVERS=""
for idx in $(seq 1 ${#srch[@]}); do
find_servernodes ${objfld[$idx]} "${srch[$idx]}"
if [ "${SERVERS}" != "" ]; then
break
fi
done

nid_file="/tmp/"$(mktemp .nidsXXXXX)
grep "HOSTNAME" ${CONSOLE_LOG} > ${nid_file}

echo "Creating files:"


for name in ${SERVERS}; do
nm=(${name//:/ })
prefix=${nm[0]};
cname=${nm[1]};
objs=${nm[2]};

nid="."$(grep ${cname} ${nid_file} | awk '{print $4}')

fname=${prefix}.${cname}${nid}${objs}
echo " "$fname
grep "${cname}" ${CONSOLE_LOG} > ${fname}
done
rm ${nid_file}



Appendix B
sort_lctl.sh

#!/bin/bash

#
# Sort Lustre dk log into chronological order
# Copyright 2011 Cray Inc. All Rights Reserved.
#

INF=$*

for inf in $INF; do
    cat $inf | sort -n -s -t: -k4,4 > $inf.sort
done

lctl_daytime.sh

#!/bin/bash

#
# Convert dk log into time of day format
# Copyright 2011 Cray Inc. All Rights Reserved.
#

if [ $# -lt 2 ]; then
echo "usage: $(basename $0) <input_file> <output_file>"
exit 1
fi

awk -F":" '{ format = "%a %b %e %H:%M:%S %Z %Y"; $4=strftime(format,


$4); print}' $1 > $2



Appendix C
gnilnd error codes and meanings from Cray intranet http://iowiki/wiki/GeminiLNDDebug

NOTE: The text description from errno.h is provided to reference the string printed from things like strerror and doesn't reflect
the exact use in the gnilnd. Some errors are used in a bit of a crafty manner.
Error code (name)
text description from errno.h - description of error(s) in the gnilnd
-2 (-ENOENT)
No such file or directory - could not find peer, often for lctl --net peer_list, del_peer, disconnect, etc.
-3 (-ESRCH)
No such process - RCA could not resolve NID to NIC address.
-5 (-EIO)
I/O error - generic error returned to LNET for failed transactions, used in gnilnd for failed IP sockets reads, etc
-7 (-E2BIG)
Argument list too long - too many peers/conns/endpoints
-9 (-EBADF)
Bad file number - could not validate connection request (datagram) header - like -EPROTO, but for different fields
that should be more static. Most likely a corrupt packet - it will be dropped instead of the NAK for -EPROTO.
-12 (-ENOMEM)
Out of memory - memory couldn't be allocated for some function; also indicates a GART registration failure (for now)
-14 (-EFAULT)
Bad address - failed RDMA send due to fatal network error
-19 (-ENODEV)
No such device - connection request to invalid device
-53 (-EBADR)
Invalid request descriptor - couldn't post datagram for outgoing connection request
-54 (-EXFULL)
Exchange Full - too many SMSG retransmits
-57 (-EBADSLT)
Invalid slot - datagram match for wrong NID.
-70 (-ECOMM)
Communication error on send - we couldn't send an SMSG (FMA) due to a GNI_RC_TRANSACTION_ERROR to
peer. This means that there was some HW issue in trying the send. Check for errors like SMSG send error to
29@gni: rc 11 (SOURCE_SSID_SRSP:REQUEST_TIMEOUT) to find the type and cause of the error.
-71 (-EPROTO)
Protocol error - invalid bits in messages, bad magic, wire version, NID wrong for mailbox, bad timeout. Remote peer
will receive NAK.
-100 (-ENETDOWN)
Network is down - could not create EP or post datagram for new connection setup
-102 (-ENETRESET)
Network dropped connection because of reset - admin ran lctl --net gni disconnect
-103 (-ECONNABORTED)
Software caused connection abort - could not configure EP for new connection with the parameters provided from
remote peer
-104 (-ECONNRESET)
Connection reset by peer - remote peer sent CLOSE to us
-108 (-ESHUTDOWN)
Cannot send after transport endpoint shutdown - we are tearing down the LND.
-110 (-ETIMEDOUT)
Connection timed out - connection did not receive SMSG from peer within timeout
-111 (-ECONNREFUSED)
Connection refused - hardware datagram timeout trying to connect to peer.



-113 (-EHOSTUNREACH)
No route to host - error when connection attempt to peer fails
-116 (-ESTALE)
Stale NFS file handle - older connection closed due to new connection request
-117 (-EUCLEAN)
Structure needs cleaning - admin called lctl --net gni del_peer
-125 (-ECANCELED)
Operation Canceled - operation terminated due to error injection (fail_loc) - not all injected errors will do this.
-126 (-ENOKEY)
Required key not available - bad checksum
-131 (-ENOTRECOVERABLE)
State not recoverable - stack reset induced TX or connection termination

