Lustre Admin Monitor
File System Administration and Monitoring
Jesse Hanley
Rick Mohr
Jeffrey Rossiter
Sarp Oral
Michael Brim
Jason Hill
Neena Imam
* Joshua Lothian (MELT)
DoD HPC
Research
Program
Starting a Lustre file system
Mounting Strategies
Client Mounting
Stopping a Lustre file system
Usage Reports
Purging
Purging Tools
Handling Full OSTs
• Once the index of the OST is found, running “lfs quota” with
the -I argument reports each user's usage on that OST.
– for user in $(users); do lfs quota -u "$user" -I 221 /lustre/testfs; done
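The loop above can also be sketched in Python. This is a minimal sketch, not a definitive tool: the function names `quota_command` and `per_user_ost_usage` are hypothetical, and the user list is passed in explicitly because the shell `users` command only reports accounts that are currently logged in. The `lfs quota` arguments are exactly those shown on the slide.

```python
import subprocess


def quota_command(user, ost_index, mount_point):
    """Build the argv for `lfs quota -u USER -I OST_INDEX MOUNT_POINT`."""
    return ["lfs", "quota", "-u", user, "-I", str(ost_index), mount_point]


def per_user_ost_usage(users, ost_index, mount_point):
    """Run `lfs quota` once per user and return {user: raw output text}.

    The output is returned unparsed; its exact layout depends on the
    Lustre version, so no format is assumed here.
    """
    reports = {}
    for user in users:
        result = subprocess.run(
            quota_command(user, ost_index, mount_point),
            capture_output=True,
            text=True,
            check=False,  # a failing user should not abort the whole sweep
        )
        reports[user] = result.stdout
    return reports
```

For example, `per_user_ost_usage(["alice", "bob"], 221, "/lustre/testfs")` would run the same two commands the shell loop would run for those users (the user names are placeholders).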
Handling Full OSTs (cont.)
Monitoring
Monitoring – General Software
• Nagios
– Nagios is a general purpose monitoring solution.
– A system can be set up with host and service checks. There is
native support for host-down checks and various service
checks, including file system utilization.
– Nagios is highly extensible, allowing for custom checks
• This could include checking the contents of the /proc/fs/lustre/
health_check file.
– It’s an industry standard and has proven to scale to hundreds
of checks
– Open source (GPL)
– Supports paging on alerts and reports. Includes a multi-user
web interface
– https://www.nagios.org
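A custom check on the health_check file could look roughly like the sketch below. It follows the standard Nagios plugin convention of exit codes 0 (OK), 2 (CRITICAL), and 3 (UNKNOWN) plus a one-line status message. It assumes the file contains the single word "healthy" when all is well and treats any other content as critical; the function names are hypothetical.

```python
import sys

# Standard Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def check_lustre_health(path="/proc/fs/lustre/health_check"):
    """Return (exit_code, message) for the given health_check file."""
    try:
        with open(path) as handle:
            status = handle.read().strip()
    except OSError as exc:
        # Missing or unreadable file: Lustre may not be loaded on this node.
        return UNKNOWN, f"UNKNOWN - cannot read {path}: {exc}"
    if status == "healthy":  # assumption: exact text reported when healthy
        return OK, "OK - Lustre reports healthy"
    return CRITICAL, f"CRITICAL - Lustre health_check reports: {status!r}"


def main():
    """Print the status line and return the exit code for Nagios."""
    code, message = check_lustre_health()
    print(message)
    return code
```

A plugin script would end with `sys.exit(main())` so that Nagios sees the exit code.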
Monitoring – General Software
• Ganglia
– Gathers system metrics (load, memory, disk utilization,
…) and stores the values in RRD files.
– RRD trade-off: files stay a fixed size, but older data rolls off as new samples arrive
– Provides a web interface for these metrics over time (past
2hr, 4hr, day, week, …)
– Ability to group hosts together
– In combination with collectl, can provide usage metrics for
InfiniBand traffic and Lustre
– http://ganglia.sourceforge.net/
– http://collectl.sourceforge.net/Tutorial-Lustre.html
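The fixed-size versus rolloff trade-off of RRD storage can be illustrated with a bounded buffer. This is only a sketch of the idea, not how RRDtool is implemented: a real RRD file also consolidates old samples into coarser archives before discarding them. The class name `RoundRobinArchive` is hypothetical.

```python
from collections import deque


class RoundRobinArchive:
    """Fixed-capacity store: once full, each new sample evicts the oldest."""

    def __init__(self, slots):
        # deque with maxlen drops items from the left when it overflows,
        # so storage never grows beyond `slots` entries.
        self.samples = deque(maxlen=slots)

    def update(self, timestamp, value):
        """Record one sample, rolling off the oldest if the archive is full."""
        self.samples.append((timestamp, value))

    def oldest(self):
        """Return the oldest retained sample, or None if empty."""
        return self.samples[0] if self.samples else None
```

With three slots and five updates, only the last three samples remain: storage is bounded, and the first two samples are gone.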
Monitoring – General Software
• Splunk
– “Operational Intelligence”
– Aggregates machine data, logs, and other user-defined
sources
– From this data, users can run queries. These queries can
be scheduled or turned into reports, alerts, or dashboards
for Splunk’s web interface
– Tiered licensing based on indexed data, including a free
version.
– Useful for generating alerts when Lustre bug signatures appear in syslog
– There are open source alternatives, such as the ELK stack.
– https://www.splunk.com/
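The kind of syslog alerting described above boils down to pattern matching on log lines. The sketch below shows that idea in plain Python rather than in Splunk's query language. It assumes that messages of interest contain the strings "LustreError" or "LBUG"; the pattern and the function name `find_lustre_alerts` are illustrative choices, not a complete list of Lustre bug signatures.

```python
import re

# Assumed markers of interest; extend with site-specific bug signatures.
LUSTRE_PATTERN = re.compile(r"LustreError|LBUG")


def find_lustre_alerts(lines):
    """Return the log lines that match the Lustre alert pattern."""
    return [line.rstrip("\n") for line in lines if LUSTRE_PATTERN.search(line)]
```

Passing an open file works because the function accepts any iterable of lines, for example `find_lustre_alerts(open("/var/log/messages"))` on systems that log there.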
Monitoring – Lustre tools
– The llstat and llobdstat commands provide a watch-like interface for the
various stats files
• llobdstat: /proc/fs/lustre/obdfilter/<ostname>/stats
• llstat: /proc/fs/lustre/mds/MDS/mdt/stats, etc. Appropriate files are listed in the llstat man page
• Example:
[root@sultan-mds1 lustre]# llstat -i 2 lwp/sultan-MDT0000-lwp-MDT0000/stats
/usr/bin/llstat: STATS on 06/08/15 lwp/sultan-MDT0000-lwp-MDT0000/stats on 10.37.248.68@o2ib1
snapshot_time   1433768403.74762
req_waittime    1520
req_active      1520
mds_connect     2
obd_ping        1518

lwp/sultan-MDT0000-lwp-MDT0000/stats @ 1433768405.75033
Name          Cur.Count  Cur.Rate  #Events  Unit    last  min  avg      max    stddev
req_waittime  0          0         1520     [usec]  0     46   144.53   14808  380.72
req_active    0          0         1520     [reqs]  0     1    1.00     1      0.00
mds_connect   0          0         2        [usec]  0     76   7442.00  14808  10417.10
obd_ping      0          0         1518     [usec]  0     46   134.91   426    57.51
^C
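For scripted collection, the stats files that llstat reads can also be parsed directly. The sketch below assumes each counter line has the layout `name count samples [unit] min max sum ...`, with the trailing fields optional, and that the `snapshot_time` line does not follow it. That layout should be verified against the files on the target Lustre version. The function name `parse_stats` and the sample values in the usage note are hypothetical.

```python
def parse_stats(text):
    """Parse Lustre stats text into {counter_name: {field: value}}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep only "<name> <count> samples [unit] ..." lines; this skips
        # snapshot_time and blank lines.
        if len(fields) < 4 or fields[2] != "samples":
            continue
        entry = {"count": int(fields[1]), "unit": fields[3].strip("[]")}
        if len(fields) >= 7:
            entry["min"] = int(fields[4])
            entry["max"] = int(fields[5])
            entry["sum"] = int(fields[6])
            if entry["count"]:
                # Same quantity llstat reports in its "avg" column.
                entry["avg"] = entry["sum"] / entry["count"]
        stats[fields[0]] = entry
    return stats
```

Given the line `req_waittime 4 samples [usec] 10 40 100 3000`, the parser reports a count of 4, a minimum of 10, a maximum of 40, and an average of 25.0.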
Monitoring – Lustre Specific Software
• LMT
– “The Lustre Monitoring Tool (LMT) is a Python-based,
distributed system that provides a top-like display of
activity on server-side nodes”
– LMT uses cerebro (software similar to Ganglia) to pull
statistics from the /proc/ file system into a MySQL
database
• lltop/xltop
– A former TACC staff member, John Hammond, created
several monitoring tools for Lustre file systems
– lltop - Lustre load monitor with batch scheduler integration
– xltop - continuous Lustre load monitor
Summary
Acknowledgements