NCI Intro
Introduction
Accounting
Connecting
UNIX
Job Scheduling
Filesystems
Troubleshooting
Outline
Introduction
Accounting
Connecting
UNIX
Job Scheduling
Filesystems
Troubleshooting
Allocation Schemes
- Partner allocations
- Flagship Projects
- Startup allocation
- Director
NCMAS 15%
CSIRO 21.4%
BOM 18.9%
ANU 17.7%
Flagships 5.0% (including CoECSS, TERN, Astro, CoE Optics)
INTERSECT 3.8%
GA 3.4%
Monash, UNSW, UQ, USyd, Uni Adelaide, 1.7% each
Director's share, QCIF, Deakin, MSI 6.3% in total
Getting Information
- URL http://nci.org.au/
- Detailed usage information
- Raijin Quick Reference Guide
- Detailed software information
- Raijin FAQs
- /g/data FAQs
- Message of the Day (/etc/motd)
- Emergency and Downtime Notices
- NCI help email help@nci.org.au
Cloud
NCI's Cloud services focus around:
Designed for high-speed IO (all SSD disk storage in the cloud).
NCI can offer a high-speed interconnect between the NCI Lustre-based filesystems and NCI Cloud services.
Data Storage
Project leaders (Chief Investigators) fill out online forms with the required details and are given a project ID.
Application process:
- Partner (anytime)
- Merit scheme (once a year, deadline Nov)
- Start-up (anytime, max 5000 SU per year)
- Commercial (anytime)
Project accounting
Default Project
The following command displays the usage of the project in the current quarter against each of the stakeholders funding the project:
nci_account
You can also use -P for another project and -p for a different quarter, e.g.:
nci_account -P c25 -p 2014.q2 -v
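The quarter argument passed to -p can be built from a date rather than typed by hand. The following is a minimal sketch, assuming that quarters are calendar quarters (January to March is q1) and that the label always has the form YYYY.qN seen in the example; the function name quarter_label is hypothetical.

```shell
# Build a quarter label such as 2014.q2 from a year and a month number.
# Assumption: calendar quarters, label format YYYY.qN as in the example.
quarter_label() {
    year=$1
    month=$2
    # Strip one leading zero so that "04" is treated as 4.
    month=${month#0}
    echo "${year}.q$(( (month - 1) / 3 + 1 ))"
}

# Label for today's date.
quarter_label "$(date +%Y)" "$(date +%m)"

# Possible use on raijin (not run here):
#   nci_account -P c25 -p "$(quarter_label 2014 5)" -v
```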
Establish Connection
Caution!
Be sure to log out of xterm sessions and quit the Window Manager before leaving the system.
Connecting to raijin
The hostname of the Fujitsu Primergy Cluster is
raijin.nci.org.au
and can be accessed using the secure shell (ssh) command, for example,
ssh -X abc123@raijin.nci.org.au
Your ssh connection will be to one of six possible login nodes, raijin{1..6}.
(If ssh to raijin fails, you should try specifying one of the nodes, e.g.
raijin3.nci.org.au.)
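The fallback described above can be scripted. This is a sketch only: the function names are hypothetical, the 10 second timeout is an arbitrary choice, and it relies on ssh returning exit status 255 when the connection itself fails.

```shell
# Print the six login node hostnames, raijin1 ... raijin6.
raijin_login_hosts() {
    i=1
    while [ "$i" -le 6 ]; do
        echo "raijin${i}.nci.org.au"
        i=$((i + 1))
    done
}

# Try the generic hostname first, then each login node in turn.
# ssh exits with 255 on a connection error; any other status means a
# session was established, so stop and pass that status back.
connect_raijin() {
    user=$1
    for host in raijin.nci.org.au $(raijin_login_hosts); do
        ssh -X -o ConnectTimeout=10 "${user}@${host}"
        status=$?
        if [ "$status" -ne 255 ]; then
            return "$status"
        fi
    done
    return 255
}

# Usage (not run here): connect_raijin abc123
```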
Caution!
Day-to-day use is strongly discouraged.
This considerably weakens both NCI and home institution system
security. (Instead consider a key with passphrase + ssh-agent on your
workstation.)
Can be useful to support copyq batch jobs:
UNIX environment
The working environment under UNIX is controlled by shells
(command-line interpreters). The shell interprets and executes user
commands.
- The default shell is bash (tcsh is also popular; you may also use ksh).
- The shell can be changed by modifying .rashrc.
- Shell commands can be grouped together into scripts.
- Unix Quick Reference Guide
Note
Unix is case sensitive!
The shell provides environment variables that can be accessed by all
the processes started from the original shell, e.g. the login environment.

Shell       | exec on login and compute nodes | exec on login nodes only | modules
------------+---------------------------------+--------------------------+---------
csh/tcsh    | .cshrc                          | .login                   | .login
sh/bash/ksh | .bashrc                         | .profile                 | .profile

tcsh syntax
setenv VARIABLE value
bash syntax
export VARIABLE=value
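A short illustration of what "accessed by the processes started from the original shell" means in bash syntax. The variable names LOCAL_ONLY and SHARED_VALUE are made up for this sketch; the child process here is simply another sh.

```shell
# A plain shell variable is visible only inside the current shell.
LOCAL_ONLY="not inherited"

# An exported variable is copied into the environment of child processes.
# (tcsh equivalent: setenv SHARED_VALUE inherited)
export SHARED_VALUE="inherited"

# Single quotes make the child shell, not the parent, expand the variables.
sh -c 'echo "LOCAL_ONLY=[${LOCAL_ONLY}] SHARED_VALUE=[${SHARED_VALUE}]"'
```

The child prints an empty value for the variable that was not exported and the assigned value for the exported one.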
Environment Modules
Modules provide a great way to easily customize your shell environment
for different software packages. The module command syntax is the
same no matter which command shell you are using.
Various modules are loaded into your environment at login to provide a
workable environment.
module list
module avail
module show name
module load
module unload
Note
To automate environment customisation at login, module load
commands can be added to the .login (tcsh) or .profile (bash) files.
Users should be aware that different applications can have incompatible
environment requirements so loading multiple application modules in
your dot file may lead to problems. We recommend that modules are
loaded in scripts as needed at runtime and likewise discourage the use of
module commands in shell configuration (dot) files.
More advanced information on modules can be found in the Modules
User Guide.
Editors
Several editors are available:
- vi
- emacs
- nano
If you are not familiar with any of these, you will find that nano has a
simple interface. Just type nano.
Caution!
Use dos2unix if your input/job script files were edited on a Windows
machine.
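To check whether a file actually carries Windows line endings before converting it, something like the following can be used. This is a sketch with hypothetical function names; strip_cr uses tr as a stand-in where dos2unix is not at hand, and note that tr removes every carriage return in the file, not only those directly before a newline.

```shell
# Succeed if the file contains at least one carriage return character.
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Write a copy of file $1 to $2 with all carriage returns removed.
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}

# Usage (file names are placeholders):
#   if has_crlf jobscript.sh; then strip_cr jobscript.sh jobscript.unix.sh; fi
```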
Note
The intel-fc, intel-cc and openmpi modules are loaded by default in
.cshrc (tcsh) or .bashrc (bash).
Note
Job charging is based on the wall clock time used, the number of CPUs
requested and the queue choice.
Queue Limit
normal
express
copyq
walltime
memory (32GB, 64GB, 128GB per node)
disk (jobfs)
number of cpus
-l walltime=20:00:00
-l mem=2GB
-l jobfs=1GB
-l ncpus=16
-l software=xxx (for licenced software)
-l wd (to start the batch job in the working directory from which it was submitted.)
my_program.exe
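Put together, the options above might appear in a job script as #PBS directive lines. The sketch below writes such a script to a file; the function name, the choice of the normal queue and the module load line are assumptions for illustration, not a script taken from the course material.

```shell
# Write a sample PBS job script to the path given as $1.
# Resource values mirror the options listed above.
write_job_script() {
    cat > "$1" <<'EOF'
#!/bin/bash
#PBS -q normal
#PBS -l walltime=20:00:00
#PBS -l mem=2GB
#PBS -l jobfs=1GB
#PBS -l ncpus=16
#PBS -l wd

# Load modules inside the job script rather than in dot files.
module load openmpi

./my_program.exe
EOF
}

# Usage on raijin (not run here):
#   write_job_script example_job.sh
#   qsub example_job.sh
```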
Job Scheduling
Long-running jobs
Caution!
Checkpoint/restart is not a filesystem or PBSPro capability. It must be
implemented by the user or software vendor.
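As a rough idea of what a user-implemented checkpoint can look like, the sketch below stores a step counter in a file and resumes from it on the next run. The function name and arguments are hypothetical, and the "work" is a stand-in; a real application would save its own state and would checkpoint far less often than every step.

```shell
# Run at most $2 steps in this invocation, resuming from the step stored in
# checkpoint file $1, and stop once $3 steps have been completed in total.
# Prints the step reached.
run_with_checkpoint() {
    checkpoint=$1
    steps_this_run=$2
    total_steps=$3
    step=0
    # Resume from the last saved step if a checkpoint exists.
    if [ -f "$checkpoint" ]; then
        step=$(cat "$checkpoint")
    fi
    done_now=0
    while [ "$step" -lt "$total_steps" ] && [ "$done_now" -lt "$steps_this_run" ]; do
        step=$((step + 1))            # stand-in for one unit of real work
        done_now=$((done_now + 1))
        # Write to a temporary file and rename it, so an interrupted write
        # does not leave a truncated checkpoint behind.
        echo "$step" > "$checkpoint.tmp" && mv "$checkpoint.tmp" "$checkpoint"
    done
    echo "$step"
}
```

Each job in a chain of shorter jobs would call the function with the same checkpoint file and continue where the previous job stopped.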
Example script.o123456
============================================================
Resource Usage on 2013-07-20 12:48:04.355160:
JobId: 123456.r-man2
Project: c25
Exit Status: 0 (Linux Signal 0)
Service Units: 0.01
NCPUs Requested: 1
NCPUs Used: 1
CPU Time Used: 00:00:43
Memory Requested: 50mb
Memory Used: 13mb
Vmem Used: 52mb
Walltime requested: 00:10:00
Walltime Used: 00:00:49
jobfs request: 100mb
jobfs used: 1mb
============================================================
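The record above is consistent with a simple charging model of CPUs multiplied by walltime in hours: 1 CPU for 49 seconds is about 0.014, which rounds to the 0.01 shown. The sketch below estimates service units on that assumption. The multiplicative form and the queue weight argument are hypothetical, since the actual per-queue charge rates are not given here.

```shell
# Estimate service units as ncpus * walltime in hours * queue weight.
# Assumption: this formula and any weight value are illustrative only.
estimate_su() {
    ncpus=$1
    walltime_seconds=$2
    queue_weight=$3
    awk -v n="$ncpus" -v s="$walltime_seconds" -v w="$queue_weight" \
        'BEGIN { printf "%.2f\n", n * s / 3600 * w }'
}

estimate_su 1 49 1        # the job above: prints 0.01
estimate_su 16 72000 1    # 16 CPUs for 20 hours: prints 320.00
```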
.o***** file contains the output arising from the script (if not
redirected in the script) and additional information from PBS.
.e***** file contains any error output arising from the script (if not
redirected in the script) and additional information from PBS. For a
successful job it should be empty.
Common errors to look for in the .e***** file:
Caution!
Please use nqstat_anu -a | grep $USER to see the cpu% of your
jobs. An efficient parallel job should be close to 100%.
runjob
This job searches for the first n prime numbers. Feel free to
change the number n, or the PBS resource requests, to see how the
outcome changes.
View the output in the file runjob.o**** and any error messages
in runjob.e**** after the job completes.
Interactive jobs
When running interactive processes on the login nodes, users may see the
following message:
RSS exceeded.user=abc123, pid=12345, cmd=exe,
rss=4028904, rlim=2097152 Killed
Each interactive process you run on the login nodes is subject to a
time limit (30 minutes) and a memory limit (2GB). If you want to run a
longer or more memory-intensive interactive job, please submit an
interactive job.
- The -I option for qsub will result in an interactive shell being
started on the compute nodes once your job starts.
- A submission script cannot be used in this mode; you must provide
all qsub options on the command line.
- To use X windows in an interactive batch job, include the -X option
when submitting your job; this will automatically export the
DISPLAY environment variable.
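The rss and rlim values in the message above appear to be in kilobytes, since 2097152 KB is exactly the 2GB limit. That unit is an inference, not something the message states. A small conversion sketch, with a hypothetical function name:

```shell
# Convert a size in kilobytes to gigabytes with two decimal places.
kb_to_gb() {
    awk -v kb="$1" 'BEGIN { printf "%.2f\n", kb / 1024 / 1024 }'
}

kb_to_gb 2097152   # rlim: prints 2.00
kb_to_gb 4028904   # rss:  prints 3.84

# A process of that size would need an interactive job instead, for example
# (resource values are placeholders, not run here):
#   qsub -I -X -l walltime=01:00:00 -l mem=4GB -l ncpus=1
```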
Filesystems
Things to consider
- Transferring large data files to and from raijin: scp, rsync, filezilla
- Use designated data mover nodes, not interactive login nodes: r-dm.nci.org.au
- How much data do you really need to keep?
- Do you need metadata or a self-describing file format?
- Decide on a structure for archived data before you start.
- Staging in archived data from tape (Offline) to disk before starting jobs.
- Archiving results automatically at the end of batch jobs.
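Filesystem | Purpose                    | Quota             | Backup                              | Availability                        | Time limit
-----------+----------------------------+-------------------+-------------------------------------+-------------------------------------+----------------
/home      |                            | 2GB (user)        | Yes                                 | raijin                              | None
/short     |                            | 72GB (project)    | No                                  | raijin                              | 365 days
/g/data/   |                            | project dependent | No                                  | Global                              | No
$PBS_JOBFS | IO intensive data          |                   | No                                  | Local to node                       | Duration of job
MDSS       | Archiving large data files | 20GB              | 2 copies in two different locations | External access using mdss commands | No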
Note
These limits can be changed on request.
Caution!
/short and /g/data are not backed up, so it is the user's responsibility to
make sure that important files are archived to the MDSS or off-site.
Input/Output Warning
- Lots of small IO to /short (or /home) can be very slow and can
severely impact other jobs on the system.
- Avoid dribbly IO, e.g. writing 2 numbers from your inner loop.
Writing to /short every second is far too often!
- Avoid frequent opening and closing of files (or other file operations).
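The difference between opening a file on every iteration and opening it once can be seen in a shell loop. Both functions below are hypothetical illustrations that write the same two numbers per line; only where the redirection sits differs. Compiled codes would typically go further and buffer many values in memory before each write.

```shell
# Dribbly: the file is opened, appended to and closed on every iteration.
write_dribbly() {
    : > "$1"
    i=1
    while [ "$i" -le "$2" ]; do
        echo "$i $((i * i))" >> "$1"
        i=$((i + 1))
    done
}

# Better: the redirection is on the whole loop, so the file is opened once.
write_batched() {
    i=1
    while [ "$i" -le "$2" ]; do
        echo "$i $((i * i))"
        i=$((i + 1))
    done > "$1"
}

# Usage: write_batched output.dat 1000
```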
Use the lquota and du commands to find how much disk space
you have available in your home, short and gdata directories.
Verify with your neighbour that your file permissions are as expected.
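One way to check and adjust permissions is with the standard ls and chmod commands. This is a generic sketch on a temporary file rather than the exercise's own commands; the helper name mode_string is made up.

```shell
# Print the ten-character mode string (for example -rw-r-----) of a file.
mode_string() {
    ls -ld "$1" | cut -c1-10
}

demo=$(mktemp)
chmod 640 "$demo"          # owner read/write, group read, others nothing
mode_string "$demo"        # prints -rw-r-----
chmod g+w,o-rwx "$demo"    # additionally let the group write
mode_string "$demo"        # prints -rw-rw----
rm -f "$demo"
```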
Caution!
We strongly recommend that you consult with NCI before using ACLs.
The mdss command can be used to get and put data between
the login and copyq nodes of raijin and the MDSS, and also to
list files and directories on the MDSS.
netcp and netmv can be used from within batch jobs.
netcp and netmv can also be used interactively to save you the work of
creating tarfiles and generating mdss commands.
Caution!
Always use -l other=mdss when using mdss commands in copyq. This
ensures that jobs only run when the mdss system is available.
Using /jobfs
- All files are deleted at the end of the job. Copy what you need to /short
or another global filesystem in the job script.
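The copy-back step could be wrapped in a small function called as the last command of the job script. This is a sketch: the function name and the results directory are hypothetical, and it assumes the job-local directory is available as $PBS_JOBFS.

```shell
# Copy everything from the job-local scratch directory $1 into the
# persistent directory $2, creating $2 if needed. Hidden files are included.
save_jobfs_results() {
    scratch_dir=$1
    persistent_dir=$2
    mkdir -p "$persistent_dir" && cp -R "$scratch_dir"/. "$persistent_dir"/
}

# Possible last line of a job script (not run here):
#   save_jobfs_results "$PBS_JOBFS" "/short/$PROJECT/$USER/results"
```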
Check out the output file that this job created on /short and the copy on
the MDSS
cd /short/$PROJECT/$USER
ls -ltr
less save_data.o*
mdss ls $USER
mdss rm -r $USER
Troubleshooting