TinyFat#
Memoryhog and the TinyFat cluster are intended for running serial or moderately parallel (OpenMP) applications that require large amounts of memory in one machine.
Hostnames | # nodes | CPUs and # cores per node | main memory per node | node-local SSD | Slurm partition |
---|---|---|---|---|---|
`memoryhog` | 1 | 2 x Intel Xeon Platinum 8360Y ("Ice Lake"), 72 cores/144 threads @2.4 GHz | 2 TB | n/a | interactively accessible without batch job |
`tf04x` | 3 | 2 x Intel Xeon E5-2680 v4 ("Broadwell"), 28 cores/56 threads @2.4 GHz | 512 GB | 1 TB | `broadwell512` |
`tf05x` | 8 | 2 x Intel Xeon E5-2643 v4 ("Broadwell"), 12 cores/24 threads @3.4 GHz | 256 GB | 1 TB | `broadwell256`, `long256` |
`tf06x`–`tf09x` | 36 | 2 x AMD EPYC 7502 ("Rome", "Zen2"), 64 cores/128 threads | 512 GB | 3.5 TB | `work` |
All nodes have been purchased by specific groups or special projects. These users have priority access and nodes may be reserved exclusively for them.
Access to the machines#
TinyFat is only available to accounts that are part of the "Tier3 Grundversorgung", not to NHR project accounts.
See configuring connection settings or SSH in general for how to set up your SSH connection.
Once configured, the shared frontend node for TinyGPU and TinyFat can be accessed via SSH:
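For example, assuming your SSH settings are configured, the connection to the shared frontend node `tinyx.nhr.fau.de` (see below) is opened with:
ssh tinyx.nhr.fau.de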
Software#
TinyFat runs Ubuntu 20.04 LTS.
All software on NHR@FAU systems, e.g. (commercial) applications, compilers, and libraries, is provided using environment modules. These modules are used to set up a custom environment when working interactively or inside batch jobs.
For available software see:
Most software is centrally installed using Spack. By default, only a subset of packages installed via Spack is shown. To see all installed packages, load the `000-all-spack-pkgs` module.
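A short sketch of how this looks in practice; `module avail` is the standard command for listing available modules:
# show only the default subset of packages
module avail
# make all Spack-installed packages visible
module load 000-all-spack-pkgs
module avail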
You can install software yourself by using the user-spack functionality.
Containers, e.g. Docker, are supported via Apptainer.
Python, conda, conda environments#
Through the `python` module, a Conda installation is available. See our Python documentation for usage, initialization, and working with conda environments.
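A minimal sketch, assuming the `python` module exposes the `conda` command as described; the environment name `myenv` and the Python version are illustrative, and shell initialization is covered in the linked Python documentation:
module load python
conda create -n myenv python=3.11
conda activate myenv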
Compiler#
For a general overview about compilers, optimizations flags, and targeting a certain CPU micro-architecture see the compiler documentation.
The CPU types on the frontend node and in the partitions are different. The compilation flags should be adjusted according to the partition you plan to run your code on. See the following table for details.
On nodes of the `work` partition, non-optimal code might be generated for the AMD processors when Intel compilers are used with `-march=native` or `-xHost`.
Software compiled specifically for Intel processors might not run on the `work` partition, since its nodes have AMD CPUs.
The following table shows the compiler flags for targeting TinyFat's CPUs:
partition | microarchitecture | GCC/LLVM | Intel oneAPI/Classic |
---|---|---|---|
all | Zen2, Broadwell | `-mavx2 -mfma` or `-march=x86-64-v3` | `-mavx2 -mfma` |
`work` | Zen2 | `-march=znver2` | `-mavx2 -mfma` |
`broadwell*`, `long256` | Broadwell | `-march=broadwell` | `-march=broadwell` |
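As an illustration, compiling for the `work` partition (Zen2) with GCC might look as follows; `application.c` is a placeholder source file and `-O3` is a common optimization level:
gcc -O3 -march=znver2 -o application application.c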
Filesystems#
On all front ends and nodes the filesystems `$HOME`, `$HPCVAULT`, and `$WORK` are mounted.
For details see the filesystems documentation.
Node-local SSD `$TMPDIR`#
Data stored on `$TMPDIR` will be deleted when the job ends.
Each cluster node has a local SSD that is reachable under `$TMPDIR`.
For more information on how to use `$TMPDIR` see:
- the general documentation of `$TMPDIR`,
- staging data, e.g. to speed up training,
- sharing data among jobs on a node.
The capacity is 1 TB for `broadwell*` and `long256` partition nodes and 3.5 TB for `work` partition nodes.
The storage space of the SSD is shared among all jobs on a node.
Hence, you might not have access to the full capacity of the SSD.
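A sketch of how staging via `$TMPDIR` could look inside a job script; the directory names under `$WORK` and the application's command-line options are placeholders:
# copy input data to the node-local SSD at the start of the job
cp -r "$WORK/input" "$TMPDIR"
# run the application against the fast local copy
./application --input "$TMPDIR/input" --output "$TMPDIR/output"
# copy results back before the job ends, since $TMPDIR is deleted afterwards
cp -r "$TMPDIR/output" "$WORK/results"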
Batch processing#
Resources are controlled through the batch system Slurm.
The only exception is `memoryhog`, which can be used interactively without a batch job. Every HPC user can log in directly to `memoryhog.rrze.fau.de` to run their memory-intensive workloads.
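For example, an interactive session on `memoryhog` is opened directly via SSH:
ssh memoryhog.rrze.fau.de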
Slurm commands are suffixed with `.tinyfat`#
The front end node `tinyx.nhr.fau.de` serves both the TinyGPU and the TinyFat cluster.
To distinguish which cluster is targeted when a Slurm command is used, Slurm commands for TinyFat have the `.tinyfat` suffix.
This means that instead of the standard Slurm commands, you use:
- `srun.tinyfat` instead of `srun`
- `salloc.tinyfat` instead of `salloc`
- `sbatch.tinyfat` instead of `sbatch`
- `sinfo.tinyfat` instead of `sinfo`
These commands are equivalent to the unsuffixed Slurm commands used with the option `--clusters=tinyfat`.
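For example, the following two calls submit the same job to TinyFat (`job_script.sh` is a placeholder):
sbatch.tinyfat job_script.sh
sbatch --clusters=tinyfat job_script.sh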
When resubmitting jobs from TinyFat's compute nodes themselves, only use `sbatch`, i.e. without the `.tinyfat` suffix.
Partitions#
Only single-node jobs are allowed.
Compute nodes in the `broadwell*` and `long256` partitions are allocated exclusively.
Compute nodes in the `work` partition are shared; however, requested resources are always granted exclusively.
The granularity of batch allocations is individual cores.
For each requested core, 8 GB of main memory are allocated.
If your application needs more memory, use the option `--mem=<memory in MByte>`.
Request a node exclusively by using the `--exclusive` option.
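For illustration, a submission that requests 16 cores on the `work` partition together with 256000 MB of memory instead of the default 16 x 8 GB (the job script name is a placeholder):
sbatch.tinyfat --partition=work --ntasks=16 --mem=256000 job_script.sh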
Partition | min – max walltime | min – max cores | exclusivity | memory per node | Slurm options |
---|---|---|---|---|---|
`work` | 0 – 2:00:00 (1) | 1 – 64 | shared nodes | 512 GB | |
`work` (default) | 0 – 24:00:00 | 1 – 64 | shared nodes | 512 GB | |
`broadwell256` | 0 – 24:00:00 | 12 | exclusive nodes | 256 GB | `-p broadwell256` |
`broadwell512` | 0 – 24:00:00 | 28 | exclusive nodes | 512 GB | `-p broadwell512` |
`long256` | 0 – 60:00:00 | 12 | exclusive nodes | 256 GB | `-p long256` |
(1) nodes reserved for short jobs, assigned automatically
All nodes have SMT (a.k.a. hardware threads or hyper-threading) enabled; by default, only one task per physical core is scheduled.
To use SMT, you have to specify `--hint=multithread`.
See the batch job script examples below.
Using SMT / Hyperthreads#
Most modern architectures offer simultaneous multithreading (SMT), where each physical core of a CPU is split into virtual cores (a.k.a. threads). This technique allows two instruction streams to run in parallel per physical core.
On all TinyFat nodes, SMT is available. When specifying `--cpus-per-task` (e.g. for OpenMP jobs), SMT threads are automatically used. If you do not wish to use SMT threads but only physical cores, add the option `--hint=nomultithread` to `sbatch`, `srun`, or `salloc`, or use `#SBATCH --hint=nomultithread` inside your job script.
Pure MPI jobs automatically do not use SMT threads.
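For example, to allow scheduling on SMT threads when submitting a job script (the script name is a placeholder):
sbatch.tinyfat --hint=multithread job_script.sh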
Interactive jobs#
Interactive jobs can be requested by using `salloc.tinyfat` instead of `sbatch.tinyfat` and specifying the respective options on the command line.
The environment from the calling shell, like loaded modules, will be inherited by the interactive job.
Interactive job (single-core)#
The following will give you an interactive shell on one node with one core and 8 GB RAM dedicated to you for one hour:
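A possible command, relying on the default of one core and 8 GB of memory per requested core described above:
salloc.tinyfat --ntasks=1 --time=01:00:00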
Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!
Interactive job (multiple cores)#
The following will give you an interactive shell on one node with 10 physical cores and 80 GB RAM dedicated to you for one hour:
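A possible command; `--hint=nomultithread` restricts the allocation to physical cores, and the 80 GB result from the default 8 GB per requested core:
salloc.tinyfat --cpus-per-task=10 --hint=nomultithread --time=01:00:00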
Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!
Batch job script examples#
Serial job (single-core)#
In this example, the executable will be run using a single core for a total job walltime of 1 hour.
#!/bin/bash -l
#
#SBATCH --ntasks=1
#SBATCH --time=1:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
./application
MPI parallel job (single-node)#
In this example, the executable will be run using 2 MPI processes for a total job walltime of 6 hours. Each process runs on a physical core; SMT threads are not used.
#!/bin/bash -l
#
#SBATCH --ntasks=2
#SBATCH --partition=work
#SBATCH --time=6:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
srun --mpi=pmi2 ./application
Hybrid MPI/OpenMP (single-node)#
Warning
In recent Slurm versions, the value of `--cpus-per-task` is no longer automatically propagated to `srun`, leading to errors at application start. This value has to be set manually via the variable `SRUN_CPUS_PER_TASK`.
In this example, the executable will be run on one node using 2 MPI processes with 8 OpenMP threads (i.e. one per physical core) for a total job walltime of 6 hours. 16 cores are allocated in total and each OpenMP thread is running on a physical core. Hyperthreads are not used.
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=8
#SBATCH --time=6:00:00
#SBATCH --hint=nomultithread
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# for Slurm version >22.05: cpus-per-task has to be set again for srun
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
srun --mpi=pmi2 ./hybrid_application
OpenMP job#
For more efficient computation, OpenMP threads should be pinned to the compute cores. This can be achieved by setting the following environment variables: `OMP_PLACES=cores` and `OMP_PROC_BIND=true`. For more information, see e.g. the HPC Wiki.
In this example, the executable will be run using 6 OpenMP threads (i.e. one per physical core) for a total job walltime of 4 hours.
#!/bin/bash -l
#
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6
# do not use SMT threads
#SBATCH --hint=nomultithread
#SBATCH --time=4:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
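# pin OpenMP threads to physical cores, as recommended above
export OMP_PLACES=cores
export OMP_PROC_BIND=true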
./application
Attach to a running job#
See the general documentation on batch processing.