User Manual
Revision: 134ac95
Date: Fri Dec 15 2023
©2020 Bright Computing, Inc. All Rights Reserved. This manual or parts thereof may not be reproduced
in any form unless permitted by contract or by written permission of Bright Computing, Inc.
Trademarks
Linux is a registered trademark of Linus Torvalds. PathScale is a registered trademark of Cray, Inc. Red
Hat and all Red Hat-based trademarks are trademarks or registered trademarks of Red Hat, Inc. SUSE is
a registered trademark of Novell, Inc. PGI is a registered trademark of NVIDIA Corporation. FLEXlm is
a registered trademark of Flexera Software, Inc. PBS Professional, PBS Pro, and Green Provisioning are
trademarks of Altair Engineering, Inc. All other trademarks are the property of their respective owners.
1 Introduction 1
1.1 What Is A Beowulf Cluster? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Background And History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Brief Hardware And Software Description . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Brief Network Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Cluster Usage 3
2.1 Login To The Cluster Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Setting Up The User Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Environment Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.1 Available commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.2 Changing The Current Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3.3 Changing The Default Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Compiling Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4.1 Open MPI And Mixing Compilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Using MPI 9
3.1 Interconnects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.1 Gigabit Ethernet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1.2 InfiniBand . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Selecting An MPI implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3 Example MPI Run . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3.1 Compiling And Preparing The Application . . . . . . . . . . . . . . . . . . . . . . . 10
3.3.2 Creating A Machine File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3.3 Running The Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3.4 Hybridization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3.5 Support Thread Levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3.6 Further Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4 Workload Management 17
4.1 What Is A Workload Manager? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2 Why Use A Workload Manager? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.3 How Does A Workload Manager Function? . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.4 Job Submission Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.5 What Do Job Scripts Look Like? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.6 Running Jobs On A Workload Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.7 Running Jobs In Cluster Extension Cloud Nodes Using cmsub . . . . . . . . . . . . . . . . 19
5 Slurm 21
5.1 Loading Slurm Modules And Compiling The Executable . . . . . . . . . . . . . . . . . . . 21
5.2 Running The Executable With salloc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.2.1 Node Allocation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.3 Running The Executable As A Slurm Job Script . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.3.1 Slurm Job Script Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.3.2 Slurm Job Script Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.3.3 Slurm Environment Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.3.4 Submitting The Slurm Job Script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3.5 Checking And Changing Queued Job Status . . . . . . . . . . . . . . . . . . . . . . 26
6 SGE 27
6.1 Writing A Job Script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.1.1 Directives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.1.2 SGE Environment Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.1.3 Job Script Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.1.4 The Executable Line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.1.5 Job Script Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.2 Submitting A Job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.2.1 Submitting To A Specific Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.2.2 Queue Assignment Required For cm-scale . . . . . . . . . . . . . . . . . . . . . . 32
6.3 Monitoring A Job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.4 Deleting A Job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
8 Using GPUs 47
8.1 Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
8.2 Using CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
8.3 Using OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
8.4 Compiling Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
8.5 Available Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
8.5.1 CUDA gdb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
8.5.2 nvidia-smi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
9 Using MICs 53
9.1 Compiling Code In Native Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
9.1.1 Using The GNU Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
9.1.2 Using Intel Compilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
9.2 Compiling Code In Offload Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
9.3 Using MIC With Workload Managers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
9.3.1 Using MIC Cards With Slurm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
9.3.2 Using MIC Cards With PBS Pro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
9.3.3 Using MIC Cards With TORQUE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
9.3.4 Using MIC Cards With SGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
9.3.5 Using MIC Cards With openlava . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
10 Using Kubernetes 59
10.1 Introduction To Kubernetes Running Via Bright Cluster Manager . . . . . . . . . . . . . . 59
10.2 Kubernetes Quickstarts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
10.2.1 Quickstart: Connecting From A Local Machine . . . . . . . . . . . . . . . . . . . . . 60
10.2.2 Quickstart: Submitting Batch Jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
10.2.3 Quickstart: Persistent Storage For Kubernetes Using Ceph . . . . . . . . . . . . . . 62
10.2.4 Quickstart: Helm, The Kubernetes Package Manager . . . . . . . . . . . . . . . . . 63
11 Using Singularity 65
11.1 How To Build A Simple Container Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
11.2 Using MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
11.3 Using A Container Image With Workload Managers . . . . . . . . . . . . . . . . . . . . . . 69
11.4 Using the singularity Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
12 User Portal 71
12.1 Overview Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
12.2 Workload Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
12.3 Nodes Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
12.4 Hadoop Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
12.5 OpenStack Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
12.6 Kubernetes Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
12.7 Charts Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
15 Using OpenStack 89
15.1 User Access To OpenStack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
15.2 Getting A User Instance Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
15.2.1 Making An Image Available In OpenStack . . . . . . . . . . . . . . . . . . . . . . . 90
15.2.2 Creating The Networking Components For The OpenStack Image To Be Launched 91
15.2.3 Accessing The Instance Remotely With A Floating IP Address . . . . . . . . . . . . 94
• enough local memory—memory contained in a single node—to deal with the processes passed on
to the node
Nodes are configured and controlled by the head node, and do only what they are told to do. One
of the main differences between Beowulf and a Cluster of Workstations (COW) is the fact that Beowulf
behaves more like a single machine than like many workstations. In most cases, the nodes do not
have keyboards or monitors, and are accessed only via remote login or possibly serial terminal. Beowulf
nodes can be thought of as a CPU + memory package which can be plugged into the cluster, just like
a CPU or memory module can be plugged into a motherboard to form a larger and more powerful
machine. A significant difference is that the nodes of a cluster have a relatively slower interconnect.
The login node is used to compile software, to submit a parallel or batch program to a job queuing
system and to gather/analyze results. Therefore, it should rarely be necessary for a user to log on to
one of the nodes and in some cases node logins are disabled altogether. The head, login and compute
nodes usually communicate with each other through a gigabit Ethernet network, capable of transmitting
information at a maximum rate of 1000 Mbps. In some clusters 10 gigabit Ethernet (10GE, 10GBE, or
10GigE) is used, capable of up to 10 Gbps rates.
Sometimes an additional network is used by the cluster for even faster communication between the
compute nodes. This particular network is mainly used for programs dedicated to solving large scale
computational problems, which may require multiple machines and could involve the exchange of vast
amounts of information. One such network topology is InfiniBand, commonly capable of transmitting
information at a maximum effective data rate of about 124Gbps and about 1.2µs end-to-end latency on
small packets, for clusters in 2013. The commonly available maximum transmission rates will increase
over the years as the technology advances.
Applications relying on message passing benefit greatly from lower latency. The fast network is
usually complementary to a slower Ethernet-based network.
To carry out an ssh login to the cluster, a terminal session can be started from Unix-like operating
systems:
Example
$ ssh myname@cluster.hostname
On a Windows operating system, an SSH client such as PuTTY (http://www.putty.org) can be
downloaded. Another standard possibility is to run a Unix-like environment such as Cygwin
(http://www.cygwin.com) within the Windows operating system, and then run the SSH client from within it.
A Mac OS X user can use the Terminal application from the Finder, or under
Applications/Utilities/Terminal.app. X11 must be installed from the Mac OS X medium, or
alternatively, XQuartz can be used instead. XQuartz is an alternative to the official X11 package, and is
usually more up-to-date and less buggy.
When using the SSH connection, the cluster’s address must be added. When the connection is made,
a username and password must be entered at the prompt.
If the administrator has changed the default SSH port from 22 to something else, the port can be
specified with the -p <port> option:
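For instance (with X11-forwarding, and <port> standing in for the administrator-assigned port number):
$ ssh -X -p <port> myname@cluster.hostname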
The -X option can be dropped if no X11-forwarding is required. X11-forwarding allows a GUI appli-
cation from the cluster to be displayed locally.
Optionally, after logging in, the password used can be changed using the passwd command:
$ passwd
• the prompt can be changed to indicate the current username, host, and directory, for example: by
setting the prompt string variable:
PS1="[\u@\h:\w ] $"
• the size of the command history file can be increased, for example: export HISTSIZE=100
• aliases can be added for frequently used command sequences, for example: alias lart=’ls
-alrt’
• the location of software packages and versions that are to be used by a user (the path to a package)
can be set.
Because there is a huge choice of software packages and versions, it can be hard to set up the right
environment variables and paths for software that is to be used. Collisions between different versions
of the same package and non-matching dependencies on other packages must also be avoided. To
make setting up the environment easier, Bright Cluster Manager provides preconfigured environment
modules (section 2.3).
Switches:
-H|--help this usage info
-V|--version modules version & configuration options
-f|--force force active dependency resolution
-t|--terse terse format avail and list format
-l|--long long format avail and list format
-h|--human readable format avail and list format
-v|--verbose enable verbose messages
-s|--silent disable verbose messages
-c|--create create caches for avail and apropos
Modules can be loaded using the add or load options. Multiple modules can be added in one
command by separating them with spaces:
$ module add shared gcc openmpi/gcc
The shared module is special. If it is to be loaded, it is usually placed first in a list before the
other modules, because the other modules often depend on it. The shared module is described further
shortly.
The “module avail” command lists all modules that are available for loading (some output elided):
Example
[fred@bright81 ~]$ module avail
• local modules, which are specific to the node, or head node only
• shared modules, which are made available from a shared storage, and which only become avail-
able for loading after the shared module is loaded.
The shared module is obviously a useful local module, and is therefore usually configured to be
loaded for the user by default.
Although version numbers are shown in the “module avail” output, it is not necessary to specify
version numbers, unless multiple versions are available for a module. (A note on version sorting: the
sorting is alphabetical, which can be an issue when versions move from, say, 9, to 10. For example, the
following is sorted in alphabetical order: v1 v10 v11 v12 v13 v2 v3 v4 v5 v6 v7 v8 v9.)
To remove one or more modules, the “module unload” or “module rm” command is used.
To remove all modules from the user’s environment, the “module purge” command is used.
The user should be aware that some loaded modules can conflict with others loaded at the same time.
For example, loading openmpi/gcc/64/ without removing an already loaded openmpi/open64/64/
can result in confusion about what compiler opencc is meant to use.
• module initclear: clear all modules from the list of modules loaded initially
Example
$ module initclear
$ module initlist
bash initialization file $HOME/.bashrc loads modules:
null
$ module initadd shared gcc/4.8.1 openmpi/gcc sge
$ module initlist
bash initialization file $HOME/.bashrc loads modules:
null shared gcc/4.8.1 openmpi/gcc/64/1.6.5 sge/2011.11p1
In the preceding example, the newly defined initial state module environment for the user is loaded
from the next login onwards.
If the user is unsure about what the module does, it can be checked using “module whatis”:
GNU compilers are the de facto standard on Linux and are installed by default. They are provided
under the terms of the GNU General Public License. Commercial compilers by Portland and Intel are
available as packages via the Bright Cluster Manager YUM repository, and require the purchase of a
license to use them. To make a compiler available to be used in a user’s shell commands, the appropriate
environment module (section 2.3) must be loaded first. On most clusters two versions of GCC are
available:
1. The version of GCC that comes along with the Linux distribution. For example, for CentOS 6.x:
Example
2. The latest version suitable for general use that is packaged as a module by Bright Computing:
Example
To use the latest version of GCC, the gcc module must be loaded. To revert to the version of GCC
that comes natively with the Linux distribution, the gcc module must be unloaded.
The compilers in the preceding table are ordinarily used for applications that run on a single node.
However, the applications used may fork, thread, and run across as many nodes and processors as they
can access if the application is designed that way.
The standard, structured way of running applications in parallel is to use the MPI-based libraries,
which link to the underlying compilers in the preceding table. The underlying compilers are automati-
cally made available after choosing the parallel environment (MPICH, MVAPICH, Open MPI, etc.) via
the following compiler commands:
Variables that may be set are OMPI_CC, OMPI_FC, OMPI_F77, and OMPI_CXX. More on overriding
the Open MPI wrapper settings is documented in the man pages of mpicc in the environment section.
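For instance, a sketch of making the mpicc wrapper call the Intel C compiler instead of its default back end (assuming icc is installed and its environment module is loaded):
$ export OMPI_CC=icc
$ mpicc hello.c -o hello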
Example
openmpi-geib-gcc-64-1.10.7-478_cm8.1.x86_64
implies: Open MPI version 1.10.7, compiled for both Gigabit Ethernet (ge) and InfiniBand (ib), with
the GCC (gcc) compiler for a 64-bit architecture, packaged as a (Bright) cluster manager (cm) package,
for version 8.1 of Bright Cluster Manager, for the x86_64 architecture.
3.1 Interconnects
Jobs can use particular networks for inter-node communication.
3.1.2 InfiniBand
InfiniBand is a high-performance switched fabric which is characterized by its high throughput and low
latency. Open MPI, MVAPICH and MVAPICH2 are suitable MPI implementations for InfiniBand.
• mpich/ge/<compiler>
• mvapich/<compiler>
• mvapich2/<compiler>
• openmpi/<compiler>
After the appropriate MPI module has been added to the user environment, the user can start com-
piling applications. The mpich and openmpi implementations may be used on Ethernet. On Infini-
Band, mvapich, mvapich2 and openmpi may be used. Open MPI’s openmpi implementation will
first attempt to use InfiniBand, but will revert to Ethernet if InfiniBand is not available.
Depending on the libraries and compilers installed on the system, the availability of these packages
might differ. To see a full list on the system the command “module avail” can be typed.
The a.out binary that is created can then be executed using the mpirun command (section 3.3.3).
• Listing the same node several times to indicate that more than one process should be started on
each node:
node001
node001
node002
node002
• Listing nodes once, but with a suffix for the number of CPU cores to use on each node:
node001:2
node002:2
For example, the following “hello world” program, hello.c, prints a message from every process:
/* hello.c: each MPI process prints its rank, the total number of
   processes, and the name of its host */
#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[])
{
  int id, np, i;
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int processor_name_len;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &np);
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Get_processor_name(processor_name, &processor_name_len);
  for(i=1;i<2;i++)
  {printf(
  "Hello world from process %03d out of %03d, processor name %s\n",
  id, np, processor_name
  );}
  MPI_Finalize();
  return 0;
}
[fred@bright81 ~]$ module add openmpi/gcc #or as appropriate
[fred@bright81 ~]$ mpicc hello.c -o hello
However, it still runs on a single processor unless it is submitted to the system in a special way.
Example
Supposing the .bashrc loads two MPI stacks—the mpich stack, followed by the Open MPI stack—
then that can cause errors because the compute node may use parts of the wrong MPI implementation.
The environment of the user from the interactive shell prompt is not normally carried over auto-
matically to the compute nodes during an mpirun submission. That is, compiling and running the
executable will normally work only on the local node without special treatment. To have the exe-
cutable run on the compute nodes, the right environment modules for the job must be made available
on the compute nodes too, as part of the user login process to the compute nodes for that job. Usu-
ally the system administrator takes care of such matters in the default user configuration by setting up
the default user environment (section 2.3.3), with reasonable initrm and initadd options. Users are
then typically allowed to set up their personal default overrides to the default administrator settings, by
placing their own initrm and initadd options to the module command according to their needs.
Running mpirun outside a workload manager: When using mpirun manually, outside a workload
manager environment, the number of processes (-np) as well as the number of hosts (-machinefile)
should be specified. For example, on a cluster with 2 compute-nodes and a machine file as specified in
section 3.3.2:
Example
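Such a run might look like the following (assuming the machine file is saved as mpirun.hosts and the compiled binary is named hello):
$ mpirun -np 4 -machinefile mpirun.hosts ./hello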
The output of the preceding program is actually printed in random order. This can be modified as
follows, so that only process 0 prints to the standard output, and other processes communicate their
output to process 0:
#include "mpi.h"
#include "string.h"
#include <stdio.h>
if ( myrank == 0 ) {
printf( "%s\n", greeting );
for ( i = 1; i < numprocs; i++ ) {
MPI_Recv( greeting, sizeof( greeting ), MPI_CHAR,
i, 1, MPI_COMM_WORLD, &status );
printf( "%s\n", greeting );
}
}
else {
MPI_Send( greeting, strlen( greeting ) + 1, MPI_CHAR,
0, 1, MPI_COMM_WORLD );
}
MPI_Finalize( );
return 0;
}
Running the executable with mpirun outside the workload manager as shown does not take the
resources of the cluster into account. To handle running jobs with cluster resources is of course what
workload managers such as Slurm are designed to do. Workload managers also typically take care of
what environment modules should be loaded on the compute nodes for a job, via additions that the user
makes to a job script.
Running an application through a workload manager via a job script is introduced in Chapter 4.
Appendix A contains a number of simple MPI programs.
3.3.4 Hybridization
OpenMP is an implementation of multi-threading. This is a method of parallelizing whereby a par-
ent thread—a series of instructions executed consecutively—forks a specified number of child threads,
and a task is divided among them. The threads then run concurrently, with the runtime environment
allocating threads to different processors and accessing the shared memory of an SMP system.
MPI can be mixed with OpenMP to achieve high performance on a cluster/supercomputer of multi-
core nodes or servers. MPI creates processes that reside at the level of the node, while OpenMP forks
threads at the level of a core within an SMP node. Each process executes a portion of the overall
computation, while inside each process, a team of threads is created through OpenMP directives to
further divide the problem. This kind of execution makes sense due to:
• OpenMP might not require copies of data structures, which allows for designs that overlap compu-
tation and communication
• overcoming the limits of parallelism within the SMP node is of course still possible by using the
power of other nodes via MPI.
Example
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
To specify the number of OpenMP threads per MPI task the environment variable OMP_NUM_THREADS
must be set.
Example
[fred@bright81 ~]$ export OMP_NUM_THREADS=3
The number of threads specified by the variable can then be run over the hosts specified by the
mpirun.hosts file:
[fred@bright81 ~]$ mpirun -np 2 -hostfile mpirun.hosts ./hybridhello
Hello I am Processor 0 on node001 of 2
Hello I am Processor 1 on node002 of 2
Hybrid Hello World: I am thread # 0 out of 3
Hybrid Hello World: I am thread # 2 out of 3
Hybrid Hello World: I am thread # 1 out of 3
Hybrid Hello World: I am thread # 0 out of 3
Hybrid Hello World: I am thread # 2 out of 3
Hybrid Hello World: I am thread # 1 out of 3
• Less domain decomposition, which can help with load balancing as well as allowing for larger
messages and fewer tasks participating in MPI collective operations.
• OpenMP is a standard, so any modifications introduced into an application are portable and ap-
pear as comments on systems not using OpenMP.
• By adding annotations to existing code and using a compiler option, it is possible to add OpenMP
to a code somewhat incrementally, almost on a loop-by-loop basis. The vector loops in a code that
vectorize well are good candidates for OpenMP.
• In some cases a serial portion may be essential, which can inhibit performance.
• In most MPI codes, synchronization is implicit and happens when messages are sent and received.
However, with OpenMP, much synchronization must be added to the code explicitly. The pro-
grammer must also explicitly determine which variables can be shared among threads and which
ones cannot (parallel scoping). OpenMP codes that have errors introduced by incomplete or mis-
placed synchronization or improper scoping can be difficult to debug because the error can intro-
duce race conditions which cause the error to happen only intermittently.
• Trying out various compilers and compiler flags, and finding out which options are best for par-
ticular applications.
• Changing the default MPI rank ordering. This is a simple, yet sometimes effective, runtime tuning
option that requires no source code modification, recompilation or re-linking. The default MPI
rank placement on the compute nodes is SMP style. However, other choices are round-robin,
folded rank, and custom ranking.
• Using fewer cores per node is helpful when more memory per process than the default is needed.
Having fewer processes to share the memory and interconnect bandwidth is also helpful in this
case. For NUMA nodes, extra care must be taken.
• Hybrid MPI/OpenMP reduces the memory footprint. Overlapping communication with compu-
tation in hybrid MPI/OpenMP can be considered.
• Some applications may perform better when large memory pages are used.
• directives for the workload manager to request resources, control the output, set email addresses
for messages to go to
When running a job script, the workload manager is normally responsible for generating a machine
file based on the requested number of processor cores (np), as well as being responsible for the allocation of
any other requested resources.
The executable submission line in a job script is the line where the job is submitted to the workload
manager. This can take various forms.
Example
For the Slurm workload manager, the line might look like:
srun --mpi=mpich1_p4 ./a.out
Example
Example
• Slurm (Chapter 5)
• SGE (Chapter 6)
Example
$ cat myscript1
#!/bin/sh
hostname
$ cmsub myscript1
Submitting job: myscript1(slurm-2) [slurm:2] ... OK
All cmsub command line options can also be specified in a job-directive style format in the job script
itself, using the “#CMSUB” tag to indicate an option.
Example
$ cat myscript2
#!/bin/sh
#CMSUB --input-list=/home/user/myjob.in
#CMSUB --output-list=/home/user/myjob.out
#CMSUB --remote-output-list=/home/user/file-which-will-be-created
#CMSUB --input=/home/user/onemoreinput.dat
#CMSUB --input=/home/user/myexec
myexec
$ cmsub myscript2
Submitting job: myscript2(slurm-2) [slurm:2] ... OK
• Create a job script that sets the resources for the script/executable
The details of Slurm usage depend upon the MPI implementation used. The description in this
chapter covers usage with Open MPI, which is quite standard. Slurm documentation can be consulted
(http://slurm.schedmd.com/mpi_guide.html) if the MPI implementation the user is using is
very different.
Example
The “hello world” executable from section 3.3.3 can then be compiled and run for one task outside
the workload manager, on the local host, as:
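(a sketch, assuming the gcc-flavored Open MPI module from section 5.1 is loaded)
[fred@bright81 ~]$ mpicc hello.c -o hello
[fred@bright81 ~]$ mpirun -np 1 hello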
Slurm is more typically run as a batch job (section 5.3). However execution via salloc uses the
same options, and it is more convenient as an introduction because of its interactive behavior.
In a default Bright Cluster Manager configuration, Slurm auto-detects the cores available and by
default spreads the tasks across the cores that are part of the allocation request.
How Slurm spreads the executable across nodes is typically determined by the options in the
following table:
Short Long
Option Option Description
The full options list and syntax for salloc can be viewed with “man salloc”.
The requirements specified by the options to salloc must be met before the executable is allowed to run.
So, for example, if --nodes=4 and the cluster only has 3 nodes, then the executable does not run.
Default settings: The hello MPI executable with default settings of Slurm runs successfully over the
first (and in this case, the only) node that it finds:
The preceding output also displays if -N1 (indicating 1 node) is specified, or if -n4 (indicating 4 tasks)
is specified.
The node and task allocation is almost certainly not going to be done by relying on defaults. Instead,
node specifications are supplied to Slurm along with the executable.
To understand Slurm node specifications, the following cases consider and explain where the node
specification is valid and invalid.
Number of nodes requested: The value assigned to the -N|--nodes= option is the number of nodes
from the cluster that is requested for allocation for the executable. In the current cluster example it can
only be 1. For a cluster with, for example, 1000 nodes, it could be a number up to 1000.
A resource allocation request for 2 nodes with the --nodes option causes an error on the current
1-node cluster example:
[fred@bright81 ~]$ salloc -N2 mpirun hello
salloc: error: Failed to allocate resources: Node count specification invalid
salloc: Relinquishing job allocation 573
Number of tasks requested per cluster: The value assigned to the -n|--ntasks option is the num-
ber of tasks that are requested for allocation from the cluster for the executable. In the current cluster
example, it can be 1 to 4 tasks. The default resources available on a cluster are the number of available
processor cores.
A resource allocation request for 5 tasks with the --ntasks option causes an error because it exceeds
the default resources available on the 4-core cluster:
Adding and configuring just one more node to the current cluster would allow the resource alloca-
tion to succeed, since an added node would provide at least one more processor to the cluster.
Number of tasks requested per node: The value assigned to the --ntasks-per-node option is
the number of tasks that are requested for allocation from each node on the cluster. In the cur-
rent cluster example, it can be 1 to 4 tasks. A resource allocation request for 5 tasks per node with
--ntasks-per-node fails on this 4-core cluster, giving an output like:
Adding and configuring another 4-core node to the current cluster would still not allow resource
allocation to succeed, because the request is for at least 5 cores per node, rather than per cluster.
Restricting the number of tasks that can run per node: A resource allocation request for 2 tasks per
node with the --ntasks-per-node option, and simultaneously an allocation request for 1 task to run
on the cluster using the --ntasks option, runs successfully, although it uselessly ties up resources for
1 task per node:
The other way round, that is, a resource allocation request for 1 task per node with the
--ntasks-per-node option, and simultaneously an allocation request for 2 tasks to run on the cluster
using the --ntasks option, fails because on the 1-node cluster, only 1 task can be allocated resources
on the single node, while resources for 2 tasks are being asked for on the cluster:
application execution line: execution of the MPI application using sbatch, the Slurm submission
wrapper.
In SBATCH lines, “#SBATCH” is used to submit options. The various meanings of lines starting with
“#” are:
After the Slurm job script is run with the sbatch command (Section 5.3.4), the output goes into file
my.stdout, as specified by the “-o” option.
If the output file is not specified, then the file takes a name of the form “slurm-<jobnumber>.out”,
where <jobnumber> is a number starting from 1.
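A minimal job script sketch along these lines (the module names, option values, and the hello executable are illustrative):
#!/bin/sh
# send output to my.stdout, run 4 tasks on 1 node, 30-minute time limit
#SBATCH -o my.stdout
#SBATCH --time=30
#SBATCH -N 1
#SBATCH -n 4
module add shared openmpi/gcc slurm
mpirun hello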
The command “sbatch --usage” lists possible options that can be used on the command line or
in the job script. Command line values override script-provided values.
Directives are used to specify the resource allocation for a job so that Slurm can manage the job
optimally. Available options and their descriptions can be seen with the output of sbatch --help.
The more concise usage output from sbatch --usage may also be helpful.
Some of the more useful ones are listed in the following table:
Typically, end users use SLURM_PROCID in a program so that the input to a parallel calculation de-
pends on it. The calculation is thus spread across processors according to the assigned SLURM_PROCID,
so that each processor handles the parallel part of the calculation with different values.
More information on environment variables is also to be found in the man page for sbatch.
Queues in Slurm terminology are called “partitions”. Slurm has a default queue called defq. The
administrator may have removed this or created others.
If a particular queue is to be used, this is typically set in the job script using the -p or --partition
option:
#SBATCH --partition=bitcoinsq
It can also be specified as an option to the sbatch command during submission to Slurm.
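For example (myscript.sh is a placeholder job script name):
$ sbatch --partition=bitcoinsq myscript.sh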
The parameter that should be changed is “EligibleTime”, which can be done as follows:
[fred@bright81 ~]$ scontrol update jobid=254 EligibleTime=2011-10-18T22:00:00
An approximate GUI Slurm equivalent to scontrol is the sview tool. This allows the job to be
viewed under its jobs tab, and the job to be edited with a right click menu item. It can also carry out
many other functions, including canceling a job.
Web-browser-accessible job viewing is possible from the workload tab of the User Portal (sec-
tion 12.2).
#!/bin/bash
#$ Script options # Optional script directives
shell commands # Optional shell commands
application # Application itself
6.1.1 Directives
It is possible to specify options ('directives') to SGE by using “#$” in the script. The difference in the
meaning of lines that start with the “#” character in the job script file should be noted:
#$ {option} {parameter}
Available options and their descriptions can be seen with the output of qsub -help:
Table 6.1.3: SGE Job Script Options
Option and parameter Description
-a date_time request a start time
-ac context_list add context variables
-ar ar_id bind job to advance reservation
-A account_string account string in accounting record
-b y[es]|n[o] handle command as binary
-binding [env|pe|set] exp|lin|str binds job to processor cores
-c ckpt_selector define type of checkpointing for job
-ckpt ckpt-name request checkpoint method
-clear skip previous definitions for job
-cwd use current working directory
-C directive_prefix define command prefix for job script
-dc simple_context_list delete context variable(s)
-dl date_time request a deadline initiation time
-e path_list specify standard error stream path(s)
-h place user hold on job
-hard consider following requests "hard"
-help print this help
-hold_jid job_identifier_list define jobnet interdependencies
-hold_jid_ad job_identifier_list define jobnet array interdependencies
-i file_list specify standard input stream file(s)
-j y[es]|n[o] merge stdout and stderr stream of job
-js job_share share tree or functional job share
-jsv jsv_url job submission verification script to be used
-l resource_list request the given resources
-m mail_options define mail notification events
-masterq wc_queue_list bind master task to queue(s)
...continued
More detail on these options and their use is found in the man page for qsub.
To use it, the user simply runs the executable line using the cm-launcher wrapper before the mpirun
job-launcher command:
The wrapper tracks processes that the workload manager launches. When it sees processes that the
workload manager is unable to clean up after a job is over, it carries out the cleanup instead. Using
cm-launcher is recommended if jobs that do not get cleaned up correctly are an issue for the user or
administrator.
#!/bin/sh
#
# Your job name
#$ -N My_Job
#
# Use current working directory
#$ -cwd
#
# Join stdout and stderr
#$ -j y
#
# pe (Parallel environment) request. Set your number of requested slots here.
#$ -pe mpich 2
#
# Run job through bash shell
#$ -S /bin/bash
# The following output will show in the output file. Used for debugging.
The number of available slots can be set by the administrator to an arbitrary value. However, it is
typically set so that it matches the number of cores in the cluster, for efficiency reasons. More slots can
be set than cores, but most administrators prefer not to do that.
In a job script, the user can request slots from the available slots. Requesting multiple slots therefore
typically means requesting multiple cores. In the case of an environment that is not set up as a parallel
environment, the request for slots is done with the -np option. For jobs that run in a parallel environ-
ment, the -pe option is used. Mixed jobs, running in both non-MPI and parallel environments are also
possible if the administrator has allowed it in the complex attribute slots settings.
Whether the request is granted is decided by the workload manager policy set by the administrator.
If the request exceeds the number of available slots, then the request is not granted.
If the administrator has configured the cluster to use cloud computing with cm-scale (section 7.9.2
of the Administrator Manual), then the total number of slots available to a cluster changes over time
automatically, as nodes are started up and stopped dynamically.
With SGE a job can be submitted with qsub. The qsub command has the following syntax:
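In outline (a sketch; the options are those listed in the job script options table of section 6.1.3):
qsub [options] <jobscript>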
After completion (either successful or not), output is put in the user’s current directory, appended
with the job number which is assigned by SGE. By default, there is an error and an output file.
myapp.e#{JOBID}
myapp.o#{JOBID}
qstat -g c
CLUSTER QUEUE CQLOAD USED RES AVAIL TOTAL aoACDS cdsuE
-----------------------------------------------------------------
long.q 0.01 0 0 144 288 0 144
default.q 0.01 0 0 144 288 0 144
$ qstat
job-ID prior name user state submit/start at queue slots
-----------------------------------------------------------------------
249 0.00000 Sleeper1 root qw 12/03/2008 07:29:00 1
250 0.00000 Sleeper1 root qw 12/03/2008 07:29:01 1
251 0.00000 Sleeper1 root qw 12/03/2008 07:29:02 1
252 0.00000 Sleeper1 root qw 12/03/2008 07:29:02 1
253 0.00000 Sleeper1 root qw 12/03/2008 07:29:03 1
More details are visible when using the -f (for full) option:
• The used/tot or used/free column is the count of used/free slots in the queue.
$ qstat -f
queuename qtype used/tot. load_avg arch states
-----------------------------------------------------------------------
all.q@node001.cm.cluster BI 0/16 -NA- lx26-amd64 au
-----------------------------------------------------------------------
all.q@node002.cm.cluster BI 0/16 -NA- lx26-amd64 au
########################################################################
PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
########################################################################
249 0.55500 Sleeper1 root qw 12/03/2008 07:29:00 1
250 0.55500 Sleeper1 root qw 12/03/2008 07:29:01 1
• d(eletion)
• E(rror)
• h(old)
• r(unning)
• R(estarted)
• s(uspended)
• S(uspended)
• t(ransfering)
• T(hreshold)
• w(aiting)
• S(ubordinate)
• E(rror) - sge_execd was unable to locate the sge_shepherd - use qmod to fix it.
By default the qstat command shows only jobs belonging to the current user, i.e. the command is
executed with the option -u $user. To see jobs from other users too, the following format is used:
$ qstat -u "*"
$ qdel <jobid>
The job-id is the number assigned by SGE when the job is submitted using qsub. Only jobs belonging
to the logged-in user can be deleted. Using qdel will delete a user’s job regardless of whether the job is
running or in the queue.
• Creating the job script, adding directives, applications, runtime parameters, and application-specific
variables to the script
This chapter covers using the workload managers and job scripts with the PBS variants so that
users can get a basic understanding of how they are used, and can get started with typical cluster usage.
In this chapter:
• section 7.1 covers the components of a job script and job script examples
• section 7.2.1 covers submitting, monitoring, and deleting a job with a job script
More depth on using these workload managers is to be found in the PBS Professional User Guide
and in the online Torque documentation at http://www.adaptivecomputing.com/resources/
docs/.
and other specifications for the job. After preparation, the job script is submitted to the workload man-
ager using the qsub command. The workload manager then tries to make the job run according to the
job script specifications.
A job script can be resubmitted with different parameters (e.g. different sets of data or variables).
Example
#!/bin/bash
#
#PBS -l walltime=1:00:00
#PBS -l nodes=4
#PBS -l mem=500mb
#PBS -j oe
cd ${HOME}/myprogs
mpirun myprog a b c
The first line is the standard “shebang” line used for scripts.
The lines that start with #PBS are PBS directive lines, described shortly in section 7.1.2.
The last two lines are an example of setting remaining options or configuration settings up for the
script to run. In this case, a change to the directory myprogs is made, and then the executable myprog
is run with arguments a b c. The line that runs the program is called the executable line (sec-
tion 7.1.3).
To run the executable file in the executable line in parallel, the job launcher mpirun is placed imme-
diately before the executable file. The number of nodes the parallel job is to run on is assumed to have
been specified in the PBS directives.
7.1.2 Directives
Job Script Directives And qsub Options
A job script typically has several configurable values called job script directives, set with job script
directive lines. These are lines that start with a “#PBS”. Any directive lines beyond the first executable
line are ignored.
The lines are comments as far as the shell is concerned because they start with a “#”. However, at
the same time the lines are special commands when the job script is processed by the qsub command.
The difference is illustrated by the following:
• The following shell comment is only a comment for a job script processed by qsub:
# PBS
• The following shell comment is also a job script directive when processed by qsub:
#PBS
Job script directive lines with the “#PBS ” part removed are the same as options applied to the qsub
command, so a look at the man pages of qsub describes the possible directives and how they are used.
If there is both a job script directive and a qsub command option set for the same item, the qsub option
takes precedence.
Since the job script file is a shell script, the shell interpreter used can be changed to another shell
interpreter by modifying the first line (the “#!” line) to the preferred shell. Any shell specified by the
first line can also be overridden by using the “#PBS -S” directive to set the shell path.
Walltime Directive
The workload manager typically has default walltime limits per queue with a value limit set by the
administrator. The user sets walltime limit by setting the ”#PBS -l walltime” directive to a specific
time. The time specified is the maximum time that the user expects the job should run for, and it allows
the workload manager to work out an optimum time to run the job. The job can then run sooner than it
would by default.
If the walltime limit is exceeded by a job, then the job is stopped, and an error message like the fol-
lowing is displayed:
For PBS Pro v11 this also works, but is deprecated, and the form “#PBS -l select=8” is recom-
mended instead.
Further examples of node resource specification are given in a table on page 38.
Job Queues
Sending a job to a particular job queue is sometimes appropriate. An administrator may have set
queues up so that some queues are for very long term jobs, or some queues are for users that require
GPUs. Submitting a job to a particular queue <destination> is done by using the directive “#PBS -q
<destination>”.
Directives Summary
A summary of the job directives covered, with a few extras, are shown in the following table:
Some of the examples illustrate requests for GPU resource usage. GPUs and the CUDA utilities for
NVIDIA are introduced in Chapter 8. In the Torque and PBS Pro workload managers, GPU usage is
treated like the attributes of a resource which the cluster administrator will have pre-configured accord-
ing to local requirements.
For further details on resource list directives, the Torque and PBS Pro user documentation should be
consulted.
The wrapper tracks processes that the workload manager launches. When it sees processes that the
workload manager is unable to clean up after the job is over, it carries out the cleanup instead. Using
cm-launcher is recommended if jobs that do not get cleaned up correctly are an issue for the user or
administrator.
Example
#!/bin/bash
#PBS -l walltime=1:00
#PBS -l nodes=4
echo -n "I am on: "
hostname;
echo finding ssh-accessible nodes:
for node in $(cat ${PBS_NODEFILE}); do
  echo -n "running on: "
  /usr/bin/ssh $node hostname
done
The directive specifying walltime means the script runs at most for 1 minute. The ${PBS_NODEFILE}
array used by the script is created and appended with hosts by the queuing system. The script illustrates
how the workload manager generates a ${PBS_NODEFILE} array based on the requested number of
nodes, and which can be used in a job script to spawn child processes. When the script is submitted, the
output from the log will look like:
I am on: node001
finding ssh-accessible nodes:
running on: node001
running on: node002
running on: node003
running on: node004
This illustrates that the job starts up on a node, and that no more than the number of nodes that were
asked for in the resource specification are provided.
The list of all nodes for a cluster can be found using the pbsnodes command (section 7.2.6).
Using InfiniBand
A sample PBS script for InfiniBand is:
#!/bin/bash
#!
#! Sample PBS file
#!
#! Name of job
#PBS -N MPI
#PBS -l nodes=8:ppn=4,walltime=02:00:00
#! Work directory
workdir="<work dir>"
###############################################################
### You should not have to change anything below this line ####
###############################################################
cd $workdir
In the preceding script, no machine file is needed, since it is automatically built by the workload
manager and passed on to the mpirun parallel job launcher utility. The job is given a unique ID and run
in parallel on the nodes based on the resource specification.
7.1.5 Links To Other Resources About Job Scripts In Torque And PBS Pro
A number of useful links are:
• Torque examples:
http://bmi.cchmc.org/resources/software/torque/examples
Users can pre-load particular environment modules as their default using the “module init*”
commands (section 2.3.3).
For example, a job script called mpirun.job with all the relevant directives set inside the script,
may be submitted as follows:
Example
$ qsub mpirun.job
Example
The man page for qsub describes these and other options. The options correspond to PBS directives
in job scripts (section 7.1.1). If a particular item is specified by a qsub option as well as by a PBS directive,
then the qsub option takes precedence.
qstat Basics
The main component is qstat, which has several options. In this example, the most frequently used
options are discussed.
In PBS/Torque, the command “qstat -an” shows what jobs are currently submitted or running
on the queuing system. An example output is:
The output shows the Job ID, the user who owns the job, the queue, the job name, the session ID for a
running job, the number of nodes requested, the number of CPUs or tasks requested, the time requested
(-l walltime), the job state (S) and the elapsed time. In this example, one job is seen to be running (R),
and one is still queued (Q). The -n parameter causes nodes that are in use by a running job to display at
the end of that line.
Possible job states are:
$ qstat -q
server: master.cm.cluster
$ showq
ACTIVE JOBS-----------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME
IDLE JOBS-------------
3 Idle Jobs
BLOCKED JOBS----------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
Job Details With checkjob The checkjob command (only for Maui) is particularly useful for check-
ing why a job has not yet executed. For a job that has an excessive memory requirement, the output looks
something like:
checking job 65
State: Idle
Creds: user:fred group:fred class:shortq qos:DEFAULT
WallTime: 00:00:00 of 00:01:00
SubmitTime: Tue Sep 13 15:22:44
(Time Queued Total: 2:53:41 Eligible: 2:53:41)
Total Tasks: 1
$ qdel <jobid>
The job ID is printed to the terminal when the job is submitted. To get the job ID of a job if it has
been forgotten, the following can be used:
$ qstat
or
$ showq
np = 3
ntype = cluster
status = rectime=1317911358,varattr=,jobs=96...ncpus=2...
gpus = 1
node002.cm.cluster
state = free
np = 3
...
gpus = 1
...
node002.cm.cluster
Mom = node002.cm.cluster
ntype = PBS
state = free
...
...
8.1 Packages
A number of different GPU-related packages are included in Bright Cluster Manager. Versions sup-
ported are CUDA 8.0 and CUDA 9.0.
For version 9.0, the packages include:
The version implementation depends on how the system administrator has configured CUDA.
The toolkit comes with the necessary tools and the NVIDIA compiler wrapper to compile CUDA C
code.
Extensive documentation on how to get started, the various tools, and how to use the CUDA suite is
in the $CUDA_INSTALL_PATH/doc directory.
Accordingly, both the host and device manage their own memory space, and it is possible to copy data
between them. The CUDA and OpenCL Best Practices Guides in the doc directory, provided by the
CUDA toolkit package, have more information on how to handle both platforms and their limitations.
The nvcc command by default compiles code and links the objects for both the host system and the
GPU. The nvcc command distinguishes between the two and it can hide the details from the developer.
To compile the host code, nvcc will use gcc automatically.
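For example, a straightforward compile-and-link step might look like this (testcode.cu is a placeholder source file name):
$ nvcc testcode.cu -o testcode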
• -arch=sm_13: This can be enabled if the CUDA device supports compute capability 1.3, which
includes double-precision
If double-precision floating-point is not supported or the flag is not set, warnings such as the follow-
ing will come up:
warning : Double is not supported. Demoting to float
The nvcc documentation manual, “The CUDA Compiler Driver NVCC” has more information on
compiler options.
The CUDA SDK has more programming examples and information accessible from the file
$CUDA_SDK/C/Samples.html.
For OpenCL, code compilation can be done by linking against the OpenCL library:
gcc test.c -lOpenCL
g++ test.cpp -lOpenCL
nvcc test.c -lOpenCL
Example
8.5.2 nvidia-smi
The NVIDIA System Management Interface command, nvidia-smi, can be used to allow exclusive
access to the GPU. This means only one application can run on a GPU. By default, a GPU will allow
multiple running applications.
Syntax:
nvidia-smi [OPTION1 [ARG1]] [OPTION2 [ARG2]] ...
• List GPUs
• Select a GPU
After setting the compute rule on the GPU, the first application which executes on the GPU will
block out all others attempting to run. This application does not necessarily have to be the one started
by the user that set the exclusivity lock on the GPU!
To list the GPUs, the -L argument can be used:
$ nvidia-smi -L
GPU 0: (05E710DE:068F10DE) Tesla T10 Processor (S/N: 706539258209)
GPU 1: (05E710DE:068F10DE) Tesla T10 Processor (S/N: 2486719292433)
• 1 - Exclusive thread mode (only one compute context is allowed to run on the GPU, usable from
one thread at a time)
• 2 - Prohibited mode (no compute contexts are allowed to run on the GPU)
• 3 - Exclusive process mode (only one compute context is allowed to run on the GPU, usable from
multiple threads at a time)
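A hedged sketch of applying such a rule to the first GPU (setting the mode typically requires administrative privileges, and the accepted flags and mode values vary between driver versions):
$ nvidia-smi -i 0 -c 1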
In this example, GPU0 is locked, and there is a running application using GPU0. A second applica-
tion attempting to run on this GPU will not be able to run on this GPU.
$ histogram --device=0
main.cpp(101) : cudaSafeCall() Runtime API error :
no CUDA-capable device is available.
• $CUDA_SDK/C/lib
• $CUDA_SDK/OpenCL/common/lib
Other applications may also refer to them, and the toolkit libraries have already been pre-configured
accordingly. However, they need to be compiled prior to use. Depending on the cluster, this might have
already have been done.
[fred@demo ~] cd
[fred@demo ~] cp -r $CUDA_SDK .
[fred@demo ~] cd $(basename $CUDA_SDK); cd C
[fred@demo C] make
[fred@demo C] cd ../OpenCL
[fred@demo OpenCL] make
• comparing data arrays (typically used for comparing GPU results with CPU results)
• timers
Example
/*
CUDA example
"Hello World" using shift13, a rot13-like function.
Encoded on CPU, decoded on GPU.
// CPU shift13
int len = sizeof(s);
for (int i = 0; i < len; i++) {
s[i] += 13;
}
printf("String encoded on CPU as: %s\n", s);
The preceding code example may be compiled and run on the GPUs of a node with a GPU with:
[fred@node001 ~]$ module load shared cuda90/toolkit/9.0.176
[fred@node001 ~]$ nvcc hello.cu -o hello
[fred@node001 ~]$ ./hello
String for encode/decode: Hello World!
String encoded on CPU as: Uryy|-d|yq.
...
String encoded on CPU as: Uryy|-d|yq.
...
String decoded on GPU as: Hello World!
[fred@node001 ~]$
The number of characters displayed in the encoded string appears to be less than expected because there
are unprintable characters in the encoding, since the cipher used is not exactly rot13.
8.5.5 OpenACC
OpenACC (http://www.openacc-standard.org) is a new open parallel programming standard
aiming at simplifying the programmability of heterogeneous CPU/GPU computing systems. OpenACC
allows parallel programmers to provide OpenACC directives to the compiler, identifying which areas of
code to accelerate. This frees the programmer from carrying out time-consuming modifications to the
original code itself. By pointing out parallelism to the compiler, directives get the compiler to carry out
the details of mapping the computation onto the accelerator.
Using OpenACC directives requires a compiler that supports the OpenACC standard.
In the following example, where π is calculated, adding the #pragma directive is sufficient for the
compiler to produce code for the loop that can run on either the GPU or CPU:
Example
#include <stdio.h>
#define N 1000000
int main(void) {
double pi = 0.0f; long i;
#pragma acc parallel loop reduction(+:pi)
for (i=0; i<N; i++) {
double t= (double)((i+0.5)/N);
pi +=4.0/(1.0+t*t);
}
printf("pi=%16.15f\n",pi/N);
return 0;
}
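This could be compiled with, for example, a PGI compiler that supports OpenACC (pi.c is a placeholder file name for the preceding code):
$ pgcc -acc -Minfo=accel pi.c -o pi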
/usr/linux-k1om-<version>
To build a native application on an x86_64 host, the compiler tools prefixed by x86_64-k1om-linux-
have to be used:
Example
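A sketch, assuming version 4.7 of the cross-compiler toolchain installed under /usr/linux-k1om-4.7 and a placeholder source file hello.c:
$ /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-gcc hello.c -o hello.mic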
If the GNU autoconf tool is used, then the following shell commands can be used to build the appli-
cation:
MIC_ARCH=k1om
GCC_VERSION=4.7
GCC_ROOT=/usr/linux-${MIC_ARCH}-${GCC_VERSION}
./configure \
CXX="${GCC_ROOT}/bin/x86_64-${MIC_ARCH}-linux-g++" \
CXXFLAGS="-I${GCC_ROOT}/linux-${MIC_ARCH}/usr/include" \
CXXCPP="${GCC_ROOT}/bin/x86_64-${MIC_ARCH}-linux-cpp" \
CC="${GCC_ROOT}/bin/x86_64-${MIC_ARCH}-linux-gcc" \
CFLAGS="-I${GCC_ROOT}/linux-${MIC_ARCH}/usr/include" \
CPP="${GCC_ROOT}/bin/x86_64-${MIC_ARCH}-linux-cpp" \
LDFLAGS="-L${GCC_ROOT}/linux-${MIC_ARCH}/usr/lib64" \
LD="${GCC_ROOT}/bin/x86_64-${MIC_ARCH}-linux-ld" \
--build=x86_64-redhat-linux \
--host=x86_64-${MIC_ARCH}-linux \
--target=x86_64-${MIC_ARCH}-linux
make
/opt/intel/composer_xe_<version>
The -mmic switch generates code for the MIC on the non-MIC host. For example:
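A minimal sketch, assuming the Intel compiler environment has been loaded and that the source file is called hello.c (both assumptions):
[user@bright81 ~]$ icc -mmic hello.c -o hello.mic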
If the GNU autoconf application is used instead, then the environment variables are like those de-
fined earlier in section 9.1.1.
Detailed information on building a native application for the Intel Xeon Phi coprocessor
using the Intel compilers can be found at http://software.intel.com/en-us/articles/
building-a-native-application-for-intel-xeon-phi-coprocessors.
Example
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <unistd.h>
int main(void) {
    char hostname[HOST_NAME_MAX];
    #pragma offload target(mic)
    {
        gethostname(hostname, HOST_NAME_MAX);
        printf("My hostname is %s\n", hostname);
    }
    exit(0);
}
The code is compiled with standard command line arguments, and no MIC-related switch is required, because
offloading is enabled by default in Intel Compiler version 2013 and higher:
Example
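A minimal sketch, assuming the preceding source has been saved as hello_offload.c (an assumed name) and the Intel compiler module is loaded:
[user@bright81 ~]$ icc hello_offload.c -o hello_offload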
To get debug information when an offloaded region is executed, the OFFLOAD_REPORT environment
variable can be used. Possible values, in order of increasing verbosity, are 1, 2, or 3. Setting it to the empty
string disables debug messages:
Example
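A minimal sketch, reusing the assumed hello_offload binary from above; the offload report lines themselves are not reproduced here:
[user@bright81 ~]$ export OFFLOAD_REPORT=2
[user@bright81 ~]$ ./hello_offload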
My hostname is node001-mic0
[user@bright81 ~]$
manager schedules jobs by determining the order in which the jobs will use the MIC cards. This is the
recommended way to use MIC cards on a multiuser cluster, but currently only Slurm supports both
the native and offload modes. All other workload managers support only offload mode, by using a
preconfigured generic resource.
#!/bin/sh
#SBATCH --partition=defq
#SBATCH --gres=mic:1
The job is submitted as usual using the sbatch or salloc/srun commands of Slurm (Chapter 5).
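For example, assuming the preceding script has been saved as mic-offload-job.sh (the script name and job number are illustrative):
Example
[user@bright81 ~]$ sbatch mic-offload-job.sh
Submitted batch job 42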
Native mode—non-distributed job: The user creates a job script and sets a constraint "miccard". For
example, the following job script runs the dgemm test directly inside the MIC:
#!/bin/sh
#SBATCH --partition=micq
#SBATCH --constraint="miccard"
Native mode—MPI job: The user creates a job script in the same way as for a non-distributed job, but
with the --nodes parameter specified. For example, the next job script executes the Intel IMB-MPI1 bench-
mark on two MICs using RDMA calls:
#!/bin/sh
#SBATCH --partition=micq
#SBATCH --constraint="miccard"
#SBATCH --nodes=2
SLURM_BIN=/cm/shared/apps/slurm/current/k1om-arch/bin
MPI_DIR=/cm/shared/apps/intel/mpi/current/mic
MPI_RUN=$MPI_DIR/bin/mpirun
APP=$MPI_DIR/bin/IMB-MPI1
APP_ARGS="PingPong"
MPI_ARGS="-genv I_MPI_DAPL_PROVIDER ofa-v2-scif0 -genv I_MPI_FABRICS=shm:dapl -perhost 1"
export LD_LIBRARY_PATH=/lib64:$MPI_DIR/lib:$LD_LIBRARY_PATH
export PATH=$SLURM_BIN:$MPI_DIR/bin/:$PATH
$MPI_RUN $MPI_ARGS $APP $APP_ARGS
The DAPL provider value (the I_MPI_DAPL_PROVIDER argument) should be set to ofa-v2-scif0
when an application needs MIC-to-MIC or MIC-to-HOST RDMA communication.
All Slurm job examples given here can be found on a cluster in the following directory:
/cm/shared/examples/workload/slurm/jobscripts/
At the time of writing, PBS Pro (version 12.1) does not pin tasks to a specific MIC device. That is,
the OFFLOAD_DEVICES environment variable is not set for a job.
However, the mics consumable resource can be used only when TORQUE is used together with
MOAB; otherwise the job is never scheduled. This behavior is subject to change, but has been verified
with MAUI 3.3.1 and pbs_sched 4.2.2. When MOAB is used, the user can submit a job with offload code
regions, as shown in the following example:
Example
#!/bin/sh
#PBS -N TEST_MIC_OFFLOAD
#PBS -l nodes=1:mics=2
./hello_mic
• A container is an extremely lightweight virtualized operating system that runs without the un-
needed extra emulated hardware components of a regular virtualized operating system.
• A containerized application runs within a container, and it only accesses files, environment vari-
ables, and libraries within the container, unless volumes are mounted and used.
• A containerized application provides services to other software or users. Kubernetes thus manages
containerized applications as a service, and is aware of the container states and resources used.
• a basic, Kubernetes 101 tutorial about kubectl, pods, and volumes at https://kubernetes.
io/docs/user-guide/walkthrough/
Familiarity with the concepts in those tutorials is recommended before continuing with the rest of
this chapter.
• the API server secure port should bind to the external network. This is the default in
Bright Cluster Manager, but the port may be blocked in some other part of the network
configuration.
• the hostname of the head node should be present in the Kubernetes certificates. This is achieved
by specifying the external FQDN when prompted for it during the setup.
Steps:
• On the PC, kubectl for Kubernetes 1.9.2 should be downloaded from the head node. It can be
downloaded to a directory in the user path, such as /usr/bin
Example
$ rsync <username>@<headnode>:/cm/local/apps/kubernetes/current/bin/kubectl \
<directory in the user path>
• The user can make a .kube directory on the PC. The Kubernetes configuration for the user <user-
name> can then be picked up from the head node <headnode>. This includes the keys and the
certificates:
$ mkdir ~/.kube
$ rsync <username>@<headnode>:.kube/config ~/.kube/config
The user can check kubectl is able to talk to the cluster by running the following commands:
Example
$ kubectl cluster-info
$ kubectl get nodes
$ kubectl get all
Example
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  completions: 8
  parallelism: 1
  template:
    metadata:
      name: pi
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(4000)"]
      restartPolicy: Never
If the administrator has allowed the user access via Kubernetes policies, and has made the user a
Kubernetes user, then the job can be submitted with:
$ kubectl apply -f pi-job.yml
job "pi" created
If the job is horizontally scalable, then the number of replicas can be scaled with:
$ kubectl scale job/pi --replicas=4
job "pi" scaled
Further information on the following job topics can be found at the associated links:
• job: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
A redis-persistent-storage.yml file can be created for the Persistent Volume Claim (PVC),
with the following content:
Example
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-persistent-storage
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: fast
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: redis-master
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: redis
        role: master
    spec:
      containers:
      - name: redis-master
        image: gcr.io/google_containers/redis:e2e
        args: [
          '/usr/local/bin/redis-server',
          '--appendonly', 'yes',
          '--appendfsync', 'always']
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 6379
        volumeMounts:
        - name: redis-storage
          mountPath: /data
      volumes:
      - name: redis-storage
        persistentVolumeClaim:
          claimName: redis-persistent-storage
The Redis master deployment with PVC can then be created on Ceph persistent storage using:
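A minimal sketch, assuming the PVC above is saved as redis-persistent-storage.yml and the Deployment as redis-master.yml (the latter file name is an assumption):
$ kubectl apply -f redis-persistent-storage.yml
persistentvolumeclaim "redis-persistent-storage" created
$ kubectl apply -f redis-master.yml
deployment "redis-master" created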
Data can now be stored on Redis. If all the pods are then deleted after data has been stored on Redis,
they can be recreated without any data loss.
Example
$ helm version
Choices can be made from among the charts at the official repository at https://github.com/
kubernetes/charts. For example, GitLab and WordPress can be installed with:
$ helm install stable/gitlab --name my-gitlab
• BootStrap: The name of the Linux distribution type module. This informs Singularity which
distribution module should be used to parse the commands in the definition file. At present the
following 4 modules are supported:
– yum: Bootstraps distributions such as Red Hat, Centos, and Scientific Linux.
– debootstrap: Bootstraps Debian- and Ubuntu-based distributions.
– arch: Bootstraps Arch Linux.
– docker: Bootstraps Docker. It creates a core operating system image based on an image
hosted on a particular Docker Registry server. For the docker module, several other key-
words may also be defined:
* From: this keyword defines the string of the registry name used for this image in the
format [name]:[version].
* IncludeCmd: use the Docker-defined Cmd as the %runscript, if the Cmd is defined,
* Registry: sets the Docker registry name,
* Token: sets the docker authorization token.
• MirrorURL: the URL to get packages from.
• OSVersion: this keyword must be defined as the alphabet-character string associated with the
version of the distribution you wish to use. For example: trusty or stable.
• Include: install additional packages.
• %setup: this section blob is a bash scriptlet that is executed on the host outside the container,
during bootstrapping.
• %post: this scriptlet section is executed once from inside the container, during bootstrapping.
• %runscript: the scriptlet that is executed inside the container when it is started.
• %test: this scriptlet section runs at the very end of the bootstrapping process and validates the
container.
Once the definition file is ready, and if the image file exists, then a user can run the Singularity
command bootstrap to install the operating system into the container image.
If there is no bootstrap image already, then it can be created with the singularity create com-
mand:
Example
[root@bright81 ~]$ mkdir /cm/shared/sing-images
[root@bright81 ~]$ singularity create --size 1024 /cm/shared/sing-images/centos7.img
Creating a new image with a maximum size of 1024MiB...
Executing image create helper
Formatting image with ext3 file system
Done.
[root@bright81 ~]$
Note that the image must be created and bootstrapped by a privileged user, even if it is always
supposed to be executed with regular user permissions. If, as is the usual case, users are not allowed
to use root permissions on a cluster, then they can create and bootstrap the new image on their own
computers. The image can then be transferred to the cluster.
A very simple image definition file for CentOS can look like this:
Example
[root@bright81 ~]$ cat centos7.def
BootStrap: yum
OSVersion: 7
MirrorURL: http://mirror.centos.org/centos-%OSVERSION/%OSVERSION/os/$basearch/
Include: yum
%post
echo "Installing extra packages..."
yum install vim util-linux -y
%runscript
echo "Hello from container!"
cat /etc/os-release
To bootstrap the image, the user runs the singularity bootstrap command as root:
Example
[root@bright81 ~]$ module load singularity
[root@bright81 ~]$ singularity bootstrap /cm/shared/sing-images/centos7.img centos7.def
<...>
Complete!
Done.
[root@bright81 ~]$
The image can be placed in any directory, but it makes sense to share it among compute nodes.
/cm/shared/sing-images is therefore a sensible location for keeping the container images.
The image created can then be executed as a regular binary:
Example
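A minimal sketch of such a run, using the centos7.img image built earlier; only the tail of the /etc/os-release output appears below:
[user@bright81 ~]$ module load singularity
[user@bright81 ~]$ /cm/shared/sing-images/centos7.img
Hello from container!
...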
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
[user@bright81 ~]$
The default 768MiB image size can be changed with the --size option of the singularity create
and singularity expand commands. The expand command increases the image size; there is no
standard way of decreasing the image size.
The following example shows an image definition file that adds the /etc/services file and the
/bin/grep binary from the filesystem of the host to the container image. When a user runs the image that is
created, it greps the packaged services file for the arguments passed to the image:
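A definition file along these lines might look as follows. This is a sketch rather than the exact file that produced the output below; during %setup the $SINGULARITY_ROOTFS variable points at the container filesystem on the host:
BootStrap: yum
OSVersion: 7
MirrorURL: http://mirror.centos.org/centos-%OSVERSION/%OSVERSION/os/$basearch/

%setup
    # executed on the host: copy the host files into the container filesystem
    cp /etc/services $SINGULARITY_ROOTFS/etc/services
    cp /bin/grep $SINGULARITY_ROOTFS/bin/grep

%runscript
    # executed when the image is run: search the packaged services file
    exec grep "$@" /etc/services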
Example
+ cp /etc/services /var/singularity/mnt/final/etc/services
Done.
The container can now be run as if it were a simple script. If a string is passed as an argument, the
/etc/services file that is packaged inside the container is searched for the string:
Example
The tcp RPM package in this example is needed to ensure proper environment modules behaviour.
The environment modules are needed to simplify setting the MPICH environment.
After bootstrapping, users can use the created image without root permission:
Example
Singularity.mpich.img> bash
Singularity.mpich.img> module load mpich/ge/gcc
Singularity.mpich.img> ./hello_mpi
Hello MPI! Process 0 of 1 on bright81
Singularity.mpich.img> exit
[user@bright81 ~]$
Example
Other MPI implementations may require a different mpirun or similar command. The idea, however,
remains the same: the Singularity image is used as a regular binary that has been built, for example, with
mpicc.
By default, the container image is mounted within the container as a read-only filesystem. The user
can change this with the -w option passed to the singularity command:
Example
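A minimal sketch, reusing the centos7.img image from earlier (write access to the image file itself is needed, so this is typically done as root):
[root@bright81 ~]$ singularity shell -w /cm/shared/sing-images/centos7.img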
Further documentation on creating and using Singularity containers can be found at https://www.
sylabs.io/docs/.
• a Message Of The Day. The administrator may put up important messages for users here
• Category: The node category that the node has been allocated by the administrator (by default it
is default)
• Distribution: The Hadoop distribution used, for example, Cloudera, Hortonworks, Pivotal
HD
• Total VMs with errors: total number of virtual machines that are not running
• Total routers: Total number of routers. These usually interconnect networks
• Projects: The projects are listed in sortable columns, by name and UUID,
The Kubernetes cluster is a subset of a Bright cluster, and is the part of the Bright cluster that runs and
controls pods. The items shown are:
• Services: The number of services that are served by the Kubernetes cluster
• Replication Controllers: The number of replication controllers that run on the Kubernetes
cluster
• Persistent Volumes: The number of persistent volumes created for the pods of the Kubernetes
cluster
• Persistent Volume Claims: The number of persistent volume claims created on the Kuber-
netes cluster
• Workload Management Metrics: The following workload manager metrics can be viewed:
– RunningJobs
– QueuedJobs
– FailedJobs
– CompletedJobs
– EstimatedDelay
– AvgJobDuration
– AvgExpFactor
– OccupationRate
– NetworkBytesRecv
– NetworkBytesSent
– DevicesUp
– NodesUp
– TotalNodes
– TotalMemoryUsed
– TotalSwapUsed
– PhaseLoad
– CPUCoresAvailable
– GPUAvailable
– TotalCPUUser
– TotalCPUSystem
– TotalCPUIdle
• Datapoints: The number of points used for the graph can be specified. The points are interpolated
if necessary
• Interval (Hours): The period over which the data points are displayed
The meanings of the metrics are covered in Appendix G of the Administrator Manual.
The Update button must be clicked to display any changes made.
1. Less structured input: Key value pairs are used as records for the data sets instead of a database.
2. Scale-out rather than scale-up design: For large data sets, if the size of a parallelizable problem
increases linearly, the corresponding cost of scaling up a single machine to solve it tends to grow
exponentially, simply because the hardware requirements tend to get exponentially expensive. If,
however, the system that solves it is a cluster, then the corresponding cost tends to grow linearly
because it can be solved by scaling out the cluster with a linear increase in the number of processing
nodes.
Scaling out can be done, with some effort, for database problems, using a parallel relational
database implementation. However scale-out is inherent in Hadoop, and therefore often easier
to implement with Hadoop. The Hadoop scale-out approach is based on the following design:
• Clustered storage: Instead of a single node with a special, large, storage device, a distributed
filesystem (HDFS) using commodity hardware devices across many nodes stores the data.
• Clustered processing: Instead of using a single node with many processors, the parallel pro-
cessing needs of the problem are distributed out over many nodes. The procedure is called
the MapReduce algorithm, and is based on the following approach:
– The distribution process “maps” the initial state of the problem into processes that are sent out
to the nodes, ready to be handled in parallel.
– Processing tasks are carried out on the data at nodes themselves.
– The results are “reduced” back to one result.
3. Automated failure handling at application level for data: Replication of the data takes place
across the DataNodes, which are the nodes holding the data. If a DataNode has failed, then another
node which has the replicated data on it is used instead automatically. Hadoop switches over
quickly in comparison to replicated database clusters due to not having to check database table
consistency.
The deployment of Hadoop in Bright Cluster Manager is covered in the Bright Cluster Manager Big
Data Deployment Manual. For the end user of Bright Cluster Manager, this section explains how jobs and
data can be run within such a deployment.
13.2 Preliminaries
Before running any of the commands in this section:
• Users should make sure that they have access to the proper HDFS instance. If not, the system
administrator can give users access, as explained in the Big Data Deployment Manual.
• The correct Hadoop module should be loaded. The module provides the environment within
which the commands can run. The module name can vary depending on the instance name and
the Hadoop distribution. The hadoop modules that are available for loading can be checked with:
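A minimal sketch; the module names that appear vary per cluster and per Hadoop instance:
[user@bright81 ~]$ module avail hadoop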
Monitoring Applications
[user@bright81 ~]$ yarn application -list
Killing An Application
[user@bright81 ~]$ yarn application -kill <application ID>
Monitoring A Job
[user@bright81 ~]$ hadoop job -status <job ID>
Killing A job
[user@bright81 ~]$ hadoop job -kill <job ID>
Example
Finishing Up
Once the job is finished, the output directory can be downloaded back to the home directory of the user:
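A minimal sketch, assuming the job wrote its results to an HDFS directory called output (the directory names are illustrative):
[user@bright81 ~]$ hdfs dfs -get output ~/output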
The output file can be removed from the HDFS if not needed:
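Again with the illustrative output directory name:
[user@bright81 ~]$ hdfs dfs -rm -r output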
• http://www.cloudera.com/content/cloudera/en/documentation.html#ClouderaDocumentation
• http://docs.hortonworks.com/
• http://docs.pivotal.io/
• http://hadoop.apache.org/docs/r2.7.0/
• http://hadoop.apache.org/docs/r1.2.1/
Additionally, most commands print a usage or help text when invoked without parameters.
Example
or
Example
Loading the spark module adds the spark-submit command to $PATH. Jobs can be submitted to
Spark with spark-submit.
Example
$ spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Options:
[..]
The --master option: is used to specify the master URL <master-url>, which can take one of the
following forms:
• yarn-client: Connect to a YARN cluster in client mode. The cluster location is found based on
the variables HADOOP_CONF_DIR or YARN_CONF_DIR.
• yarn-cluster: Connect to a YARN cluster in cluster mode. The cluster location is found based
on the variables HADOOP_CONF_DIR or YARN_CONF_DIR.
The --deploy-mode option: specifies the deployment mode, <deploy-mode>, of the Spark appli-
cation during job submission. The possible deployment modes are:
• client: The driver process runs locally on the host used to submit the job.
spark-submit Examples
Some spark-submit examples for a SparkPi submission are now shown. The jar file for this can be
found under $SPARK_PREFIX/lib/. $SPARK_PREFIX is set by loading the relevant Spark module.
Example
Running a job on a Spark standalone cluster in cluster deploy mode: The job should run on 3 nodes
and the master is node001.
$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master spark://10.141.255.254:7070 --deploy-mode cluster \
    --num-executors 3 $SPARK_PREFIX/lib/spark-examples-*.jar
Spark Documentation
The official Spark documentation is available at http://spark.apache.org/docs/latest/.
1. Bright-managed instances: Here the cluster provides virtual Bright nodes, called vnodes, for
users. Vnodes are not really that different from regular nodes as far as the end user is concerned,
and in any case the cluster administrator typically sets up how they can be used. The end user
typically simply gets on with using them without having to think much about it.
2. User instances: Here the cluster provides the user with the ability to start an instance under
OpenStack. The instance can be created from a variety of pre-packaged cloud images, and can be handled
with the standard OpenStack commands or with the OpenStack Horizon dashboard.
• independent of the Bright account password, but initially the same password is used. The user
can later change the password, or keep it the same.
Which of these three options it is depends on how the cluster administrator has configured the sys-
tem.
Example
The openstack command in the preceding example assumes that the .openstackrc file has been
generated, and sourced, in order to provide the OpenStack environment. The cluster administrator
typically configures the system so that the .openstackrc file is automatically generated for the user,
so that it can be sourced with:
Example
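A minimal sketch, assuming the file has been generated in the home directory of fred:
[fred@bright81 ~]$ . ~/.openstackrc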
Sourcing means running the file so that the environment variables in the file are set in the shell on
return. The shell that fred is logged into either needs this environment to be in place for OpenStack
actions to work, or it needs the relevant options to be provided by fred to the oslc utility during
execution.
If all goes well, then the image is installed and can be seen by the user via OpenStack Horizon,
by navigating to the Images pane, or by using the URI
http://<IP address>:10080/dashboard/project/images/ directly (figure 15.1).
15.2.2 Creating The Networking Components For The OpenStack Image To Be Launched
The networking components are needed due to a default policy of network isolation. Only after the
components are in place can the image run within OpenStack on a virtual machine.
Similarly, in the next screen a subnet called fredsubnet can be configured, along with a gateway
address for the subnet (figure 15.4):
• a range of addresses on the subnet is earmarked for DHCP assignment to devices on the subnet
At the end of a successful network creation, when the dialog box has closed, the screen should look
similar to figure 15.6:
On launching, the image will run. However, it will only be accessible via the OpenStack console, which
has some quirks, such as only working well in fullscreen mode in some browsers.
It is more pleasant and practical to login via a terminal client such as ssh. How to configure this is
described next.
The router can be given a name, and connected to the external network of the cluster.
Next, an extra interface for connecting to the network of the instance can be added by clicking on the
router name, which brings up the Router Details page. Within the Interfaces subtab, the Add
Interface button on the right hand side opens up the Add Interface dialog box (figure 15.8):
After connecting the network of the instance, the router interface IP address should be the gateway
of the network that the instance is running on (figure 15.9):
Figure 15.9: End User Router Interface Screen After Router Configuration
The state of the router after floating IP address configuration: To check that the router is reachable from
the head node, the IP address of the router interface connected to the external network of the cluster should
show a ping response.
The IP address can be seen in the Overview subtab of the router (figure 15.10):
Example
Security group rules to allow a floating IP address to access the instance: The internal interface to
the instance is still not reachable via the floating IP address. That is because by default there are security
group rules that set up iptables to restrict ingress of packets across the network node. A network node
is a routing node that is part of Bright Cluster Manager OpenStack.
The rules can be managed by accessing the Compute resource, then selecting the Access &
Security page. Within the Security Groups subtab there is a Manage Rules button. Clicking
the button brings up the Manage Security Group Rules table (figure 15.11):
Clicking on the Add Rule button brings up a dialog. To let incoming pings work, the rule All
ICMP can be added. Further restrictions for the rule can be set in the other fields of the dialog for the
rule (figure 15.12).
Floating IP address association with the instance: The floating IP address can now be associated with
the instance. One way to do this is to select the Compute resource in the navigation window, and select
Instances. In the Instances window, the button for the instance in the Actions column allows
an IP address from the floating IP address pool to be associated with the IP address of the instance
(figure 15.13).
After association, the instance is pingable from the external network of the head node.
Example
[fred@bright81 ]$ ping -c1 192.168.100.10
PING 192.168.100.10 (192.168.100.10) 56(84) bytes of data.
64 bytes from 192.168.100.10: icmp_seq=1 ttl=63 time=1.54 ms
If SSH is allowed in the security group rules instead of ICMP, then fred can run ssh and log into the
Cirros instance, using the default username/password cirros/cubswin:)
Example
[fred@bright81 ~]$ ssh cirros@192.168.100.10
cirros@192.168.100.10's password:
$
Setting up SSH keys: Setting up SSH key pairs for a user fred allows a login to be done using key
authentication instead of passwords. The standard OpenStack way of setting up key pairs is to either
import an existing public key, or to generate a new public and private key. This can be carried out from
the Compute resource in the navigation window, then selecting the Access & Security page. Within
the Key Pairs subtab there are the Import Key Pair button and the Create Key Pair button.
• importing a key option: For example, user fred created in Bright Cluster Manager as in this
chapter has his public key in /home/fred/.ssh/id_dsa.pub on the head node. Pasting the
text of the key into the import dialog, and then saving it, means that the user fred can now login
as the user cirros without being prompted for a password from the head node. This is true for
images that are cloud instances, of which the cirros instance is an example.
• creating a key pair option: Here a pair of keys is generated for a user. A PEM container file with
just the private key <PEM file>, is made available for download to the user, and should be placed
in a directory accessible to the user, on any host machine that is to be used to access the instance.
The corresponding public key is stored by OpenStack’s Keystone, and the private key discarded
by the generating machine. The downloaded private key should be stored where it can be accessed
by ssh, and should be readable and writable by the user only. If its permissions have changed,
then running chmod 600 <PEM file> on it makes it compliant. The user can then log in to
the instance using, for example, ssh -i <PEM file> cirros@192.168.100.10, without being
prompted for a password.
The openstack keypair options are the openstack utility equivalent for the preceding Horizon
operations.
/*
"Hello World" Type MPI Test Program
*/
#include <mpi.h>
#include <stdio.h>
#include <string.h>
/* all MPI programs start with MPI_Init; all 'N' processes exist thereafter */
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs); /* find out how big the SPMD world is */
MPI_Comm_rank(MPI_COMM_WORLD,&myid); /* and this processes' rank is */
/* At this point, all the programs are running equivalently, the rank is used to
distinguish the roles of the programs in the SPMD model, with rank 0 often used
specially... */
if(myid == 0)
{
printf("%d: We have %d processors\n", myid, numprocs);
for(i=1;i<numprocs;i++)
{
sprintf(buff, "Hello %d! ", i);
MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
}
for(i=1;i<numprocs;i++)
{
#include <mpi.h>
#define WORKTAG 1
#define DIETAG 2
int main(int argc, char *argv[])
{
int myrank;
MPI_Init(&argc, &argv); /* initialize MPI */
MPI_Comm_rank(
MPI_COMM_WORLD, /* always use this */
&myrank); /* process rank, 0 thru N-1 */
if (myrank == 0) {
head();
} else {
computenode();
}
MPI_Finalize(); /* cleanup MPI */
}
head()
{
int ntasks, rank, work;
double result;
MPI_Status status;
MPI_Comm_size(
MPI_COMM_WORLD, /* always use this */
&ntasks); /* #processes in application */
/*
* Seed the compute nodes.
*/
for (rank = 1; rank < ntasks; ++rank) {
work = /* get_next_work_request */;
MPI_Send(&work, /* message buffer */
1, /* one data item */
MPI_INT, /* data item is an integer */
rank, /* destination process rank */
WORKTAG, /* user chosen message tag */
MPI_COMM_WORLD);/* always use this */
}
/*
* Receive a result from any compute node and dispatch a new work
* request until work requests have been exhausted.
*/
work = /* get_next_work_request */;
while (/* valid new work request */) {
MPI_Recv(&result, /* message buffer */
1, /* one data item */
MPI_DOUBLE, /* of type double real */
MPI_ANY_SOURCE, /* receive from any sender */
MPI_ANY_TAG, /* any type of message */
MPI_COMM_WORLD, /* always use this */
&status); /* received message info */
MPI_Send(&work, 1, MPI_INT, status.MPI_SOURCE,
WORKTAG, MPI_COMM_WORLD);
work = /* get_next_work_request */;
}
/*
* Receive results for outstanding work requests.
*/
for (rank = 1; rank < ntasks; ++rank) {
MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
MPI_ANY_TAG, MPI_COMM_WORLD, &status);
}
/*
* Tell all the compute nodes to exit.
*/
for (rank = 1; rank < ntasks; ++rank) {
MPI_Send(0, 0, MPI_INT, rank, DIETAG, MPI_COMM_WORLD);
}
}
computenode()
{
double result;
int work;
MPI_Status status;
for (;;) {
MPI_Recv(&work, 1, MPI_INT, 0, MPI_ANY_TAG,
MPI_COMM_WORLD, &status);
/*
Processes are represented by a unique rank (integer) and ranks are numbered 0, 1, 2, ..., N-1.
MPI_COMM_WORLD means all the processes in the MPI application. It is called a communicator and
it provides all information necessary to do message passing. Portable libraries do more with communi-
cators to provide synchronisation protection that most other systems cannot handle.
MPI_Init(&argc, &argv);
MPI_Finalize( );
A.4 What Is The Current Process? How Many Processes Are There?
Typically, a process in a parallel application needs to know who it is (its rank) and how many other
processes exist.
A process finds out its own rank by calling:
MPI_Comm_rank( ):
int myrank;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
The total number of processes is found by calling MPI_Comm_size( ):
int nprocs;
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
Information about the received message is returned in a status variable. The received message tag
is status.MPI_TAG and the rank of the sending process is status.MPI_SOURCE. Another function,
MPI_Get_count(), not used in the sample code, returns the number of data type elements received. It is
used when the number of elements received might be smaller than maxcount.
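A minimal sketch of how MPI_Get_count() is used; the inbuf and maxcount names are assumptions carried over from the style of the sample code:
MPI_Status status;
int received;
/* receive at most maxcount integers from any sender, with any tag */
MPI_Recv(inbuf, maxcount, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);
/* how many elements actually arrived; this may be smaller than maxcount */
MPI_Get_count(&status, MPI_INT, &received);
printf("received %d ints from rank %d with tag %d\n",
       received, status.MPI_SOURCE, status.MPI_TAG);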
Example
while (looping) {
    if (i_have_a_left_neighbor)
        MPI_Recv(inbuf, count, dtype, left, tag, comm, &status);
    if (i_have_a_right_neighbor)
        MPI_Send(outbuf, count, dtype, right, tag, comm);
    do_other_work();
}
MPI also allows both communications to occur simultaneously, as in the following nonblocking
communication example:
while (looping) {
    int nreq = 0;
    if (i_have_a_left_neighbor)
        MPI_Irecv(inbuf, count, dtype, left, tag, comm, &req[nreq++]);
    if (i_have_a_right_neighbor)
        MPI_Isend(outbuf, count, dtype, right, tag, comm, &req[nreq++]);
    MPI_Waitall(nreq, req, statuses);
    do_other_work();
}
The same pattern can also be set up once with persistent communication requests (MPI_Recv_init and
MPI_Send_init), which are then restarted in every iteration with MPI_Startall:
Example
int nreq = 0;
if (i_have_a_left_neighbor)
    MPI_Recv_init(inbuf, count, dtype, left, tag, comm, &req[nreq++]);
if (i_have_a_right_neighbor)
    MPI_Send_init(outbuf, count, dtype, right, tag, comm, &req[nreq++]);
while (looping) {
    MPI_Startall(nreq, req);
    do_some_work();
    MPI_Waitall(nreq, req, statuses);
    do_rest_of_work();
}