GCP A3 Mega clusters
This example shows how to set up a GCP A3 Mega cluster with GPUDirect-TCPXO-optimized NCCL communication and run NCCL Tests on it using dstack.
Overview
GCP's A3 Mega instances are 8xH100 VMs with up to 1,800 Gbps of network bandwidth, the highest among GCP H100 instances. To get that network performance, you need to set up GPUDirect-TCPXO, the GCP technology for GPU RDMA over TCP. This involves:
- Setting up eight extra data NICs on every node, each NIC in a separate VPC.
- Building a VM image with GPUDirect-TCPXO support.
- Launching an RXDM service container.
- Installing the GPUDirect-TCPXO NCCL plugin.
dstack hides most of this setup complexity and provides optimized A3 Mega clusters out of the box.
Configure GCP backend
First, configure the gcp backend for GPUDirect-TCPXO support. The backend configuration lives in the dstack server config file (~/.dstack/server/config.yml); you need to specify eight extra_vpcs to use for the data NICs:
projects:
- name: main
  backends:
  - type: gcp
    project_id: $MYPROJECT # Replace $MYPROJECT
    extra_vpcs:
    - dstack-gpu-data-net-1
    - dstack-gpu-data-net-2
    - dstack-gpu-data-net-3
    - dstack-gpu-data-net-4
    - dstack-gpu-data-net-5
    - dstack-gpu-data-net-6
    - dstack-gpu-data-net-7
    - dstack-gpu-data-net-8
    regions: [europe-west4]
    creds:
      type: default
Custom VPC
If you specify a non-default primary VPC, ensure it has a firewall rule allowing all traffic within the VPC. This is needed for MPI and NCCL to work. The default VPC already permits traffic within the VPC.
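For reference, a rule like the one below would allow internal traffic in a custom VPC. The network name and source range are placeholders; substitute your VPC's name and its subnet ranges:
gcloud compute firewall-rules create my-vpc-allow-internal \
  --network=my-vpc \
  --action=ALLOW \
  --rules=tcp:0-65535,udp:0-65535,icmp \
  --source-ranges=10.0.0.0/8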
Create extra VPCs
Create the VPC networks for GPUDirect in your project, each with a subnet and a firewall rule:
# Specify the region where you intend to deploy the cluster
REGION="europe-west4"
for N in $(seq 1 8); do
  gcloud compute networks create dstack-gpu-data-net-$N \
    --subnet-mode=custom \
    --mtu=8244
  gcloud compute networks subnets create dstack-gpu-data-sub-$N \
    --network=dstack-gpu-data-net-$N \
    --region=$REGION \
    --range=192.168.$N.0/24
  gcloud compute firewall-rules create dstack-gpu-data-internal-$N \
    --network=dstack-gpu-data-net-$N \
    --action=ALLOW \
    --rules=tcp:0-65535,udp:0-65535,icmp \
    --source-ranges=192.168.0.0/16
done
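Optionally, verify that the networks and subnets were created. The commands below are one way to do it; the filter expressions assume the naming used above:
gcloud compute networks list --filter="name~^dstack-gpu-data-net"
gcloud compute networks subnets list --regions=$REGION --filter="name~^dstack-gpu-data-sub"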
Create A3 Mega fleet
Once you've configured the gcp backend, create the fleet configuration:
type: fleet
name: a3mega-cluster
nodes: 2
placement: cluster
instance_types:
- a3-megagpu-8g
spot_policy: auto
and apply the configuration:
$ dstack apply -f examples/misc/a3mega-clusters/fleet.dstack.yml
Project main
User admin
Configuration examples/misc/a3mega-clusters/fleet.dstack.yml
Type fleet
Fleet type cloud
Nodes 2
Placement cluster
Resources 2..xCPU, 8GB.., 100GB.. (disk)
Spot policy auto
# BACKEND REGION INSTANCE RESOURCES SPOT PRICE
1 gcp europe-west4 a3-megagpu-8g 208xCPU, 1872GB, yes $22.1525
8xH100 (80GB),
100.0GB (disk)
2 gcp europe-west4 a3-megagpu-8g 208xCPU, 1872GB, no $64.2718
8xH100 (80GB),
100.0GB (disk)
Fleet a3mega-cluster does not exist yet.
Create the fleet? [y/n]: y
Provisioning...
---> 100%
dstack will provision two A3 Mega nodes with GPUDirect-TCPXO configured.
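You can then inspect the fleet and its instances from the CLI, for example with the dstack fleet command:
$ dstack fleet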
Run NCCL Tests with GPUDirect-TCPXO support
Once the nodes are provisioned, let's test the network by running NCCL Tests:
$ dstack apply -f examples/misc/a3mega-clusters/nccl-tests.dstack.yml
nccl-tests provisioning completed (running)
nThread 1 nGpus 1 minBytes 8388608 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 200 agg iters: 1 validation: 0 graph: 0
out-of-place in-place
size count type redop root time algbw busbw #wrong time algbw busbw #wrong
(B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8388608 131072 float none -1 166.6 50.34 47.19 N/A 164.1 51.11 47.92 N/A
16777216 262144 float none -1 204.6 82.01 76.89 N/A 203.8 82.30 77.16 N/A
33554432 524288 float none -1 284.0 118.17 110.78 N/A 281.7 119.12 111.67 N/A
67108864 1048576 float none -1 447.4 150.00 140.62 N/A 443.5 151.31 141.86 N/A
134217728 2097152 float none -1 808.3 166.05 155.67 N/A 801.9 167.38 156.92 N/A
268435456 4194304 float none -1 1522.1 176.36 165.34 N/A 1518.7 176.76 165.71 N/A
536870912 8388608 float none -1 2892.3 185.62 174.02 N/A 2894.4 185.49 173.89 N/A
1073741824 16777216 float none -1 5532.7 194.07 181.94 N/A 5530.7 194.14 182.01 N/A
2147483648 33554432 float none -1 10863 197.69 185.34 N/A 10837 198.17 185.78 N/A
4294967296 67108864 float none -1 21481 199.94 187.45 N/A 21466 200.08 187.58 N/A
8589934592 134217728 float none -1 42713 201.11 188.54 N/A 42701 201.16 188.59 N/A
Out of bounds values : 0 OK
Avg bus bandwidth : 146.948
Done
Run NCCL workloads with GPUDirect-TCPXO support
To take full advantage of GPUDirect-TCPXO in your workloads, you need to set up the NCCL environment variables properly. This can be done with the following commands in your run configuration:
type: task
nodes: 2
commands:
  - |
    NCCL_LIB_DIR="/var/lib/tcpxo/lib64"
    source ${NCCL_LIB_DIR}/nccl-env-profile-ll128.sh
    export NCCL_FASTRAK_CTRL_DEV=enp0s12
    export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0
    export NCCL_SOCKET_IFNAME=enp0s12
    export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY="/dev/aperture_devices"
    export LD_LIBRARY_PATH="${NCCL_LIB_DIR}:${LD_LIBRARY_PATH}"
    # run NCCL
resources:
  # Allocate some shared memory for NCCL
  shm_size: 16GB
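As an illustration, the # run NCCL placeholder above could be replaced with a launcher that reads the node topology from the environment variables dstack sets for distributed tasks (DSTACK_NODES_NUM, DSTACK_GPUS_PER_NODE, DSTACK_NODE_RANK, DSTACK_MASTER_NODE_IP). The sketch below assumes a PyTorch workload with a hypothetical train.py entrypoint; any NCCL-based launcher works the same way:
torchrun \
  --nnodes=$DSTACK_NODES_NUM \
  --nproc_per_node=$DSTACK_GPUS_PER_NODE \
  --node_rank=$DSTACK_NODE_RANK \
  --master_addr=$DSTACK_MASTER_NODE_IP \
  --master_port=29500 \
  train.py  # hypothetical training script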
Source code
The source code for this example can be found in examples/misc/a3mega-clusters.