Nvidia DGX Pod Data Center Reference Design
Reference Architecture
Document Change History
DG-09225-001
Version  Date        Authors                       Description of Change
02       2018-10-09  Louis Capps, Robert Sohigian  Added two new pod elevations
03       2018-10-12  Robert Sohigian               Cleanup
Figure 1 AI workflow (raw data is labeled, then a model is trained and optimized, with an adjust loop feeding back into the model)
Sizing
DL training is highly dependent on data size and model complexity. A single
DGX-1 server can complete a training experiment on a wide variety of AI models in one
day. For example, the autonomous vehicle software team at NVIDIA developing
NVIDIA DRIVENet uses a custom Resnet-18 backbone detection network with
960x480x3 image size and trains at 480 images per second on a DGX-1 server, allowing
training of 120 epochs with 300k images in 21 hours. Internal experience at NVIDIA has
shown that five developers collaborating on one AI model provides the optimal
development time. Each developer typically works on two models in parallel, so the
infrastructure needs to support ten model training experiments within the desired
turnaround time (TAT). Nine DGX-1 servers can provide one-day TAT for model training for
the five-developer workgroup. During schedule-critical times, multi-node scaling can
reduce turnaround time from one day to four hours using eight DGX-1 servers. Once in
production, additional DGX-1 servers will be necessary to support on-going model
refinement and regression testing.
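The sizing arithmetic above is easy to reproduce. The following Python sketch recomputes the DRIVENet turnaround time from the figures in the text; the 65% multi-node scaling efficiency is an assumed value chosen to match the quoted four-hour TAT, not a measured number.

# Back-of-the-envelope turnaround-time (TAT) estimate for DL training.
# Numbers reproduce the DRIVENet example above; the multi-node scaling
# efficiency is an assumption, not a measurement.

def training_hours(images_per_epoch, epochs, images_per_second):
    """Wall-clock hours for one training run at a given throughput."""
    return images_per_epoch * epochs / images_per_second / 3600

# 300k images, 120 epochs, 480 images/s on a single DGX-1 server.
single_node = training_hours(300_000, 120, 480)
print(f"single DGX-1: {single_node:.1f} h")   # ~20.8 h, i.e. "21 hours"

# Schedule-critical case: 8 DGX-1 servers, assumed ~65% scaling efficiency.
multi_node = single_node / (8 * 0.65)
print(f"8 x DGX-1:    {multi_node:.1f} h")    # ~4 h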
NVIDIA AI Software
NVIDIA AI software (Figure 2) running on the DGX POD provides a high-performance
DL training environment for large scale multi-user AI software development teams.
NVIDIA AI software includes the DGX operating system (DGX OS), cluster
management and orchestration tools, NVIDIA libraries and frameworks, workload
schedulers, and optimized containers from the NGC container registry. To provide
additional functionality, the DGX POD management software includes third-party
open-source tools recommended by NVIDIA which have been tested to work on DGX
PODs with the NVIDIA AI software stack. Support for these tools can be obtained
directly through third-party support structures.
The foundation of the NVIDIA AI software stack is the DGX OS, built on an optimized
version of the Ubuntu Linux operating system and tuned specifically for the DGX
hardware.
The DGX OS software includes certified GPU drivers, a network software stack, pre-
configured NFS caching, NVIDIA data center GPU management (DCGM) diagnostic
tools, GPU-enabled container runtime, NVIDIA CUDA® SDK, cuDNN, NCCL and other
NVIDIA libraries, and support for NVIDIA GPUDirect™ technology. The DGX OS
software can be automatically re-installed on demand by the DGX POD management
software.
The management software layer of the DGX POD (Figure 3) is composed of various
services running on the Kubernetes container orchestration framework for fault
tolerance and high availability. Services are provided for network configuration (DHCP)
and fully-automated DGX OS software provisioning over the network (PXE).
Monitoring of the DGX POD utilizes Prometheus for server data collection and storage
in a time-series database. Cluster-wide alerts are configured with Alertmanager, and
DGX POD metrics are displayed using the Grafana web interface.
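As a concrete illustration, a metrics consumer might pull GPU telemetry from Prometheus over its standard HTTP API. This is a minimal sketch, assuming a site-specific Prometheus host name and a DCGM-style metric name; both depend on how the exporters are deployed at a given site.

import json
import urllib.request

# Assumed site-specific endpoint and metric name.
PROMETHEUS = "http://prometheus.example.local:9090"
QUERY = "dcgm_gpu_utilization"

# Instant query against the standard Prometheus HTTP API.
url = f"{PROMETHEUS}/api/v1/query?query={QUERY}"
with urllib.request.urlopen(url) as resp:
    result = json.load(resp)

# Each series carries exporter labels (node, GPU index) plus the latest sample.
for series in result["data"]["result"]:
    labels = series["metric"]
    _, value = series["value"]
    print(labels.get("instance"), labels.get("gpu"), value)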
The DGX POD software allows for dynamic partitioning between the nodes assigned to
Kubernetes and Slurm such that resources can be shifted between the partitions to meet
the current workload demand. A simple user interface allows administrators to move
DGX servers between Kubernetes- and Slurm-managed domains.
Kubernetes serves the dual role of running management services on management nodes
as well as accepting user-defined workloads and is installed on every server in the DGX
POD. Slurm runs only user workloads and is installed on the login node as well as the
DGX compute nodes. The DGX POD software allows individual DGX servers to run jobs
in either Kubernetes or Slurm. Kubernetes provides a high level of flexibility in load
balancing, node failover, and bursting to external Kubernetes clusters, including NGC
public cloud instances. Slurm supports a more static cluster environment but provides
advanced HPC-style batch scheduling features including multi-node scheduling that
some workgroups may require. With the DGX POD software, idle systems can be
moved back and forth as needed between the Kubernetes and Slurm environments.
Future enhancements to Kubernetes are expected to support all DGX POD use cases in a
pure Kubernetes environment.
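For illustration, a user-defined GPU workload can be submitted to the Kubernetes partition with the official Kubernetes Python client. This is a minimal sketch, not the DGX POD tooling itself: the container tag, job name, and training command are placeholders, while nvidia.com/gpu is the standard resource name exposed by the NVIDIA device plugin.

from kubernetes import client, config

config.load_kube_config()  # use the administrator's kubeconfig

# Placeholder container: an NGC TensorFlow image tag and a training
# command that would be replaced by the user's actual workload.
container = client.V1Container(
    name="resnet-train",
    image="nvcr.io/nvidia/tensorflow:18.10-py3",   # assumed tag
    command=["python", "train.py"],                # placeholder
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "8"}             # all 8 GPUs of a DGX-1
    ),
)

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="resnet-train"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container],
                                  restart_policy="Never")
        )
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)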
User workloads on the DGX POD primarily utilize containers from NGC (Figure 4),
which provides researchers and data scientists with easy access to a comprehensive
catalog of GPU-optimized software for DL, HPC applications, and HPC visualization
that take full advantage of the GPUs. The NGC container registry includes NVIDIA
tuned, tested, certified, and maintained containers for the top DL frameworks such as
TensorFlow, PyTorch, and MXNet. NGC also has third-party managed HPC application
containers, and NVIDIA HPC visualization containers.
Management of the NVIDIA AI software on the DGX POD is accomplished with the
Ansible configuration management tool. Ansible roles are used to install Kubernetes on
the management nodes, install additional software on the login and DGX servers,
configure user accounts, configure external storage connections, install the Kubernetes
and Slurm schedulers, and perform day-to-day maintenance tasks such as new
software installation, software updates, and GPU driver upgrades.
The software management stack and documentation are available as an open source
project on GitHub at:
https://github.com/NVIDIA/deepops
There are several factors to consider when planning a DGX POD deployment to
determine if more than one rack is needed per DGX POD. These reference architectures
are designed to utilize high-density racks to provide the most efficient use of costly data
center floorspace and to simplify network cabling. As GPU usage grows, the average
power per server and power per rack continue to increase. However, older data centers
may not yet be able to support the power and cooling densities required; hence the
three-zone design allows the DGX POD components to be installed in up to three
lower-power racks, as shown in the DGX-1 POD (18 kW) section.
A DGX POD is designed to fit within a standard-height 42 RU data center rack. A taller
rack can be used to include redundant networking switches, a management switch, and
login servers. This reference architecture uses an additional utility rack for login and
management servers and has been sized and tested with up to six DGX PODs. Larger
configurations of DGX PODs can be defined by an NVIDIA solution architect.
A primary 10 GbE (minimum) network switch is used to connect all servers in the DGX
POD and to provide access to a data center network. The DGX POD has been tested with
an Arista switch with 48 x 10 GbE ports and 4 x 40 GbE uplinks. VLAN capabilities of
the networking hardware are used to allow the out-of-band management network to run
independently from the data network, while sharing the same physical hardware.
Alternatively, a separate 1 GbE management switch may be used. While not included in
the reference architecture, a second 10 GbE network switch can be used for redundancy
and high availability. In addition to Arista, NVIDIA is working with other networking
vendors who plan to release switch reference designs compatible with the DGX POD.
A 36-port Mellanox 100 Gbps switch is configured to provide four 100 Gbps InfiniBand
connections to the DGX servers in the rack. This provides the best possible scalability for
multi-node jobs. In the event of switch failure, multi-node jobs can fall back to use the
10 GbE switch for communications. The Mellanox switch can also be configured in
100 GbE mode for organizations that prefer to use Ethernet networking. Alternatively, by
configuring two 100 Gbps ports per DGX server, the Mellanox switch can also be used
by the storage servers.
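As a sketch of the fallback behavior described above, a job launcher might select the NCCL transport through environment variables. The interface and HCA names below are assumptions; verify them on the DGX servers (for example with ibdev2netdev) before use.

import os
import subprocess

USE_INFINIBAND = True  # set False if the Mellanox switch is unavailable

env = dict(os.environ)
if USE_INFINIBAND:
    env["NCCL_IB_HCA"] = "mlx5"            # assumed HCA name prefix
else:
    env["NCCL_IB_DISABLE"] = "1"           # disable the IB verbs transport
    env["NCCL_SOCKET_IFNAME"] = "enp1s0"   # assumed 10 GbE interface name

# Launch the (placeholder) training entry point with the chosen transport.
subprocess.run(["python", "train.py"], env=env, check=True)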
With the DGX family of servers, AI and HPC workloads are fusing into a unified
architecture. For organizations that want to utilize multiple DGX PODs to run
cluster-wide jobs, a core InfiniBand switch is configured in the utility rack in conjunction
with a second 36-port Mellanox switch.
Storage architecture is important for optimized DL training performance. The DGX POD
uses a hierarchical design with multiple levels of cache storage using the DGX SSD and
additional cache storage servers in the DGX POD. Long-term storage of raw data can be
located on a wide variety of storage devices outside of the DGX POD, either on-premises
or in public clouds.
The DGX POD baseline storage architecture consists of in-rack NFS storage servers used
in conjunction with the local DGX SSD cache. Additional storage performance may be
obtained by using one of the distributed filesystems listed at this link:
https://docs.nvidia.com/deeplearning/dgx/bp-dgx/index.html#storage_parallel
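When evaluating the cache hierarchy, a quick sequential-read probe helps confirm that each tier delivers the expected bandwidth. The sketch below assumes hypothetical dataset paths on the local DGX SSD RAID and on an in-rack NFS mount; it measures a single stream, not aggregate throughput.

import time

def read_throughput_gbs(path, block=64 * 1024 * 1024, max_bytes=8 * 2**30):
    """Sequentially read up to max_bytes from path; return GB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while total < max_bytes:
            chunk = f.read(block)
            if not chunk:
                break
            total += len(chunk)
    return total / (time.perf_counter() - start) / 1e9

# Assumed paths: a copy of the dataset on the DGX SSD RAID and on NFS.
for tier in ("/raid/dataset.bin", "/mnt/nfs/dataset.bin"):
    print(tier, f"{read_throughput_gbs(tier):.2f} GB/s")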
The DGX POD is also designed to be compatible with several third-party storage
solutions, many of which are documented at this link:
https://www.nvidia.com/en-us/data-center/dgx-reference-architecture/
• Login server which allows users to log in to the cluster and launch Slurm batch jobs¹. (1 RU)
• Three management servers running Kubernetes server components and other DGX POD management software². (3 x 1 RU = 3 RU)
• Optional multi-POD 10 GbE storage and management network switches (2 RU)
• Optional multi-POD clustering using a Mellanox 216-port EDR InfiniBand switch. (12 RU)
¹ To support many users, the login server should have at least two high-end CPUs, at least 1 TB of memory, two links to the 100 Gbps network, and redundant fans and power supplies.
² These servers can be lower performance than the login server and can be configured with mid-range CPUs and less memory (128 to 256 GB).
Additional DGX site requirements are detailed in the NVIDIA DGX Site Preparation Guide
but important items to consider include:
Design Guidelines

Rack (all configurations):
• Dimensions of 1200 mm depth x 700 mm width
• Structured cabling pathways per TIA 942 standard
• 35 kW racks (DGX-2 server or DGX-1 server): support 3000 lbs. of static load
• 18 kW racks (DGX-1 server): support 1200 lbs. of static load

Power (35 kW racks, DGX-2 server or DGX-1 server):
• North America: A/B power feeds, each three-phase 400V/60A/33.2 kW (or three-phase 208V/60A/17.3 kW with additional considerations for redundancy as required)
• International: A/B power feeds, each 380/400/415V, 32A, three-phase, 21-23 kW each

Power (18 kW racks, DGX-1 server):
• North America: A/B power feeds, each three-phase 208V/60A/17.3 kW
• International: A/B power feeds, each 380/400/415V, 32A, three-phase, 21-23 kW each

Cooling (all configurations): via rack cooling door or data center hot/cold aisle air containment

Table 1 Rack, cooling, and power considerations for DGX POD racks
The figures in DGX POD Design show the server components and networking of the
DGX POD. Management servers, login servers, DGX compute servers, and storage
communicate over a 1 or 10 Gbps Ethernet network, while login servers, DGX compute
servers and optionally storage can also communicate over high-speed 100 Gbps
InfiniBand or Ethernet. The DGX compute servers shown here are running both
Kubernetes and Slurm to handle varying user workloads.
Whether deploying a single DGX POD or multiple DGX PODs, it is best to use tools that
manage all the nodes collectively. Use the NVIDIA AI software stack to provide an
end-to-end solution from server OS installation to user job management.
Installing the NVIDIA AI software stack on DGX PODs requires meeting a basic set of
hardware and software prerequisites, setting local configuration parameters, and
deploying the cluster following step-by-step instructions. Optional installation steps
include configuring multiple job schedulers and integrating with enterprise
authentication and storage systems.
Once the DGX PODs are powered on and configured, verify all the components are
working correctly. This can be an involved process. During this operational checkout,
the following should be verified:
• Compute hardware, networking, and storage are operating as expected
• Power distribution works properly under maximum possible load
• Additional site tests as may be required for the data center
The first step to validate DGX server installation is performed using the diagnostic
feature of DCGM. This tests many different GPU functions including memory, PCIe bus,
SM units, memory bandwidth, and NVLink. DCGM will draw close to maximum server
power during operation and thus is a good stress test of server power and cooling. By
running this test simultaneously across all nodes using Kubernetes, it will stress the
rack-level power, cooling, and airflow.
The complete DCGM test takes approximately ten minutes to run on each DGX server.
This test should be run back to back for several hours to stress-test the DGX POD and
ensure correct completion of each iteration.
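A burn-in loop of this kind can be scripted in a few lines. The sketch below assumes the DCGM host engine is running on each DGX server and simply repeats the full diagnostic (dcgmi diag -r 3) until a failure or a time budget is reached.

import subprocess
import time

BURN_IN_HOURS = 4  # assumed time budget; adjust to the checkout plan

deadline = time.time() + BURN_IN_HOURS * 3600
iteration = 0
while time.time() < deadline:
    iteration += 1
    # Run level 3 is the full (roughly ten-minute) DCGM diagnostic.
    result = subprocess.run(["dcgmi", "diag", "-r", "3"],
                            capture_output=True, text=True)
    print(f"iteration {iteration}: exit code {result.returncode}")
    if result.returncode != 0:
        print(result.stdout)
        raise SystemExit(f"DCGM diagnostic failed on iteration {iteration}")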
To test DL training, the NGC containers have built-in example scripts to check several
different families of networks including image classification and language modeling via
LSTMs. For example, the NGC TensorFlow container includes scripts to test the major
ImageNet networks (ResNet-50, Inception, VGG-16, etc.) and allows for scalable server
testing with real data or in synthetic mode.
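For example, a checkout script might drive the ResNet-50 benchmark in synthetic mode from inside the NGC TensorFlow container. The script path and flags below follow the nvidia-examples layout of 2018-era containers and should be treated as assumptions; check the README inside the container being used.

import subprocess

# Assumed script location and flags inside the NGC TensorFlow container;
# omitting --data_dir is assumed to select synthetic input data.
cmd = [
    "python", "/workspace/nvidia-examples/cnn/resnet.py",
    "--layers", "50",          # ResNet-50
    "--batch_size", "256",
    "--precision", "fp16",
]
subprocess.run(cmd, check=True)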
In addition, test any local applications that are planned to be run on the DGX POD. All
performance results measured during server installation should be saved and used
during routine retesting of the DGX POD to verify performance consistency as servers
are modified and updated.
Summary
The DGX POD reference architecture provides organizations a blueprint to simplify
deployment of GPU computing infrastructure to support large-scale AI software
development efforts. A single DGX POD supporting small workgroups of AI developers
can be grown into an infrastructure supporting thousands of users. The DGX POD
reference architecture is based on the NVIDIA DGX SATURNV AI supercomputer,
which has 1000 DGX-1 servers and powers autonomous vehicle software and internal AI
R&D across NVIDIA research, graphics, HPC, and robotics.
The DGX POD has been designed and tested by using specific storage and networking
partners. In addition to those mentioned in this paper, NVIDIA is working with
additional storage and networking vendors who plan to publish DGX POD-compatible
reference architectures using their specific products.
Because most installations of a DGX POD will require small differences such as cable
length to integrate into a data center, NVIDIA does not sell DGX POD as a single unit.
Work with an authorized NVIDIA Partner Network (NPN) reseller to configure and
purchase a DGX POD.
Finally, this white paper is meant to be a high-level overview and is not intended to be a
step-by-step installation guide. Customers should work with an NPN provider to
customize an installation plan for their organization.
Trademarks
NVIDIA and the NVIDIA logo are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other
countries. Other company and product names may be trademarks of the respective companies with which they are
associated.
Copyright
© 2018 NVIDIA Corporation. All rights reserved.
www.nvidia.com