GPU tester finds all your bad GPUs.
Works on Slurm.
Features:
- runs a forward pass on each GPU
- checks for GPUs returning incorrect results
- checks for GPUs failing due to ECC errors
Roadmap:
- sanity check forward speed
- sanity check broadcast speed
Create a venv:
python3 -m venv .env
source .env/bin/activate
pip install -U pip
Then install torch and gpu_tester:
pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu116
pip install gpu_tester
Check out these examples to call this as a lib:
Output looks like this:
job succeeded
0 have incorrect results, 1 have gpu errors and 319 succeeded
incorrect results:
[]
gpu errors:
[['gpu_error', 'compute-od-gpu-st-p4d-24xlarge-156', '3']]
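The reported lists can be turned into a host list for --exclude on the next run. The following is a minimal sketch, assuming each entry has the form [status, hostname, GPU index] as in the sample output above. The helper name bad_hosts is hypothetical and is not part of gpu_tester.

```python
def bad_hosts(entries):
    """Return the sorted, de-duplicated hostnames found in a list of report entries.

    Assumption: each entry looks like ['gpu_error', '<hostname>', '<gpu index>'],
    mirroring the sample output; the real format may differ between versions.
    """
    return sorted({entry[1] for entry in entries})


# Sample data copied from the output shown above.
gpu_errors = [["gpu_error", "compute-od-gpu-st-p4d-24xlarge-156", "3"]]
incorrect_results = []

hosts = bad_hosts(gpu_errors + incorrect_results)
# Slurm accepts a comma-separated host list for --exclude.
exclude_arg = ",".join(hosts)
print(exclude_arg)
```

The resulting string could then be passed as the exclude argument of a later run, so that known-bad nodes are skipped.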
The easiest way to quickly spot broken nodes is the pair-based strategy. It runs many two-node jobs in parallel and finds which nodes can talk to each other. Here is one example:
gpu_tester --nodes 2 --parallel-tests 50 --job_comment laion --partition "gpu" --test_kind "ddp" --job_timeout 45 --exclude 'gpu-st-p4d-24xlarge-[66]'
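To illustrate why pair tests help localize a fault: a healthy node should pass with at least some partners, while a broken node fails with every partner. The sketch below counts failures per node from a hand-written list of pair outcomes. The tuple format (node_a, node_b, passed), the node names and the function suspect_nodes are all hypothetical; this is not the tool's actual output or logic.

```python
from collections import Counter


def suspect_nodes(pair_results):
    """Return nodes that failed in every pair test they took part in.

    pair_results: iterable of (node_a, node_b, passed) tuples (hypothetical format).
    """
    runs = Counter()
    failures = Counter()
    for node_a, node_b, passed in pair_results:
        for node in (node_a, node_b):
            runs[node] += 1
            if not passed:
                failures[node] += 1
    return sorted(node for node in runs if failures[node] == runs[node])


# Made-up outcomes: node-3 fails with both partners, the other two pass together.
results = [
    ("node-1", "node-2", True),
    ("node-1", "node-3", False),
    ("node-2", "node-3", False),
]
print(suspect_nodes(results))  # ['node-3']
```

With a single round of disjoint pairs, both members of a failing pair would be flagged; repeating the run with different pairings narrows the list down.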
Once you have validated that this works, you may want to try the DDP strategy over all nodes, e.g.:
gpu_tester --nodes 100 --parallel-tests 1 --job_comment laion --partition "gpu" --test_kind "ddp" --job_timeout 300 --exclude 'gpu-st-p4d-24xlarge-[66]'
If you only want to validate the forward functionality of the GPUs and not the communication, you may use:
gpu_tester --nodes 100 --parallel-tests 1 --job_comment laion --partition "gpu" --test_kind "simple_forward" --job_timeout 50 --exclude 'gpu-st-p4d-24xlarge-[66]'
This module exposes a single function, gpu_tester, which takes the same arguments as the command line tool:
- cluster the cluster. (default slurm)
- job_name slurm job name. (default gpu_tester)
- partition slurm partition. (default compute-od-gpu)
- gpu_per_node number of GPUs per node. (default 8)
- nodes number of gpu nodes. (default 1)
- output_folder the output folder. (default None which means current folder / results)
- job_timeout job timeout (default 150 seconds)
- job_comment optional comment arg given to slurm (default None)
- job_account optional account arg given to slurm (default None)
- test_kind simple_forward or ddp. simple_forward is a quick forward test. ddp uses PyTorch DDP to check the GPU interconnect. (default simple_forward)
- parallel_tests number of tests to run in parallel. Recommended to use with nodes == 2 to test pair by pair. (default 1)
- nodelist node whitelist, example 'gpu-st-p4d-24xlarge-[66-67]' (default None)
- exclude node blacklist, example 'gpu-st-p4d-24xlarge-[66-67]' (default None)
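As a rough illustration of library use, the pair-based command line example can be expressed as keyword arguments. This sketch assumes the function is importable with from gpu_tester import gpu_tester and that the keyword names match the list above; both are assumptions based on this README, so check them against the installed package. The wrapper run_pair_test is a hypothetical name, and the import sits inside it because the call submits Slurm jobs.

```python
# Keyword arguments mirroring the pair-based CLI example; names follow the list above.
PAIR_TEST_KWARGS = dict(
    nodes=2,
    parallel_tests=50,
    job_comment="laion",
    partition="gpu",
    test_kind="ddp",
    job_timeout=45,
    exclude="gpu-st-p4d-24xlarge-[66]",
)


def run_pair_test():
    """Submit the pair-based test (hypothetical wrapper, not part of gpu_tester)."""
    # Assumed import path; needs gpu_tester installed and access to a Slurm cluster.
    from gpu_tester import gpu_tester

    return gpu_tester(**PAIR_TEST_KWARGS)
```

Calling run_pair_test() from a machine that can submit Slurm jobs would then be equivalent to the command line invocation, under the assumptions stated above.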
For development, work either locally or in Gitpod (run export PIP_USER=false there).
Set up a virtualenv:
python3 -m venv .env
source .env/bin/activate
pip install -e .
To run the tests:
pip install -r requirements-test.txt
Then:
make lint
make test
You can use make black to reformat the code.
Use python -m pytest -x -s -v tests -k "dummy" to run a specific test.