gpu_tester

gpu_tester finds all your bad GPUs.

Works on Slurm.

Features:

runs a forward pass on each GPU
checks for GPUs returning incorrect results
checks for GPUs failing due to ECC errors

Roadmap:

sanity check forward speed
sanity check broadcast speed

Install

Create a venv:

python3 -m venv .env
source .env/bin/activate
pip install -U pip

Then:

pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu116
pip install gpu_tester

Python examples

Check out this example to call gpu_tester as a library:

example.py
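As a rough sketch of library use, the pair-based command from this README could be expressed as keyword arguments. This assumes the package is installed and that the function can be imported as `from gpu_tester import gpu_tester`; the argument names come from the API section, while the helper names `pair_test_kwargs` and `run_pair_test` are hypothetical. The site-specific `job_comment` is left out.

```python
def pair_test_kwargs(partition="gpu", exclude=None):
    """Build keyword arguments mirroring the pair-based command line example."""
    kwargs = {
        "nodes": 2,            # test nodes two by two
        "parallel_tests": 50,  # number of pair tests run in parallel
        "partition": partition,
        "test_kind": "ddp",
        "job_timeout": 45,
    }
    if exclude is not None:
        kwargs["exclude"] = exclude
    return kwargs


def run_pair_test(**overrides):
    """Call gpu_tester with the pair-based settings (needs the package and Slurm)."""
    # Imported lazily so this sketch can be loaded without the package installed.
    # Assumption: the import path matches the package name.
    from gpu_tester import gpu_tester

    return gpu_tester(**{**pair_test_kwargs(), **overrides})
```

On a cluster, `run_pair_test(exclude="gpu-st-p4d-24xlarge-[66]")` would then submit the jobs; nothing is submitted when the functions are merely defined.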

Output

Output looks like this:

job succeeded
0 have incorrect results, 1 have gpu errors and 319 succeeded
incorrect results:
[]
gpu errors:
[['gpu_error', 'compute-od-gpu-st-p4d-24xlarge-156', '3']]
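The printed lists look like Python literals, so they can be post-processed with a few lines of code. The sketch below groups failing GPUs by node. It assumes each entry has the form [status, node name, GPU index], which is an interpretation of the sample above and not a documented format; `failing_gpus_by_node` is a hypothetical helper.

```python
import ast
from collections import defaultdict


def failing_gpus_by_node(printed_list):
    """Group GPU indices by node from a printed list of [status, node, gpu] entries."""
    entries = ast.literal_eval(printed_list)  # safely parse the printed Python literal
    by_node = defaultdict(list)
    for _status, node, gpu in entries:
        by_node[node].append(gpu)
    return dict(by_node)


line = "[['gpu_error', 'compute-od-gpu-st-p4d-24xlarge-156', '3']]"
print(failing_gpus_by_node(line))
# {'compute-od-gpu-st-p4d-24xlarge-156': ['3']}
```

The node names collected this way can be passed to the exclude argument on the next run.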

Recommended testing strategy

Pair based strategy

The easiest way to quickly spot broken nodes is the pair-based strategy. It runs many jobs in parallel and finds which nodes can talk to each other. Here is one example:

gpu_tester --nodes 2 --parallel-tests 50 --job_comment laion --partition "gpu" --test_kind "ddp" --job_timeout 45 --exclude 'gpu-st-p4d-24xlarge-[66]'
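The reasoning behind testing pairs can be illustrated with a small sketch. It is not the tool's own logic: the node names and outcomes are made up, and `suspect_nodes` is a hypothetical helper. The idea is that a node that took part in at least one passing pair can communicate, so the suspects are the nodes that only ever appear in failing pairs.

```python
def suspect_nodes(pair_results):
    """Return nodes that never appear in a passing pair.

    pair_results maps (node_a, node_b) to True if that pair's test passed.
    """
    seen = set()
    good = set()
    for (node_a, node_b), passed in pair_results.items():
        seen.update((node_a, node_b))
        if passed:
            good.update((node_a, node_b))
    return sorted(seen - good)


# Hypothetical outcomes gathered over two rounds of pair tests.
results = {
    ("node-1", "node-2"): True,
    ("node-3", "node-4"): False,
    ("node-3", "node-1"): True,
}
print(suspect_nodes(results))  # ['node-4']
```

Here node-3 failed together with node-4 but passed with node-1, so only node-4 remains a suspect.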

All at once strategy

Once you have validated that this works, you may want to try the DDP strategy over all nodes, e.g.:

gpu_tester --nodes 100 --parallel-tests 1 --job_comment laion --partition "gpu" --test_kind "ddp" --job_timeout 300 --exclude 'gpu-st-p4d-24xlarge-[66]'

Simple forward

If you only want to validate the forward functionality of the GPUs and not the communication, you may use:

gpu_tester --nodes 100 --parallel-tests 1 --job_comment laion --partition "gpu" --test_kind "simple_forward" --job_timeout 50 --exclude 'gpu-st-p4d-24xlarge-[66]'

API

This module exposes a single function gpu_tester which takes the same arguments as the command line tool:

cluster the cluster. (default slurm)
job_name slurm job name. (default gpu_tester)
partition slurm partition. (default compute-od-gpu)
gpu_per_node number of GPUs per node. (default 8)
nodes number of gpu nodes. (default 1)
output_folder the output folder. (default None which means current folder / results)
job_timeout job timeout (default 150 seconds)
job_comment optional comment arg given to slurm (default None)
job_account optional account arg given to slurm (default None)
test_kind simple_forward or ddp. simple_forward is a quick forward test. ddp uses PyTorch DDP to check the GPU interconnect (default simple_forward)
parallel_tests number of tests to run in parallel. Recommended to use that with nodes == 2 to test pair by pair (default 1)
nodelist node whitelist, example 'gpu-st-p4d-24xlarge-[66-67]' (default None)
exclude node blacklist, example 'gpu-st-p4d-24xlarge-[66-67]' (default None)
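The nodelist and exclude values use bracketed host ranges such as 'gpu-st-p4d-24xlarge-[66-67]'. Below is a minimal sketch for building such a string from individual node names. It assumes all names share one prefix and end in a number, and it does not preserve zero-padded numbers; `to_hostlist` is a hypothetical helper, not part of gpu_tester.

```python
import re


def to_hostlist(node_names):
    """Compress names like 'prefix-66', 'prefix-67' into 'prefix-[66-67]'."""
    prefix = None
    numbers = set()
    for name in node_names:
        match = re.fullmatch(r"(.*?)(\d+)", name)
        if match is None:
            raise ValueError(f"no numeric suffix in {name!r}")
        if prefix is None:
            prefix = match.group(1)
        elif match.group(1) != prefix:
            raise ValueError("all node names must share the same prefix")
        numbers.add(int(match.group(2)))
    if prefix is None:
        raise ValueError("no node names given")

    # Merge consecutive numbers into (start, end) ranges.
    ordered = sorted(numbers)
    ranges = []
    start = prev = ordered[0]
    for number in ordered[1:]:
        if number != prev + 1:
            ranges.append((start, prev))
            start = number
        prev = number
    ranges.append((start, prev))

    parts = [str(a) if a == b else f"{a}-{b}" for a, b in ranges]
    return f"{prefix}[{','.join(parts)}]"


print(to_hostlist(["gpu-st-p4d-24xlarge-66", "gpu-st-p4d-24xlarge-67"]))
# gpu-st-p4d-24xlarge-[66-67]
```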

For development

Either locally or in Gitpod (run export PIP_USER=false there).

Set up a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

To run tests:

pip install -r requirements-test.txt

Then:

make lint
make test

You can use make black to reformat the code.

Use python -m pytest -x -s -v tests -k "dummy" to run a specific test.
