A framework to streamline developing for CUDA, ROCm, and oneAPI at the same time. There is a recorded video about it on the SHARCNET YouTube Channel. Updated slides from that video, with more accurate benchmark results, are included in the `doc` folder.
- Supports four target APIs:
  - CUDA
  - oneAPI
  - ROCm
  - STL Parallel Algorithms
- All configuration is done automatically by CMake
- Supports unit testing with Catch2
- Supports benchmarking with Google Benchmark
- Two sample algorithms (kernel and Thrust/oneDPL) are already included
You need:
- A C++ compiler supporting the C++17 standard (e.g. gcc 9.3)
- CMake version 3.21 or higher

And the following optional third-party libraries:
- Catch2 v3.1 or higher for unit testing
- Google Benchmark for benchmarks
The CMake script is configured so that if it cannot find the optional third-party libraries, it tries to fetch and build them automatically. So there is no need to do anything if they are missing, but you need an internet connection for that to work.
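That fallback is the usual find-then-fetch pattern in CMake. Here is a minimal sketch of the idea for Catch2 (the repository URL is Catch2's real one, but the exact tag, options, and structure of the project's actual script may differ):

```cmake
include(FetchContent)

# Prefer an installed Catch2; fall back to downloading and building it
find_package(Catch2 3.1 QUIET)
if(NOT Catch2_FOUND)
  FetchContent_Declare(Catch2
    GIT_REPOSITORY https://github.com/catchorg/Catch2.git
    GIT_TAG        v3.1.0   # illustrative tag; use the version you need
  )
  FetchContent_MakeAvailable(Catch2)
endif()
```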
On the Alliance clusters, you can activate the above environment with the following module command:

```shell
module load cmake googlebenchmark catch2
```
Parallel STL requires a TBB version between 2018 and 2020 to work.
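With GCC's libstdc++, the C++17 parallel algorithms dispatch through TBB, which is why that dependency exists. A hedged sketch of what finding and linking TBB looks like in CMake (the target name here is illustrative, not the project's actual script):

```cmake
# TBB backs std::execution::par with libstdc++ (TBB 2018-2020)
find_package(TBB REQUIRED)

add_executable(my_stl_demo main.cpp)          # illustrative target
target_compile_features(my_stl_demo PRIVATE cxx_std_17)
target_link_libraries(my_stl_demo PRIVATE TBB::tbb)
```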
```shell
git clone https://github.com/arminms/one4all.git
cd one4all
cmake -S . -B build
cmake --build build -j
```
Requires CUDA version 11 or higher.

```shell
git clone https://github.com/arminms/one4all.git
cd one4all
cmake -S . -B build-cuda -DONE4ALL_TARGET_API=cuda
cmake --build build-cuda -j
```
Requires ROCm 5.4.3 or higher.

```shell
git clone https://github.com/arminms/one4all.git
cd one4all
CXX=hipcc cmake -S . -B build-rocm -DONE4ALL_TARGET_API=rocm
cmake --build build-rocm -j
```
Requires oneAPI 2023.0.0 or higher.

```shell
git clone https://github.com/arminms/one4all.git
cd one4all
CXX=icpx cmake -S . -B build-oneapi -DONE4ALL_TARGET_API=oneapi
cmake --build build-oneapi -j
```
Requires the Codeplay plugins for NVIDIA and/or AMD GPUs to be installed.

```shell
git clone https://github.com/arminms/one4all.git
cd one4all
CXX=clang++ cmake -S . -B build-oneapi -DONE4ALL_TARGET_API=oneapi
cmake --build build-oneapi -j
```
```shell
cd build # or build-cuda / build-rocm / build-oneapi
ctest
```
To select the target device for the oneAPI version, set the `ONEAPI_DEVICE_SELECTOR` or `SYCL_DEVICE_FILTER` environment variable first:

```shell
# oneAPI 2023.1.0 or higher
ONEAPI_DEVICE_SELECTOR=[level_zero|opencl|cuda|hip|esimd_emulator|*][:cpu|gpu|fpga|*]
# older versions of oneAPI
SYCL_DEVICE_FILTER=[level_zero|opencl|cuda|hip|esimd_emulator|*][:cpu|gpu|acc|*]
```

You can find the complete syntax here. Here is an example of running the oneAPI version on NVIDIA GPUs:

```shell
ONEAPI_DEVICE_SELECTOR=cuda build-oneapi/test/unit_tests
```
```shell
cd build # or build-cuda / build-rocm / build-oneapi
perf/benchmarks --benchmark_counters_tabular=true
```

Selecting targets for the oneAPI version works the same way as described above for the unit tests.
Here are some updated benchmark results on the Alliance's clusters (more accurate than the preliminary results shown in the YouTube video, owing to the switch to `cudaEvent*()` / `hipEvent*()` / SYCL queue profiling for measuring performance). Output files are included in the `perf/results` folder.
Using AMD EPYC 7543 x2 2.8 GHz (64C / 128T) CPU:

| API – Algorithm | float | double |
|---|---|---|
| Parallel STL – * | 1.00 | 1.00 |
| oneAPI – generate_table() | 0.46 | 0.27 |
| oneAPI – scale_table() | 1.61 | 1.36 |
Using NVIDIA A100-SXM4-40GB GPU:

| API – Algorithm | float | double |
|---|---|---|
| CUDA – * | 1.00 | 1.00 |
| oneAPI – generate_table() | 1.00 | 1.02 |
| oneAPI – scale_table() | 1.01 | 1.02 |
Using AMD Instinct MI210 GPU:

| API – Algorithm | float | double |
|---|---|---|
| ROCm – * | 1.00 | 1.00 |
| oneAPI – generate_table() | 1.04 | 1.08 |
| oneAPI – scale_table() | 0.91 | 0.79 |
| GPU – Algorithm | float | double |
|---|---|---|
| A100 – * | 1.00 | 1.00 |
| MI210 – generate_table() | 0.81 | 0.37 |
| MI210 – scale_table() | 0.68 | 0.87 |
Select `Fork` from the top right part of this page. You may choose a different name for your repository. In that case, you can also find/replace `one4all` with `<your-project>` in all files (case-sensitive) and `ONE4ALL_TARGET_API` with `<YOUR-PROJECT>_TARGET_API` in all `CMakeLists.txt` files. Finally, rename the `include/one4all` folder to `include/<your-project>`.
You can add your new algorithms to `include/<your-project>/algorithm`, along with unit tests and benchmarks in the corresponding `test/unit_test/unit_tests_*.cpp` and `perf/benchmark/benchmarks_*.cpp` files, respectively.

Later, if you decide to add a program, you can make a `src` folder and add the source code (e.g. `my_prog_*.cpp`) along with the following `CMakeLists.txt` into it:
```cmake
## defining target for my_prog
#
add_executable(my_prog
  my_prog_${<YOUR-PROJECT>_TARGET_API}.$<IF:$<STREQUAL:${<YOUR-PROJECT>_TARGET_API},cuda>,cu,cpp>
)

## defining link libraries for my_prog
#
target_link_libraries(my_prog PRIVATE
  ${PROJECT_NAME}::${<YOUR-PROJECT>_TARGET_API}
)

## installing my_prog
#
include(GNUInstallDirs)
install(TARGETS my_prog RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
```
Don't forget to replace `<YOUR-PROJECT>` with the name of your project in the above file. Finally, add `add_subdirectory(src)` at the end of the main `CMakeLists.txt` file.
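For context, the tail of the main `CMakeLists.txt` would then look roughly like this (the surrounding `add_subdirectory` lines are illustrative; your project's actual list of subdirectories may differ):

```cmake
# ... existing configuration above ...
add_subdirectory(test)   # illustrative: existing subdirectories
add_subdirectory(perf)
add_subdirectory(src)    # newly added program folder
```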