A framework to streamline developing for CUDA, ROCm, and oneAPI at the same time. There is a recorded video about it on the SHARCNET YouTube Channel. Updated slides from that video, with more accurate benchmark results, are included in the `doc` folder.
- Supports four target APIs:
  - CUDA
  - oneAPI
  - ROCm
  - STL Parallel Algorithms
- All configuration is done automatically by CMake
- Supports unit testing with Catch2
- Supports benchmarking with Google Benchmark
- Two sample algorithms (kernel and Thrust/oneDPL) are already included
You need:
- A C++ compiler supporting the C++17 standard (e.g. gcc 9.3)
- CMake version 3.21 or higher

And the following optional third-party libraries:
- Catch2 v3.1 or higher for unit testing
- Google Benchmark for benchmarks
The CMake script is configured so that if it cannot find the optional third-party libraries, it tries to fetch and build them automatically. So there is no need to do anything if they are missing, but you need an internet connection for that to work.
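That fallback is the usual find-then-fetch pattern in CMake. Here is a minimal sketch of the idea for Catch2 (the repository URL is Catch2's real one, but the exact tag, options, and structure of the project's actual script may differ):

```cmake
include(FetchContent)

# Prefer an installed Catch2; fall back to downloading and building it
find_package(Catch2 3.1 QUIET)
if(NOT Catch2_FOUND)
  FetchContent_Declare(Catch2
    GIT_REPOSITORY https://github.com/catchorg/Catch2.git
    GIT_TAG        v3.1.0   # illustrative tag; use the version you need
  )
  FetchContent_MakeAvailable(Catch2)
endif()
```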
On the Alliance clusters, you can activate the above environment with the following module command:

```shell
module load cmake googlebenchmark catch2
```
Parallel STL requires a TBB version between 2018 and 2020 to work.
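With GCC's libstdc++, the C++17 parallel algorithms dispatch through TBB, which is why that dependency exists. A hedged sketch of what finding and linking TBB looks like in CMake (the target name here is illustrative, not the project's actual script):

```cmake
# TBB backs std::execution::par with libstdc++ (TBB 2018-2020)
find_package(TBB REQUIRED)

add_executable(my_stl_demo main.cpp)          # illustrative target
target_compile_features(my_stl_demo PRIVATE cxx_std_17)
target_link_libraries(my_stl_demo PRIVATE TBB::tbb)
```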
```shell
git clone https://github.com/arminms/one4all.git
cd one4all
cmake -S . -B build
cmake --build build -j
```
Requires CUDA version 11 or higher.

```shell
git clone https://github.com/arminms/one4all.git
cd one4all
cmake -S . -B build-cuda -DONE4ALL_TARGET_API=cuda
cmake --build build-cuda -j
```
Requires ROCm 5.4.3 or higher.

```shell
git clone https://github.com/arminms/one4all.git
cd one4all
CXX=hipcc cmake -S . -B build-rocm -DONE4ALL_TARGET_API=rocm
cmake --build build-rocm -j
```
Requires oneAPI 2023.0.0 or higher.

```shell
git clone https://github.com/arminms/one4all.git
cd one4all
CXX=icpx cmake -S . -B build-oneapi -DONE4ALL_TARGET_API=oneapi
cmake --build build-oneapi -j
```
Requires the Codeplay plugins for NVIDIA and/or AMD GPUs to be installed.

```shell
git clone https://github.com/arminms/one4all.git
cd one4all
CXX=clang++ cmake -S . -B build-oneapi -DONE4ALL_TARGET_API=oneapi
cmake --build build-oneapi -j
```
```shell
cd build # or build-cuda / build-rocm / build-oneapi
ctest
```
To select the target device for the oneAPI version, set the `ONEAPI_DEVICE_SELECTOR` or `SYCL_DEVICE_FILTER` environment variable first:

```shell
# oneAPI 2023.1.0 or higher
ONEAPI_DEVICE_SELECTOR=[level_zero|opencl|cuda|hip|esimd_emulator|*][:cpu|gpu|fpga|*]
# older versions of oneAPI
SYCL_DEVICE_FILTER=[level_zero|opencl|cuda|hip|esimd_emulator|*][:cpu|gpu|acc|*]
```

You can find the complete syntax here. Here is an example of running the oneAPI version on NVIDIA GPUs:

```shell
ONEAPI_DEVICE_SELECTOR=cuda build-oneapi/test/unit_tests
```
```shell
cd build # or build-cuda / build-rocm / build-oneapi
perf/benchmarks --benchmark_counters_tabular=true
```

Selecting targets for the oneAPI version works the same way as described above for the unit tests.
Here are some updated benchmark results on the Alliance's clusters (more accurate than the preliminary results shown in the YouTube video, owing to the switch to `cudaEvent*()` / `hipEvent*()` / SYCL queue profiling for measuring performance). Output files are included in the `perf/results` folder.
Using AMD EPYC 7543 x2 2.8 GHz (64C / 128T) CPU:

| API – Algorithm | float | double |
|---|---|---|
| Parallel STL – * | 1.00 | 1.00 |
| oneAPI – generate_table() | 0.46 | 0.27 |
| oneAPI – scale_table() | 1.61 | 1.36 |
Using NVIDIA A100-SXM4-40GB GPU:

| API – Algorithm | float | double |
|---|---|---|
| CUDA – * | 1.00 | 1.00 |
| oneAPI – generate_table() | 1.00 | 1.02 |
| oneAPI – scale_table() | 1.01 | 1.02 |
Using AMD Instinct MI210 GPU:

| API – Algorithm | float | double |
|---|---|---|
| ROCm – * | 1.00 | 1.00 |
| oneAPI – generate_table() | 1.04 | 1.08 |
| oneAPI – scale_table() | 0.91 | 0.79 |
| GPU – Algorithm | float | double |
|---|---|---|
| A100 – * | 1.00 | 1.00 |
| MI210 – generate_table() | 0.81 | 0.37 |
| MI210 – scale_table() | 0.68 | 0.87 |
Select `Fork` from the top right part of this page. You may choose a different name for your repository. In that case, you can also find/replace `one4all` with `<your-project>` in all files (case-sensitive) and `ONE4ALL_TARGET_API` with `<YOUR-PROJECT>_TARGET_API` in all `CMakeLists.txt` files. Finally, rename the `include/one4all` folder to `include/<your-project>`.
You can add your new algorithms to `include/<your-project>/algorithm`, along with unit tests and benchmarks in the corresponding `test/unit_test/unit_tests_*.cpp` and `perf/benchmark/benchmarks_*.cpp` files, respectively.

Later, if you decide to add a program, you can make a `src` folder and add the source code (e.g. `my_prog_*.cpp`) along with the following `CMakeLists.txt` into it:
```cmake
## defining target for my_prog
#
add_executable(my_prog
  my_prog_${<YOUR-PROJECT>_TARGET_API}.$<IF:$<STREQUAL:${<YOUR-PROJECT>_TARGET_API},cuda>,cu,cpp>
)

## defining link libraries for my_prog
#
target_link_libraries(my_prog PRIVATE
  ${PROJECT_NAME}::${<YOUR-PROJECT>_TARGET_API}
)

## installing my_prog
#
include(GNUInstallDirs)
install(TARGETS my_prog RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR})
```
Don't forget to replace `<YOUR-PROJECT>` with the name of your project in the above file. Finally, add `add_subdirectory(src)` at the end of the main `CMakeLists.txt` file.
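For context, the tail of the main `CMakeLists.txt` would then look roughly like this (the surrounding `add_subdirectory` lines are illustrative; your project's actual list of subdirectories may differ):

```cmake
# ... existing configuration above ...
add_subdirectory(test)   # illustrative: existing subdirectories
add_subdirectory(perf)
add_subdirectory(src)    # newly added program folder
```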