oneAPI Programming Guide
December 8, 2022
Release 2023.0
Contents
6.3 Debugging the DPC++ and OpenMP* Offload Process
6.4 Performance Tuning Cycle
6.5 oneAPI Library Compatibility
7 Glossary
7.1 Accelerator
7.2 Accessor
7.3 Application Scope
7.4 Buffers
7.5 Command Group Scope
7.6 Command Queue
7.7 Compute Unit
7.8 Device
7.9 Device Code
7.10 DPC++
7.11 Fat Binary
7.12 Fat Library
7.13 Fat Object
7.14 Host
7.15 Host Code
7.16 Images
7.17 Kernel Scope
7.18 ND-range
7.19 Processing Element
7.20 Single Source
7.21 SPIR-V
7.22 SYCL
7.23 Work-groups
7.24 Work-item
Note: Not all programs can benefit from the single programming model offered by oneAPI. It is important to
understand how to design, implement, and use the oneAPI programming model for your program.
Learn more about the oneAPI initiative and programming model at oneapi.com. The site includes the oneAPI
Specification, SYCL Language Guide and API Reference, and other resources.
Applications that take advantage of the oneAPI programming model can run on
multiple target hardware platforms ranging from CPU to FPGA. Intel offers oneAPI products as part of a set
of toolkits. The Intel® oneAPI Base Toolkit, Intel® oneAPI HPC Toolkit, Intel® oneAPI IoT Toolkit, and several
other toolkits feature complementary tools based on specific developer workload needs. For example, the In-
tel oneAPI Base Toolkit includes the Intel® oneAPI DPC++/C++ Compiler, the Intel® DPC++ Compatibility Tool,
select libraries, and analysis tools.
• Developers who want to migrate existing CUDA* code to SYCL* can use the Intel DPC++ Compatibility Tool
to help port their projects for compilation with the DPC++ compiler.
• The Intel oneAPI DPC++/C++ Compiler supports direct programming of code targeting accelerators. Di-
rect programming is coding for performance when APIs are not available for the algorithms expressed in
user code. It supports online and offline compilation for CPU and GPU targets and offline compilation for
FPGA targets.
• API-based programming is supported via sets of optimized libraries. The library functions provided in the
oneAPI product are pre-tuned for use with any supported target architecture, eliminating the need for de-
veloper intervention. For example, the BLAS routine available from Intel® oneAPI Math Kernel Library
is just as optimized for a GPU target as a CPU target.
• Finally, the compiled SYCL application can be analyzed and debugged to ensure performance, stability,
and energy efficiency goals are achieved using tools such as Intel® VTune™ Profiler or Intel® Advisor.
The Intel oneAPI Base Toolkit is available as a free download from the Intel Developer Zone.
Users familiar with Intel® Parallel Studio and Intel® System Studio may be interested in the Intel oneAPI HPC
Toolkit and Intel oneAPI IoT Toolkit respectively.
The next sections briefly describe each language and provide pointers to more information.
The best way to introduce SYCL is through an example. Since SYCL is based on modern C++, this example uses
several features that have been added to C++ in recent years, such as lambda functions and uniform initialization.
Even if developers are not familiar with these features, their semantics will become clear from the context of the
example. After gaining some experience with SYCL, these newer C++ features will become second nature.
The following application sets each element of an array to the value of its index, so that a[0] = 0, a[1] = 1, etc.
#include <CL/sycl.hpp>
#include <iostream>
using namespace sycl;

constexpr int num = 16;

int main() {
    auto r = range{num};
    buffer<int> a{r};

    queue{}.submit([&](handler& h) {
        accessor out{a, h};
        h.parallel_for(r, [=](item<1> idx) {
            out[idx] = idx;
        });
    });

    host_accessor result{a};
    for (int i = 0; i < num; ++i)
        std::cout << result[i] << "\n";
}
The first thing to notice is that there is just one source file: both the host code and the offloaded accelerator code
are combined in a single source file. The second thing to notice is that the syntax is standard C++: there aren’t
any new keywords or pragmas used to express the parallelism. Instead, the parallelism is expressed through
C++ classes. For example, the buffer class on line 9 represents data that will be offloaded to the device, and the
queue class on line 11 represents a connection from the host to the accelerator.
The logic of the example works as follows. Lines 8 and 9 create a buffer of 16 int elements, which have no initial
value. This buffer acts like an array. Line 11 constructs a queue, which is a connection to an accelerator device.
This simple example asks the SYCL runtime to choose a default accelerator device, but a more robust applica-
tion would probably examine the topology of the system and choose a particular accelerator. Once the queue
is created, the example calls the submit() member function to submit work to the accelerator. The parameter
to this submit() function is a lambda function, which executes immediately on the host. The lambda function
does two things. First, it creates an accessor on line 12, which can write elements in the buffer. Second, it calls
the parallel_for() function on line 13 to execute code on the accelerator.
The call to parallel_for() takes two parameters. One parameter is a lambda function, and the other is the
range object “r” that represents the number of elements in the buffer. SYCL arranges for this lambda to be
called on the accelerator once for each index in that range, i.e. once for each element of the buffer. The lambda
simply assigns a value to the buffer element by using the out accessor that was created on line 12. In this simple
example, there are no dependencies between the invocations of the lambda, so the program is free to execute
them in parallel in whatever way is most efficient for this accelerator.
After calling parallel_for(), the host part of the code continues running without waiting for the work to com-
plete on the accelerator. However, the next thing the host does is to create a host_accessor on line 18, which
reads the elements of the buffer. The SYCL runtime knows this buffer is written by the accelerator, so the
host_accessor constructor (line 18) is blocked until the work submitted by the parallel_for() is complete.
Once the accelerator work completes, the host code continues past line 18, and it uses the result accessor to read values
from the buffer.
This introduction to SYCL is not meant to be a complete tutorial. Rather, it just gives you a flavor of the language.
There are many more features to learn, including features that allow you to take advantage of common accelera-
tor hardware such as local memory, barriers, and SIMD. There are also features that let you submit work to many
accelerator devices at once, allowing a single application to run work in parallel on many devices simultaneously.
The following resources are useful to learning and mastering SYCL using a DPC++ compiler:
• Explore SYCL with Samples from Intel provides an overview and links to simple sample applications avail-
able from GitHub*.
• The DPC++ Foundations Code Sample Walk-Through is a detailed examination of the Vector Add sample
code, the DPC++ equivalent to a basic Hello World application.
• The oneapi.com site includes a Language Guide and API Reference with descriptions of classes and their
interfaces. It also provides details on the four programming models - platform model, execution model,
memory model, and kernel programming model.
• The DPC++ Essentials training course is a guided learning path for SYCL using Jupyter* Notebooks on
Intel® DevCloud.
• Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL
is a comprehensive book that introduces and explains key programming concepts and language details
about SYCL.
The OpenMP target construct is used to transfer control from the host to the target device. Variables are
mapped between the host and the target device. The host thread waits until the offloaded computations are
complete. Other OpenMP tasks may be used for asynchronous execution on the host; use the nowait clause
to specify that the encountering thread does not wait for the target region to complete.
C/C++
The C++ code snippet below targets a SAXPY computation to the accelerator.
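A minimal sketch of such a SAXPY offload (the names fa, fb, a, FLOPS_ARRAY_SIZE, and k follow the analysis below; the initialization values are assumptions):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int FLOPS_ARRAY_SIZE = 1024;
    float a = 2.0f;
    float *fa = (float *)malloc(sizeof(float) * FLOPS_ARRAY_SIZE);
    float *fb = (float *)malloc(sizeof(float) * FLOPS_ARRAY_SIZE);
    for (int k = 0; k < FLOPS_ARRAY_SIZE; k++) {
        fa[k] = (float)k;
        fb[k] = 1.0f;
    }

    // fa is both input and output: map it tofrom. fb and a are inputs
    // only: map them to, so nothing is copied back from the device.
    #pragma omp target map(tofrom: fa[0:FLOPS_ARRAY_SIZE]) map(to: fb[0:FLOPS_ARRAY_SIZE], a)
    #pragma omp parallel for
    for (int k = 0; k < FLOPS_ARRAY_SIZE; k++)
        fa[k] = a * fa[k] + fb[k];

    printf("fa[1] = %f\n", fa[1]);  // expect a*1 + 1 = 3.0
    free(fa);
    free(fb);
    return 0;
}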
Array fa is mapped both to and from the accelerator since fa is both input to and output from the calculation.
Array fb and the variable a are required as input to the calculation and are not modified, so there is no need to
copy them out. The variable FLOPS_ARRAY_SIZE is implicitly mapped to the accelerator. The loop index k is
implicitly private according to the OpenMP specification.
Fortran
This Fortran code snippet targets a matrix multiply to the accelerator.
Arrays a and b are mapped to the accelerator, while array c is both input to and output from the accelerator. The
variable n is implicitly mapped to the accelerator. The private clause is optional since loop indices are automat-
ically private according to the OpenMP specification.
To optimize data sharing between the host and the accelerator, the target data directive maps variables to the
accelerator and the variables remain in the target data region for the extent of that region. This feature is useful
when mapping variables across multiple target regions.
C/C++
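A minimal sketch of this pattern (the array name fa and size N are assumptions): fa is transferred once in and once out for the whole target data region, instead of once per target construct.

#include <stdlib.h>

int main(void) {
    int N = 1024;
    float *fa = (float *)malloc(sizeof(float) * N);
    for (int k = 0; k < N; k++)
        fa[k] = (float)k;

    // fa stays resident on the device across both target regions.
    #pragma omp target data map(tofrom: fa[0:N])
    {
        #pragma omp target
        #pragma omp parallel for
        for (int k = 0; k < N; k++)
            fa[k] = 2.0f * fa[k];

        #pragma omp target
        #pragma omp parallel for
        for (int k = 0; k < N; k++)
            fa[k] = fa[k] + 1.0f;
    }

    free(fa);
    return 0;
}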
Fortran
Clauses
The clauses can be one or more of the following. See TARGET DATA for more information.
• DEVICE (integer-expression)
• IF ([TARGET DATA:] scalar-logical-expression)
• MAP ([[map-type-modifier[,]] map-type:] list), where map-type is one of:
– alloc
– to
– from
– tofrom
– delete
– release
• SUBDEVICE ([integer-constant ,] integer-expression [ : integer-expression [ : integer-expression]])
• USE_DEVICE_ADDR (list) // available only in ifx
• USE_DEVICE_PTR (ptr-list)
Use the target update directive to synchronize an original variable in the host with the corresponding variable in
the device.
The following example commands illustrate how to compile an application using OpenMP target.
C/C++
• Linux:
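A typical command (assuming the icx compiler with its OpenMP offload options; the source file name is hypothetical):
icx -fiopenmp -fopenmp-targets=spir64 code.c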
Fortran
• Linux:
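A typical command (assuming the ifx compiler; the source file name is hypothetical):
ifx -fiopenmp -fopenmp-targets=spir64 code.f90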
• Windows:
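A typical command (assuming the ifx compiler; Windows uses /Q-style options):
ifx /Qiopenmp /Qopenmp-targets:spir64 code.f90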
• Intel offers code samples that demonstrate using OpenMP directives to target accelerators at https://
github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming. Specific samples include:
– Matrix Multiplication is a simple program that multiplies together two large matrices and verifies the
results. The program is implemented in two ways: SYCL* and OpenMP.
– The ISO3DFD sample refers to Three-Dimensional Finite-Difference Wave Propagation in Isotropic
Media. The sample is a three-dimensional stencil used to simulate a wave propagating in a 3D
isotropic medium. The sample shows some of the more common challenges and techniques when
targeting OMP accelerator devices in more complex applications to achieve good performance.
– openmp_reduction is a simple program that calculates pi. This program is implemented using C++
and OpenMP for CPUs and accelerators based on Intel® Architecture.
• Get Started with OpenMP* Offload Feature provides details on using Intel’s compilers with OpenMP of-
fload, including lists of supported options and example code.
• LLVM/OpenMP Runtimes describes the distinct types of runtimes available and can be helpful when de-
bugging OpenMP offload.
• openmp.org has an examples document: https://www.openmp.org/wp-content/uploads/
openmp-examples-4.5.0.pdf. Chapter 4 of the examples document focuses on accelerator devices and
the target construct.
• Using OpenMP - the Next Step is a good OpenMP reference book. Chapter 6 covers OpenMP support
for heterogeneous systems. For additional information on this book, see https://www.openmp.org/tech/
using-openmp-next-step.
Host code can explicitly select a device type. To select a device, create a queue and initialize its device with
one of the following:
• default_selector
• cpu_selector
• gpu_selector
• accelerator_selector
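For example, a minimal sketch that constructs queues from these selectors (the device name query and output text are illustrative):

#include <CL/sycl.hpp>
#include <iostream>
using namespace sycl;

int main() {
    queue q{default_selector{}};  // heuristic choice among available devices
    std::cout << "Running on: "
              << q.get_device().get_info<info::device::name>() << "\n";
    try {
        queue gpu_q{gpu_selector{}};  // throws if no GPU device is available
    } catch (sycl::exception const& e) {
        std::cout << "No GPU device: " << e.what() << "\n";
    }
    return 0;
}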
If default_selector is used, the kernel runs based on a heuristic that chooses from available compute devices
(all, or a subset based on the value of the SYCL_DEVICE_FILTER environment variable).
If a specific device type (such as cpu_selector or gpu_selector) is used, then it is expected that the specified
device type is available in the platform or included in the filter specified by SYCL_DEVICE_FILTER. If such a device
is not available, then the runtime system throws an exception indicating that the requested device is not available.
This error can be thrown in the situation where an ahead-of-time (AOT) compiled binary is run on a platform that
does not contain the specified device type.
Note: While DPC++ applications can run on any supported target hardware, tuning is required to derive the
best performance advantage on a given target architecture. For example, code tuned for a CPU likely will not
run as fast on a GPU accelerator without modification.
SYCL_DEVICE_FILTER is a complex environment variable that allows you to limit the runtimes, compute device
types, and compute device IDs that may be used by the DPC++ runtime to a subset of all available combinations.
The compute device IDs correspond to those returned by the SYCL API, clinfo, or sycl-ls (with the number-
ing starting at 0). They have no relation to whether the device with that ID is of a certain type or supports a
specific runtime. Using a programmatic special selector (like gpu_selector) to request a filtered out device will
cause an exception to be thrown. Refer to the environment variable description in GitHub for details on use and
example values: https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md.
The sycl-ls tool enumerates a list of devices available in the system. It is strongly recommended to run this
tool before running any SYCL or DPC++ programs to make sure the system is configured properly. As a part of
enumeration, sycl-ls prints the SYCL_DEVICE_FILTER string as a prefix of each device listing. The format of the
sycl-ls output is [SYCL_DEVICE_FILTER] Platform_name, Device_name, Device_version [driver_version].
In the following example, the string enclosed in brackets ([ ]) at the beginning of each line is the
SYCL_DEVICE_FILTER string used to designate the specific device on which the program will run.
$ sycl-ls
[opencl:acc:0] Intel® FPGA Emulation Platform for OpenCL™, Intel® FPGA Emulation Device 1.2 [2021.12.9.0.24_005321]
[opencl:gpu:1] Intel® OpenCL HD Graphics, Intel® UHD Graphics 630 [0x3e92] 3.0 [21.37.20939]
[opencl:cpu:2] Intel® OpenCL, Intel® Core™ i7-8700 CPU @ 3.20GHz 3.0 [2021.12.9.0.24_005321]
[level_zero:gpu:0] Intel® Level-Zero, Intel® UHD Graphics 630 [0x3e92] 1.1 [1.2.20939]
[host:host:0] SYCL host platform, SYCL host device 1.2 [1.2]
Additional information about device selection is available from the DPC++ Language Guide and API Reference.
OpenMP provides a set of APIs for programmers to query and set the device used for running code on a device.
Host code can explicitly select and set a device number. For each offload region, a programmer can also use a
device clause to specify the target device to be used for executing that region.
• int omp_get_num_procs (void) routine returns the number of processors available to the device
• void omp_set_default_device(int device_num) routine controls the default target device
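A small sketch of these routines together with the device clause (using device 0 is an assumption; device numbers start at 0):

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("procs on current device: %d\n", omp_get_num_procs());
    if (omp_get_num_devices() > 0)   // any offload devices available?
        omp_set_default_device(0);   // subsequent target regions use device 0

    #pragma omp target device(0)     // or select a device per region
    {
        // offloaded work
    }
    return 0;
}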
On Linux* or macOS* systems, the Intel oneAPI development tools are typically installed in the
/opt/intel/oneapi/ directory. This is the default location; the precise location can be changed during
installation.
Within the oneAPI installation directory are a collection of folders that contain the compilers, libraries, analyzers,
and other tools installed on the development system. The precise list depends on the toolkit(s) installed and the
options selected during installation. Most of the folders within the oneAPI installation directory have obvious
names. For example, the mkl folder contains the Intel® oneAPI Math Kernel Library (Intel® oneMKL), the ipp
folder contains the Intel® Integrated Performance Primitives (Intel® IPP) library, and so on.
Most of the oneAPI component tool folders contain an environment script named vars.bat that configures
the environment variables needed by that component to support oneAPI development work. For example, in a
default installation, the Intel® Integrated Performance Primitives (Intel® IPP) vars script on Windows is located at:
C:\Program Files (x86)\Intel\oneAPI\ipp\latest\env\vars.bat. This pattern is shared by all oneAPI
components that include an environment vars setup script.
These component tool vars scripts can be called directly or collectively. To call them collectively, a script named
setvars.bat is provided in the oneAPI installation folder. For example, in a default installation on a Windows
machine: C:\Program Files (x86)\Intel\oneAPI\setvars.bat.
Running the setvars.bat script without any arguments causes it to locate and run all <component>\latest\
env\vars.bat scripts in the installation. Changes made to the environment by these scripts can be seen by
running the Windows set command after running the environment setup scripts.
Visual Studio Code* developers can install a oneAPI environment extension to run the setvars.bat within Vi-
sual Studio Code. Learn more in Using Visual Studio Code with Intel oneAPI Toolkits.
Note: Changes to your environment made by running the setvars.bat script (or the individual vars.bat
scripts) are not permanent. Those changes only apply to the cmd.exe session in which the setvars.bat envi-
ronment script was executed.
The setvars.bat script supports several command-line arguments, which are displayed using the --help op-
tion. For example:
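<install-dir>\setvars.bat --help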
The --config=file argument and the ability to include arguments that will be passed to the vars.bat scripts
that are called by the setvars.bat script can be used to customize the environment setup.
The --config=file argument provides the ability to limit environment initialization to a specific set of oneAPI
components. It also provides a way to initialize the environment for specific component versions. For example,
to limit environment setup to just the Intel® IPP library and the Intel® oneAPI Math Kernel Library (Intel® oneMKL),
pass a config file that tells the setvars.bat script to only call the vars.bat environment scripts for those two
oneAPI components. More details and examples are provided in Use a Config file for setvars.bat on Windows.
Any extra arguments passed on the setvars.bat command line that are not described in the setvars.bat help
message will be passed to every called vars.bat script. That is, if the setvars.bat script does not recognize an
argument, it assumes the argument is meant for use by one or more component vars scripts and passes those
extra arguments to every component vars.bat script that it calls. The most common extra arguments are ia32
and intel64, which are used by the Intel compilers and the IPP, MKL, and TBB libraries to specify the application
target architecture.
If more than one version of Microsoft Visual Studio* is installed on your system, you can specify which Visual
Studio environment should be initialized as part of the oneAPI setvars.bat environment initialization by adding
the vs2017, vs2019, or vs2022 argument to the setvars.bat command line. By default, the most recent version
of Visual Studio is located and initialized.
Note: Support for Microsoft Visual Studio* 2017 is deprecated as of the Intel® oneAPI 2022.1 release, and will
be removed in a future release.
Inspect the individual vars.bat scripts to determine which, if any, command line arguments they accept.
How to Run
<install-dir>\setvars.bat
How to Verify
After executing setvars.bat, verify success by searching for the SETVARS_COMPLETED environment vari-
able. If setvars.bat was successful the SETVARS_COMPLETED environment variable will have a value of
1:
set | find "SETVARS_COMPLETED"
Return value
SETVARS_COMPLETED=1
If the return value is anything other than SETVARS_COMPLETED=1 the test failed and setvars.bat did not com-
plete properly.
Multiple Runs
Because many of the individual env\vars.bat scripts make significant changes to PATH, CPATH, and other
environment variables, the top-level setvars.bat script will not allow multiple invocations of itself in the same
session. This is done to ensure that your environment variables do not exceed the maximum provided environ-
ment space, especially the %PATH% environment variable. Exceeding the available environment space results in
unpredictable behavior in your terminal session and should be avoided.
This behavior can be overridden by passing setvars.bat the --force flag. In this example, the user tries to run
setvars.bat twice. The second instance is stopped because setvars.bat has already been run.
> <install-dir>\setvars.bat
initializing environment ...
(SNIP: lot of output)
oneAPI environment initialized
> <install-dir>\setvars.bat
WARNING: setvars.bat has already been run. Skipping re-execution.
   To force a re-execution of setvars.bat, use the '--force' option.
   Using '--force' can result in excessive use of your environment variables.
In the third instance, the user runs <install-dir>\setvars.bat --force and the initialization is successful.
The ONEAPI_ROOT variable is set by the top-level setvars.bat script when that script is sourced. If there is al-
ready a ONEAPI_ROOT environment variable defined, setvars.bat temporarily overwrites it in the cmd.exe ses-
sion in which you ran the setvars.bat script. This variable is primarily used by the oneapi-cli sample browser
and the Microsoft Visual Studio and Visual Studio Code* sample browsers to help them locate oneAPI tools and
components, especially for locating the setvars.bat script if the SETVARS_CONFIG feature has been enabled.
For more information about the SETVARS_CONFIG feature, see Automate the setvars.bat Script with Microsoft
Visual Studio*.
On Windows systems, the installer adds the ONEAPI_ROOT variable to the environment.
The setvars.bat script sets environment variables for use with the oneAPI toolkits by executing each of the
<component>\latest\env\vars.bat scripts found in the respective oneAPI folders. Unless you configure
your Windows system to run the setvars.bat script automatically, it must be executed every time a new terminal
window is opened for command line development, or prior to launching Visual Studio Code, Sublime Text, or any
other C/C++ editor you use. For more information, see Configure Your System.
The procedure below describes how to use a configuration file to manage environment variables.
Some oneAPI tools support installation of multiple versions. For those tools that do support multiple versions,
the directory is organized like this (assuming a default installation and using the compiler as an example):
For all tools, there is a symbolic link named latest that points to the latest installed version of that component;
and the vars.bat script located in the latest\env\ folder is what the setvars.bat executes by default.
If required, setvars.bat can be customized to point to a specific directory by using a configuration file.
--config Parameter
The top level setvars.bat script accepts a --config parameter that identifies your custom config.txt file.
<install-dir>\setvars.bat --config="path\to\your\config.txt"
Your configuration file can have any name you choose. You can create many config files to set up a variety
of development or test environments. For example, you might want to test the latest version of a library with
an older version of a compiler; use a setvars config file to manage such a setup.
mkl=1.1
dldt=exclude
default=exclude
mkl=1.0
ipp=latest
The config file can be used to exclude specific components, include specific component versions, or include
only those component versions that are named after a "default=exclude" statement.
By default, setvars.bat will process the latest version of each env\vars.bat script.
The sample below shows two versions of Intel oneMKL installed: 2021.1.1 and 2021.2.0. The latest shortcut
points to the 2021.2.0 folder because it is the latest version installed. By default, setvars.bat will execute the
2021.2.0 vars.bat script in the mkl folder because that is the folder that latest points to.
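To use the 2021.1.1 version instead, name that version in the config file:

mkl=2021.1.1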
This instructs setvars.bat to execute the env\vars.bat script located in the 2021.1.1 version folder inside
the mkl directory. For other installed components, setvars.bat will execute the env\vars.bat script located
in the latest version folder.
Exclude Specific Components
To exclude a component, use the following syntax:
<key>=exclude
For example, to exclude Intel IPP, but include the 2021.1.1 version of Intel oneMKL:
mkl=2021.1.1
ipp=exclude
In this example:
• setvars.bat WILL execute the Intel oneMKL 2021.1.1 env\vars.bat script
• setvars.bat WILL NOT execute Intel IPP env\vars.bat script files
• setvars.bat WILL execute the latest version of the remaining env\vars.bat script files
Include Specific Components
To execute a specific list of component env\vars.bat scripts, you must first exclude all env\vars.bat scripts.
Then add back the list of components to be executed by setvars.bat. Use the following syntax to exclude all
component env\vars.bat scripts from being executed:
default=exclude
For example, to have setvars.bat execute only the Intel oneMKL and Intel IPP component env\vars.bat
scripts, use this config file:
default=exclude
mkl=2021.1.1
ipp=latest
In this example:
• setvars.bat WILL execute the Intel oneMKL 2021.1.1 env\vars.bat script
• setvars.bat WILL execute the latest version of the Intel IPP env\vars.bat script
• setvars.bat WILL NOT execute the env\vars.bat script for any other components
Note: Support for Microsoft Visual Studio* 2017 is deprecated as of the Intel® oneAPI 2022.1 release, and will
be removed in a future release.
The setvars.bat script sets up the environment variables needed to use the oneAPI toolkits. This script must
be run every time a new terminal window is opened for command-line development. The setvars.bat script can
also be run automatically when Microsoft Visual Studio is started. You can configure this feature to instruct the
setvars.bat script to set up a specific set of oneAPI tools by using the SETVARS_CONFIG environment variable.
The SETVARS_CONFIG environment variable enables automatic configuration of the oneAPI development
environment when you start your instance of Microsoft Visual Studio. The variable has three conditions or
states:
• Undefined (the SETVARS_CONFIG environment variable does not exist)
• Defined but empty (the value contains nothing or only whitespace)
• Defined and points to a setvars.bat configuration file
If SETVARS_CONFIG is undefined there will be no attempt to automatically run setvars.bat when Visual Studio
is started. This is the default case, since the SETVARS_CONFIG variable is not defined by the oneAPI installer.
If SETVARS_CONFIG is defined and has no value (or contains only whitespace), the setvars.bat script will be
automatically run when Visual Studio is started. In this case, the setvars.bat script initializes the environment for
all oneAPI tools that are installed on your system. For more information about running the setvars.bat script,
see Build and Run a Sample Project Using the Visual Studio* Command Line.
When SETVARS_CONFIG is defined with the absolute pathname to a setvars configuration file, the setvars.bat
script will be automatically run when Visual Studio is started. In this case, the setvars.bat script initializes the
environment for only those oneAPI tools that are defined in the setvars configuration file. For more information
about how to create a setvars config file, see Using a Config File with setvars.bat.
A setvars configuration file can have any name and can be saved to any location on your hard disk, as long as
that location and the file are accessible and readable by Visual Studio. (A plug-in that was added to Visual Studio
when you installed the oneAPI tools on your Windows system performs the SETVARS_CONFIG actions; that
is why Visual Studio must have access to the location and contents of the setvars configuration file.)
If you leave the setvars config file empty, the setvars.bat script will initialize your environment for all oneAPI
tools that are installed on your system. This is equivalent to defining the SETVARS_CONFIG variable with an empty
string. See Using a Config File with setvars.bat for details regarding what to put inside of your setvars config
file.
Since the SETVARS_CONFIG environment variable is not automatically defined during installation, you must add
it to your environment before starting Visual Studio (per the rules above). You can define the SETVARS_CONFIG
environment variable using the Windows SETX command or in the Windows GUI tool by typing “rundll32.exe
sysdm.cpl,EditEnvironmentVariables” into the “Win+R” dialog (use “Win+R” to bring up the dialog).
Most of the component tool folders contain an environment script named vars.sh that configures the envi-
ronment variables needed by that component to support oneAPI development work. For example, in a default
installation, the Intel® Integrated Performance Primitives (Intel® IPP) vars script on Linux or macOS is located
at: /opt/intel/ipp/latest/env/vars.sh. This pattern is shared by all oneAPI components that include an
environment vars setup script.
These component tool vars scripts can be called directly or collectively. To call them collectively, a script named
setvars.sh is provided in the oneAPI installation folder. For example, in a default installation on a Linux or ma-
cOS machine: /opt/intel/setvars.sh.
Sourcing the setvars.sh script without any arguments causes it to locate and source all <component>/
latest/env/vars.sh scripts in the installation. Changes made to the environment by these scripts can be seen
by running the env command after running the environment setup scripts.
Note: Changes to your environment made by sourcing the setvars.sh script (or the individual vars.sh
scripts) are not permanent. Those changes only apply to the terminal session in which the setvars.sh en-
vironment script was sourced.
The setvars.sh script supports several command-line arguments, which are displayed using the --help op-
tion. For example:
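source <install-dir>/setvars.sh --help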
The --config=file argument and the ability to include arguments that will be passed to the vars.sh scripts
that are called by the setvars.sh script can be used to customize the environment setup.
The --config=file argument provides the ability to limit environment initialization to a specific set of oneAPI
components. It also provides a way to initialize the environment for specific component versions. For example,
to limit environment setup to just the Intel® IPP library and the Intel® oneAPI Math Kernel Library (Intel® oneMKL),
pass a config file that tells the setvars.sh script to only call the vars.sh environment scripts for those two
oneAPI components. More details and examples are provided in Use a Config file for setvars.sh on Linux or
macOS.
Any extra arguments passed on the setvars.sh command line that are not described in the setvars.sh help
message will be passed to every called vars.sh script. That is, if the setvars.sh script does not recognize an
argument, it assumes the argument is meant for use by one or more component scripts and passes those extra
arguments to every component vars.sh script that it calls. The most common extra arguments are ia32 and
intel64, which are used by the Intel compilers and the IPP, MKL, and TBB libraries to specify the application
target architecture.
Inspect the individual vars.sh scripts to determine which, if any, command line arguments they accept.
How to Run
source <install-dir>/setvars.sh
Note: If you are using a non-POSIX shell, such as csh, use the following command:
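bash -c 'source <install-dir>/setvars.sh ; exec csh'
(a commonly documented workaround; substitute your shell for csh)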
Alternatively, use the modulefiles scripts to set up your development environment. The modulefiles scripts work
with all Linux shells.
If you wish to fine tune the list of components and the version of those components, use a setvars config file to
set up your development environment.
How to Verify
After sourcing the setvars.sh script, verify success by searching for the SETVARS_COMPLETED environment
variable. If setvars.sh was successful, then the SETVARS_COMPLETED environment variable will have a value
of 1:
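env | grep SETVARS_COMPLETED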
Return value
SETVARS_COMPLETED=1
If the return value is anything other than SETVARS_COMPLETED=1, then the test failed and setvars.sh did not
complete properly.
Multiple Runs
Because many of the individual env/vars.sh scripts make significant changes to PATH, CPATH, and other en-
vironment variables, the top-level setvars.sh script will not allow multiple invocations of itself in the same ses-
sion. This is done to ensure that your environment variables do not become too long due to redundant path
references, especially the $PATH environment variable.
This behavior can be overridden by passing setvars.sh the --force flag. In this example, the user tries to run
setvars.sh twice. The second instance is stopped because setvars.sh has already been run.
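The session follows the same pattern as the Windows example shown earlier:

$ source <install-dir>/setvars.sh
initializing environment ...
(SNIP: lot of output)
oneAPI environment initialized

$ source <install-dir>/setvars.sh
WARNING: setvars.sh has already been run. Skipping re-execution.
   To force a re-execution of setvars.sh, use the '--force' option.
   Using '--force' can result in excessive use of your environment variables.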
In the third instance, the user runs setvars.sh --force and the initialization is successful.
The ONEAPI_ROOT variable is set by the top-level setvars.sh script when that script is sourced. If there is already
a ONEAPI_ROOT environment variable defined, setvars.sh temporarily overwrites it in the terminal session in
which you sourced the setvars.sh script. This variable is primarily used by the oneapi-cli sample browser
and the Eclipse* and Visual Studio Code* sample browsers to help them locate oneAPI tools and components,
especially for locating the setvars.sh script if the SETVARS_CONFIG feature has been enabled. For more
information about the SETVARS_CONFIG feature, see Automate the setvars.sh Script with Eclipse*.
On Linux and macOS systems, the installer does not add the ONEAPI_ROOT variable to the environment. To add
it to the default environment, define the variable in your local shell initialization file(s) or in the system’s /etc/
environment file.
Some oneAPI tools support installation of multiple versions. For those tools that do support multiple versions,
the directory is organized like this:
intel/oneapi/compiler/
|-- 2021.1.1
|-- 2021.2.0
`-- latest -> 2021.2.0
For all tools, there is a symlink named latest that points to the latest installed version of that component; and
the vars.sh script located in the latest/env/ folder is what the setvars.sh sources by default.
If required, setvars.sh can be customized to point to a specific directory by using a configuration file.
--config Parameter
The top level setvars.sh script accepts a --config parameter that identifies your custom config.txt file.
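source <install-dir>/setvars.sh --config="path/to/your/config.txt"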
Your configuration file can have any name you choose. You can create many config files to set up a variety
of development or test environments. For example, you might want to test the latest version of a library with
an older version of a compiler; use a setvars config file to manage such a setup.
mkl=1.1
dldt=exclude
default=exclude
mkl=1.0
ipp=latest
The config file can be used to exclude specific components, include specific component versions, or include
only those component versions that are named after a "default=exclude" statement.
By default, setvars.sh will process the latest version of each env/vars.sh script.
The sample below shows two versions of Intel oneMKL installed: 2021.1.1 and 2021.2.0. The latest symlink
points to the 2021.2.0 folder because it is the latest version. By default setvars.sh will source the 2021.2.0
vars.sh script in the mkl folder because that is the folder that latest points to.
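To source the 2021.1.1 version instead, name that version in the config file (mkl=2021.1.1). This instructs
setvars.sh to source the env/vars.sh script located in the 2021.1.1 version folder inside the mkl directory.

Exclude Specific Components

To exclude a component, use the following syntax: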
<key>=exclude
For example, to exclude Intel IPP, but include the 2021.1.1 version of Intel oneMKL:
mkl=2021.1.1
ipp=exclude
In this example:
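• setvars.sh WILL source the Intel oneMKL 2021.1.1 env/vars.sh script
• setvars.sh WILL NOT source Intel IPP env/vars.sh script files
• setvars.sh WILL source the latest version of the remaining env/vars.sh script files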
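Include Specific Components

To source a specific list of component env/vars.sh scripts, you must first exclude all env/vars.sh scripts.
Then add back the list of components to be sourced by setvars.sh. Use the following syntax to exclude all
component env/vars.sh scripts from being sourced: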
default=exclude
For example, to have setvars.sh source only the Intel oneMKL and Intel IPP component env/vars.sh scripts,
use this config file:
default=exclude
mkl=2021.1.1
ipp=latest
In this example:
• setvars.sh WILL source the Intel oneMKL 2021.1.1 env/vars.sh script
• setvars.sh WILL source the latest version of the Intel IPP env/vars.sh script
• setvars.sh WILL NOT source the env/vars.sh script for any other components
The setvars.sh script sets up the environment variables needed to use the oneAPI toolkits. This script must be
run every time a new terminal window is opened for command-line development. The setvars.sh script can
also be run automatically when Eclipse* is started. You can configure this feature to instruct the setvars.sh
script to set up a specific set of oneAPI tools by using the SETVARS_CONFIG environment variable.
The SETVARS_CONFIG environment variable enables automatic configuration of the oneAPI development envi-
ronment when you start your instance of Eclipse IDE for C/C++ Developers. The variable has three conditions
or states:
• Undefined (the SETVARS_CONFIG environment variable does not exist)
• Defined but empty (the value contains nothing or only whitespace)
• Defined and points to a setvars.sh configuration file
If SETVARS_CONFIG is undefined or if it exists but has no value (or contains only whitespace), the setvars.sh
script will be automatically run when Eclipse is started. In this case, the setvars.sh script initializes the environ-
ment for all oneAPI tools that are installed on your system. For more information about running the setvars.sh
script, see Build and Run a Sample Project Using Eclipse.
When SETVARS_CONFIG is defined with the absolute pathname to a setvars configuration file, the setvars.sh
script will be automatically run when Eclipse is started. In this case, the setvars.sh script initializes the environ-
ment for only those oneAPI tools that are defined in the setvars configuration file. For more information about
how to create a setvars config file, see Use a Config file for setvars.sh on Linux or macOS.
Note: The default SETVARS_CONFIG behavior in Eclipse is different than the behavior described for Visual
Studio on Windows. When starting Eclipse, automatic execution of the setvars.sh script is always attempted.
When starting Visual Studio, automatic execution of the setvars.bat script is only attempted if the
SETVARS_CONFIG environment variable has been defined.
A setvars configuration file can have any name and can be saved to any location on your hard disk, as long as
that location and the file are accessible and readable by Eclipse. (A plug-in that was added to Eclipse when you
installed the oneAPI tools on your Linux system performs the SETVARS_CONFIG actions; that is why Eclipse must
have access to the location and contents of the setvars configuration file.)
If you leave the setvars config file empty, the setvars.sh script will initialize your environment for all oneAPI
tools that are installed on your system. This is equivalent to defining the SETVARS_CONFIG variable with an empty
string. See Use a Config file for setvars.sh on Linux or macOS for details regarding what to put inside of your
setvars config file.
Since the SETVARS_CONFIG environment variable is not automatically defined during installation, you must add
it to your environment before starting Eclipse (per the rules above). There are a variety of places to define the
SETVARS_CONFIG environment variable:
• /etc/environment
• /etc/profile
• ~/.bashrc
• and so on…
The list above shows common places to define environment variables on a Linux system. Ultimately, where you
choose to define the SETVARS_CONFIG environment variable depends on your system and your needs.
Note: The modulefiles provided with the Intel oneAPI toolkits are compatible with the Tcl Environment Modules
(Tmod) and Lua Environment Modules (Lmod). The following minimum versions are supported:
• Tmod 3.2.10 (compiler modulefile requires 4.1, see below)
• Tcl version 8.4
• Lmod version 8.2.10
Test which version is installed on your system using the following command:
module --version
Each modulefile automatically verifies the Tcl version on your system when it runs.
If your modulefile version is not supported, a workaround may be possible. See Using Environment Modules
with Intel Development Tools for more details.
As of the oneAPI 2021.4 release you can use the icc modulefile to set up the icc and ifort compilers if you are
using version 3.2.10 of the Tcl Environment Modules. A future oneAPI release will resolve the support for the
compiler modulefile.
The oneAPI modulefile scripts are located in a modulefiles directory inside each component folder (similar to
how the individual vars scripts are located). For example, in a default installation, the ipp modulefiles script(s)
are in the /opt/intel/ipp/latest/modulefiles/ directory.
Due to how oneAPI component folders are organized on the disk, it can be difficult to use the oneAPI modulefiles
directly where they are installed. Therefore, a special modulefiles-setup.sh script is provided in the oneAPI
installation folder to make it easier to work with the oneAPI modulefiles. In a default installation, that setup script
is located here: /opt/intel/oneapi/modulefiles-setup.sh
The modulefiles-setup.sh script locates all modulefile scripts that are part of your oneAPI installation and
organizes them into a single directory of versioned modulefiles scripts.
Each of these versioned modulefiles scripts is a symlink that points to the modulefiles located by the
modulefiles-setup.sh script. Each component folder includes (at minimum) a “latest” version modulefile
that will be selected, by default, when loading a modulefile without specifying a version label. If you use the
--ignore-latest option when running the modulefiles-setup.sh script, the modulefile with the highest
semver version will be loaded if no version is specified by the module load command.
Note: By default, the modulefiles-setup.sh script creates a folder named modulefiles in the oneAPI toolkit
installation folder. If your oneAPI installation folder is not writeable, use the --output-dir=<path-to-folder>
option to create the modulefiles folder in a writeable location. Run modulefiles-setup.sh --help for more
information about this and other modulefiles-setup.sh script options.
Running the modulefiles-setup.sh script creates the modulefiles output folder, which is organized like the
following example (the precise list of modulefiles depends on your installation). In this example there is one
modulefile for configuring the Intel® Advisor environment and two modulefiles for configuring the compiler en-
vironment (the compiler modulefile configures the environment for all Intel compilers). If you follow the latest
symlinks, they point to the highest version modulefile, per semver rules.
Update your MODULEPATH to include the modulefiles output folder that was created by the
modulefiles-setup.sh script, or run the module use <folder_name> command.
The instructions below will help you quickly get started with the Environment Modules utility on Ubuntu*. For
full details regarding installation and configuration of the module utility, see http://modules.sourceforge.net/.
Confirm that the local copy of tclsh is new enough (see the beginning of this page for a list of supported ver-
sions):
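For example:

$ echo 'puts $tcl_version' | tclsh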
$ source /usr/share/modules/init/sh
$ module
Note: Initialization of the Modulefiles environment in POSIX-compatible shells should work with the source
command shown above. Shell-specific init scripts are provided in the /usr/share/modules/init/ folder. See
that folder and the initialization section in man module for more details.
Source the module alias init script (.../modules/init/sh) in a global or local startup script to ensure the mod-
ule command is always available. At this point, the system should be ready to use the module command as
shown in the following section.
Before the unload step, use the env command to inspect the environment and look for the changes that were
made by the modulefile you loaded. For example, if you loaded the tbb modulefile, the command will show
you some of the env changes made by that modulefile (inspect the modulefile to see all of the changes it will
make):
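For example (assuming the tbb modulefile is available on your system):

$ module load tbb
$ env | grep -i tbb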
Note: A modulefile is a script, but it does not need to have the 'x' (executable) permission set, because it is
loaded and interpreted by the "module" interpreter that is installed and maintained by the end user. Installation
of the oneAPI toolkits does not include the modulefile interpreter; it must be installed separately. Likewise,
modulefiles do not require that the 'w' permission be set, but they must be readable (ideally, the 'r' permission
is set for all users).
Versioning
The oneAPI toolkit installer uses version folders to allow oneAPI tools and libraries to exist in a side-by-side
layout. These versioned component folders are used by the modulefiles-setup.sh script to create the ver-
sioned modulefiles. The script organizes the symbolic links it creates in the modulefiles output folder as
<modulefile-name>/version, so that each respective modulefile can be referenced by version when using
the module command.
$ module avail
---------------- modulefiles -----------------
ipp/1.1 ipp/1.2 compiler/1.0 compiler32/1.0
Multiple modulefiles
A tool or library may provide multiple modulefiles within its modulefiles folder. Each becomes a loadable
module. They will be assigned a version per the component folder from which they were extracted.
Symbolic links are used by the modulefiles-setup.sh script to gather all the available modulefiles into a sin-
gle modulefiles folder. This means that the actual modulefile scripts are not moved or modified. As a con-
sequence, the ${ModulesCurrentModulefile} variable points to the symlink to each modulefile, not to the
actual modulefile located in the respective installation folders. To determine the full path to the actual module-
files, each modulefile starts with a statement that resolves its own symlink to get a direct reference to the
original modulefile in the product install directory. This is done because the
actual install location can be customized and is, therefore, unknown at runtime and must be deduced. For that
reason, the actual modulefile cannot be moved outside of the installed location, otherwise it will not be able to
locate the absolute path to the library or application that it must configure.
For a better understanding, review the modulefiles included with the installation. Most include comments ex-
plaining how they resolve symlink references to a real file, as well as parsing the version number (and version
directory). They also include checks to ensure that the installed Tcl is an appropriate version.
Several of the modulefiles use the module load command to ensure that any required dependent modules
are also loaded. There is no attempt to specify the version of those dependent modulefiles. This means you
have the option to load a specific version of a dependent module prior to loading the module that requires that
dependent module. If you do not preload a dependent module, the latest available version is loaded.
This is by design because it gives you the flexibility to control the environment. For example, you may have
installed an updated version of a library that you want to test against a previous version of the compiler. Perhaps
the updated library has a bug fix and you are not interested in changing the version of any other libraries in your
build. If the dependent modulefiles were hard-coded to require a specific dependent version of this library, you
could not perform such a test.
Note: If a dependent module load cannot be satisfied, the currently loading module file will be terminated and
no changes will be made to your environment.
Additional Resources
The CMake packages provided with Intel oneAPI products allow a CMake project to make easy use of oneAPI
libraries on Windows*, Linux*, or macOS*. Using the provided packages, the experience should be similar to how
other system libraries integrate with a CMake project. Dependency information and other build variables are
provided to CMake project targets as needed.
The following components support CMake:
• Intel® oneAPI DPC++ Compiler - Linux, Windows
• Intel Integrated Performance Primitives (Intel IPP) and Intel Integrated Performance Primitives Cryptog-
raphy (Intel IPP Cryptography) - Linux, Windows
• Intel MPI Library - Linux, Windows
• Intel oneAPI Collective Communications Library (oneCCL) - Linux, Windows
• Intel oneAPI Data Analytics Library (oneDAL) - Linux, Windows
• Intel oneAPI Deep Neural Network Library (oneDNN) - Linux, Windows
For more information about options, you can go to the option descriptions found in the Compiler Options section
of the Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference.
The compiler driver has different compatibilities on different OS hosts. On Linux, icpx -fsycl provides GCC*-
style command line options. On Windows, icx-cl provides Microsoft Visual C++* compatibility with Microsoft
Visual Studio*.
• icpx recognizes GCC-style command line options (starting with "-") and can be useful for projects that share
a build system across multiple operating systems.
• icx-cl recognizes Windows command line options (starting with "/") and can be useful for Microsoft Visual
Studio-based projects.
The following code shows usage of an API call (a * x + y) employing the Intel oneAPI Math Kernel Library
function oneapi::mkl::blas::axpy to multiply a times x and add y across vectors of floating point numbers.
It takes advantage of the oneAPI programming model to perform the addition on an accelerator.
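The opening of the listing is a plausible reconstruction (the header name, alpha value, and n_elements are assumptions consistent with the rest of the fragment):

#include <CL/sycl.hpp>
#include <iostream>
#include <vector>
#include <cstdlib>
#include "oneapi/mkl.hpp"

int main() {
    double alpha = 2.0;
    int n_elements = 1024;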
int incx = 1;
std::vector<double> x;
x.resize(incx * n_elements);
for (int i=0; i<n_elements; i++)
x[i*incx] = 4.0 * double(std::rand()) / RAND_MAX - 2.0;
// rand value between -2.0 and 2.0
int incy = 3;
std::vector<double> y;
y.resize(incy * n_elements);
for (int i=0; i<n_elements; i++)
y[i*incy] = 4.0 * double(std::rand()) / RAND_MAX - 2.0;
// rand value between -2.0 and 2.0
cl::sycl::device my_dev;
try {
my_dev = cl::sycl::device(cl::sycl::gpu_selector());
} catch (...) {
std::cout << "Warning, failed at selecting gpu device. Continuing on default(host)␣
,→device.\n";
}
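// (Reconstructed: the queue and buffer setup did not survive in this
// listing; the names my_queue, x_buffer, and y_buffer are assumptions.)
cl::sycl::queue my_queue(my_dev);
cl::sycl::buffer<double, 1> x_buffer(x.data(), cl::sycl::range<1>(x.size()));
cl::sycl::buffer<double, 1> y_buffer(y.data(), cl::sycl::range<1>(y.size()));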
// perform y = alpha*x + y
try {
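    // (Reconstructed call: the body of this try block was lost at a page
    // break. oneapi::mkl::blas::axpy is the routine named in the intro.)
    oneapi::mkl::blas::axpy(my_queue, n_elements, alpha,
                            x_buffer, incx, y_buffer, incy);
}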
catch(cl::sycl::exception const& e) {
std::cout << "\t\tCaught synchronous SYCL exception:\n"
<< e.what() << std::endl;
}
// print y_buffer
auto y_accessor = y_buffer.template get_access<cl::sycl::access::mode::read>();
std::cout << std::endl;
std::cout << "y" << " = [ " << y_accessor[0] << " ]\n";
std::cout << " [ " << y_accessor[1*incy] << " ]\n";
std::cout << " [ " << "... ]\n";
std::cout << std::endl;
return 0;
}
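A typical compile-and-link command (a sketch based on the icpx/icx-cl drivers described earlier; exact options depend on your installation):
• Linux: icpx -fsycl -qmkl axpy.cpp -o axpy.out
• Windows: icx-cl -fsycl /Qmkl axpy.cpp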
To run the resulting executable:
• On Linux: ./axpy.out
• On Windows: axpy.exe
The vector addition sample code is employed in this example. It takes advantage of the oneAPI programming
model to perform the addition on an accelerator.
The following command compiles and links the executable.
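For example (assuming the source file is named vector_add.cpp):

icpx -fsycl vector_add.cpp -o vector_add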
The components and function of the command and options are similar to those discussed in the API-Based
Code section above.
Execution of this command results in the creation of an executable file, which performs the vector addition when
run.
The traditional compilation flow is a standard compilation like the one used for C, C++, or other languages, used
when there is no offload to a device.
The traditional compilation phases are shown in the following diagram:
1. The front end translates the source into an intermediate representation and then passes that representa-
tion to the back end.
2. The back end translates the intermediate representation to object code and emits an object file (host.obj
on Windows*, host.o on Linux*).
3. One or more object files are passed to the linker.
4. The linker creates an executable.
5. The application runs.
The compilation flow for SYCL offload code adds steps for device code to the traditional compilation flow, with
JIT and AOT options for device code. In this flow, the developer compiles a SYCL application with icpx -fsycl,
and the output is an executable containing both host and device code.
The basic compilation phases for SYCL offload code are shown in the following diagram:
In the JIT compilation flow, the code for the device is translated to SPIR-V intermediate code by the back-end,
embedded in the fat binary as SPIR-V, and translated from SPIR-V to device code by the runtime. When the
application is run, the runtime determines the available devices and generates the code specific to that device.
This allows for more flexibility in where the application runs and how it performs than the AOT flow, which must
specify a device at compile time. However, performance may be worse because compilation occurs when the
application runs. Larger applications with significant amounts of device code may notice performance impacts.
Tip: The JIT compilation flow is useful when you do not know what the target device will be.
In the AOT compilation flow, the code for the device is translated to SPIR-V and then to device code by the
back-end at compile time, and the resulting device code is embedded in the generated fat binary. The AOT flow provides less
flexibility than the JIT flow because the target device must be specified at compilation time. However, exe-
cutable start-up time is faster than the JIT flow.
Tip:
• The AOT compilation flow is good when you know exactly which device you are targeting.
• The AOT flow is recommended when debugging your application as it speeds up the debugging cycle.
A fat binary is generated from the JIT and AOT compilation flows. It is a host binary that includes embedded
device code. The contents of the device code vary based on the compilation flow.
• The host code is an executable in either the ELF (Linux) or PE (Windows) format.
• The device code is a SPIR-V for the JIT flow or an executable for the AOT flow. Executables are in one of
the following formats:
– CPU: ELF (Linux), PE (Windows)
Tip: Unsure whether your workload fits best on CPU, GPU, or FPGA? Compare the benefits of CPUs, GPUs,
and FPGAs for different oneAPI compute workloads.
The traditional CPU workflow runs on the CPU without a runtime. The compilation flow is a standard compilation
used when there is no offload to a device, like the one used for C, C++, or other languages.
Traditional workloads are compiled and run on host using the Traditional Compilation Flow (Host-only Applica-
tion) process described in Compilation Flow Overview.
Example compilation command:
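The command itself was lost in the source; a representative host-only compile (file names illustrative) is:
icpx main.cpp -o app.out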
By default, if you are offloading to a CPU device, it goes through an OpenCL™ runtime, which also uses Intel
oneAPI Threading Building Blocks for parallelism.
When offloading to a CPU, work-groups map to different logical cores, and these work-groups can execute in
parallel. Each work-item in a work-group can map to a CPU SIMD lane. Work-items in a sub-group execute
together in a SIMD fashion.
To learn more about CPU execution, see Compare Benefits of CPUs, GPUs, and FPGAs for Different oneAPI
Compute Workloads.
1. Make sure you have followed all steps in the oneAPI Development Environment Setup section, including
running the setvars script.
2. Check if you have the required OpenCL runtime associated with the CPU using the sycl-ls command.
For example:
$ sycl-ls
CPU : OpenCL 2.1 (Build 0) [ 2020.11.12.0.14_160000 ]
GPU : OpenCL 3.0 NEO [ 21.33.20678 ]
GPU : Level-Zero 1.1 [ 1.2.20939 ]
3. Use one of the following code samples to verify that your code is running on the CPU. The code sample
adds scalar to large vectors of integers and verifies the results.
SYCL*
To run on a CPU, SYCL provides built-in device selectors for convenience. They use device_selector as a
base class. cpu_selector selects a CPU device.
Alternatively, you can use the following environment variable with default_selector to select a device
according to implementation-defined heuristics:
export SYCL_DEVICE_FILTER=cpu
#include <CL/sycl.hpp>
#include <array>
#include <iostream>
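The body of this sample was lost in the source. The following is a minimal sketch of a comparable sample, assuming the cpu_selector described above; the kernel name SimpleIota and the sizes are illustrative:
int main() {
  constexpr size_t size = 10000;
  constexpr int value = 100000;
  std::array<int, size> data{};

  // Select the CPU device explicitly with the built-in selector.
  cl::sycl::queue q{cl::sycl::cpu_selector{}};
  std::cout << "Running on device: "
            << q.get_device().get_info<cl::sycl::info::device::name>() << "\n";
  {
    cl::sycl::buffer<int, 1> buf{data.data(), cl::sycl::range<1>{size}};
    q.submit([&](cl::sycl::handler &h) {
      auto acc = buf.get_access<cl::sycl::access::mode::write>(h);
      // Add the scalar to every element of the vector.
      h.parallel_for<class SimpleIota>(cl::sycl::range<1>{size},
          [=](cl::sycl::id<1> i) { acc[i] = value + static_cast<int>(i[0]); });
    });
  } // Buffer destruction copies the results back to data.

  // Verify the results on the host.
  for (size_t i = 0; i < size; i++)
    if (data[i] != value + static_cast<int>(i)) {
      std::cout << "Failed on device.\n";
      return 1;
    }
  std::cout << "Successfully completed on device.\n";
  return 0;
}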
OpenMP*
OpenMP code sample:
#include <iostream>
#include <omp.h>
#include <stdlib.h>
#define N 1024
int main() {
  float *a = (float *)malloc(sizeof(float) * N);
  for (int i = 0; i < N; i++)
    a[i] = i;
  // Offload the increment loop to the target device.
  #pragma omp target teams distribute parallel for simd map(tofrom: a[:N])
  for (int i = 0; i < N; i++)
    a[i]++;
  std::cout << a[100] << "\n";
  free(a);
  return 0;
}
export LIBOMPTARGET_DEVICETYPE=cpu
./simple-ompoffload
Successfully completed on device
When offloading your application, it is important to identify the bottlenecks and which code will benefit from
offloading. If your code is compute intensive or contains highly data-parallel kernels, it is a good candidate for
offloading.
To find opportunities to offload your code, use the Intel Advisor for Offload Modeling.
The following list has some basic debugging tips for offloaded code.
• Run on the host target to verify the correctness of your code.
• Use printf to debug your application. Both SYCL and OpenMP offload support printf in kernel code.
• Use environment variables to control verbose log information.
– For SYCL, the following debug environment variables are recommended. A full list of environment
variables is available from GitHub.
– For OpenMP, the following debug environment variables are recommended. A full list is available
from the LLVM/OpenMP documentation.
• Use Ahead-of-Time (AOT) compilation to surface Just-in-Time (JIT) compilation issues at build time. For
more information, see Ahead-of-Time Compilation for CPU Architectures.
See Debugging the SYCL and OpenMP Offload Process for more information on debug techniques and de-
bugging tools available with oneAPI.
There are many factors that can affect the performance of CPU offload code. The number of work-items, work-
groups, and amount of work done depends on the number of cores in your CPU.
• If the amount of work being done by the core is not compute-intensive, then this could hurt performance.
This is because of the scheduling overhead and thread context switching.
• On a CPU, there is no need for data transfer through PCIe, resulting in lower latency because the offload
region does not have to wait long for the data.
• Based on the nature of your application, thread affinity could affect the performance on CPU. For details,
see Control Binary Execution on Multiple Cores.
• Offloaded code uses JIT compilation by default. Use AOT compilation (offline compilation) instead. With
offline compilation, you can target your code to a specific CPU architecture. Refer to Optimization Flags
for CPU Architectures for details.
Additional recommendations are available from Optimize Offload Performance.
The commands below implement a scenario in which part of the device code resides in a static library.
ar cr libstlib.a static_lib.o
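The compile and link steps that surround this archiving command were lost in the source; a representative sketch (file names illustrative) is to first compile the device code into a fat object, then link the archived library into the application:
icpx -fsycl -c static_lib.cpp -o static_lib.o
icpx -fsycl main.cpp libstlib.a -o app.out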
In ahead-of-time (AOT) compilation mode, optimization flags can be used to produce code aimed to run better
on a specific CPU architecture.
icpx -fsycl -fsycl-targets=spir64_x86_64 -Xs "-device <CPU optimization flags>" a.cpp b.cpp -o app.out
Note: The set of supported optimization flags may be changed in future releases.
Environment Variables
The following environment variables control the placement of SYCL* or OpenMP* threads on multiple CPU
cores during program execution. Use these variables if you are using the OpenCL™ runtime CPU device to
offload to a CPU.
See the Intel oneAPI DPC++/C++ Compiler Developer Guide and Reference for more information about all sup-
ported environment variables.
Assume a machine with 2 sockets, 4 physical cores per socket, and each physical core has 2 hyper threads.
• S<num> denotes the socket number that has 8 cores specified in a list
• T<num> denotes the Intel® oneAPI Threading Building Blocks (Intel® oneTBB) thread number
• “-” means unused core
export DPCPP_CPU_NUM_CUS=16
export DPCPP_CPU_PLACES=sockets
DPCPP_CPU_CU_AFFINITY=close: S0:[T0 T1 T2 T3 T4 T5 T6 T7] S1:[T8 T9 T10 T11 T12 T13 T14 T15]
DPCPP_CPU_CU_AFFINITY=spread: S0:[T0 T2 T4 T6 T8 T10 T12 T14] S1:[T1 T3 T5 T7 T9 T11 T13 T15]
export DPCPP_CPU_PLACES=cores
DPCPP_CPU_CU_AFFINITY=close: S0:[T0 T8 T1 T9 T2 T10 T3 T11] S1:[T4 T12 T5 T13 T6 T14 T7 T15]
DPCPP_CPU_CU_AFFINITY=spread: S0:[T0 T8 T2 T10 T4 T12 T6 T14] S1:[T1 T9 T3 T11 T5 T13 T7 T15]
DPCPP_CPU_CU_AFFINITY=master: S0:[T0 T1 T2 T3 T4 T5 T6 T7] S1:[T8 T9 T10 T11 T12 T13 T14 T15]
export DPCPP_CPU_PLACES=threads
DPCPP_CPU_CU_AFFINITY=close: S0:[T0 T1 T2 T3 T4 T5 T6 T7] S1:[T8 T9 T10 T11 T12 T13 T14 T15]
DPCPP_CPU_CU_AFFINITY=spread: S0:[T0 T2 T4 T6 T8 T10 T12 T14] S1:[T1 T3 T5 T7 T9 T11 T13 T15]
export DPCPP_CPU_NUM_CUS=8
DPCPP_CPU_PLACES=sockets, cores and threads have the same bindings:
DPCPP_CPU_CU_AFFINITY=close: S0:[T0 - T1 - T2 - T3 -] S1:[T4 - T5 - T6 - T7 -]
DPCPP_CPU_CU_AFFINITY=spread: S0:[T0 - T2 - T4 - T6 -] S1:[T1 - T3 - T5 - T7 -]
DPCPP_CPU_CU_AFFINITY=master: S0:[T0 T1 T2 T3 T4 T5 T6 T7] S1:[]
Now assume a machine with 2 sockets, 4 physical cores per socket, and no hyper-threading (8 cores total).
• S<num> denotes the socket number that has 8 cores specified in a list
• T<num> denotes the Intel oneTBB thread number
• “-” means unused core
export DPCPP_CPU_NUM_CUS=8
DPCPP_CPU_PLACES=sockets, cores and threads have the same bindings:
DPCPP_CPU_CU_AFFINITY=close: S0:[T0 T1 T2 T3] S1:[T4 T5 T6 T7]
DPCPP_CPU_CU_AFFINITY=spread: S0:[T0 T2 T4 T6] S1:[T1 T3 T5 T7]
DPCPP_CPU_CU_AFFINITY=master: S0:[T0 T1 T2 T3] S1:[T4 T5 T6 T7]
export DPCPP_CPU_NUM_CUS=4
DPCPP_CPU_PLACES=sockets, cores and threads have the same bindings:
DPCPP_CPU_CU_AFFINITY=close: S0:[T0 - T1 - ] S1:[T2 - T3 - ]
DPCPP_CPU_CU_AFFINITY=spread: S0:[T0 - T2 - ] S1:[T1 - T3 - ]
DPCPP_CPU_CU_AFFINITY=master: S0:[T0 T1 T2 T3] S1:[ - - - - ]
Tip: Unsure whether your workload fits best on CPU, GPU, or FPGA? Compare the benefits of CPUs, GPUs,
and FPGAs for different oneAPI compute workloads.
Offloading a program to a GPU defaults to the Level Zero runtime, with an option to switch to the OpenCL™
runtime. In SYCL* and OpenMP* offload, each work-item is mapped to a SIMD lane. A sub-group, formed from
work-items that execute in parallel, maps to the SIMD width, and sub-groups are mapped to GPU EU threads.
Work-groups, which include work-items that can synchronize and share local data, are assigned for execution
on compute units (that is, streaming multiprocessors or Xe cores, also known as sub-slices). Finally, the entire
global ND-range of work-items maps to the entire GPU.
To learn more about GPU execution, see Compare Benefits of CPUs, GPUs, and FPGAs for Different oneAPI
Compute Workloads.
1. Make sure you have followed all steps in the oneAPI Development Environment Setup section, including
running the setvars script.
2. Configure your GPU system by installing drivers and add the user to the video group. See the Get Started
Guide for instructions:
• Get Started with Intel oneAPI Base Toolkit for Linux* | Windows* | MacOS*
• Get Started with Intel oneAPI HPC Toolkit for Linux* | Windows* | MacOS*
• Get Started with Intel oneAPI IoT Toolkit for Linux* | Windows*
3. Check if you have a supported GPU and the necessary drivers installed using the sycl-ls command. In
the following example, if you had both the OpenCL and Level Zero drivers installed, you would see two
entries for the GPU, one for each runtime:
4. Use one of the following code samples to verify that your code is running on the GPU. The code sample
adds scalar to large vectors of integers and verifies the results.
SYCL
To run on a GPU, SYCL provides built-in device selectors using device_selector as a base class.
gpu_selector selects a GPU device. You can also create your own custom selector. For more information, see the
Choosing Devices section in Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems
using C++ and SYCL (book).
SYCL code sample:
#include <CL/sycl.hpp>
#include <array>
#include <iostream>
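The rest of this sample was lost in the source; as a minimal sketch, it parallels the CPU sample shown earlier, with gpu_selector in place of cpu_selector:
// Select a GPU device; construction throws if no GPU is available.
cl::sycl::queue q{cl::sycl::gpu_selector{}};
std::cout << "Running on device: "
          << q.get_device().get_info<cl::sycl::info::device::name>() << "\n";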
./simple-iota
Running on device: Intel® UHD Graphics 630 [0x3e92]
Successfully completed on device.
OpenMP*
OpenMP code sample:
#include <stdlib.h>
#include <omp.h>
#include <iostream>
constexpr size_t array_size = 10000;
#pragma omp requires unified_shared_memory
int main(){
constexpr int value = 100000;
// Returns the default target device.
int deviceId = (omp_get_num_devices() > 0) ? omp_get_default_device() : omp_get_initial_device();
// Allocate array_size integers in host-visible and device memory.
int *sequential = (int *)omp_target_alloc_host(array_size * sizeof(int), deviceId);
int *parallel = (int *)omp_target_alloc(array_size * sizeof(int), deviceId);
// (initialization, offloaded computation, and result verification elided in the source)
omp_target_free(sequential, deviceId);
omp_target_free(parallel, deviceId);
./simple-iota
Successfully completed on device.
Note: If an offload region is present but no accelerator is available, the kernel falls back to traditional host
compilation (without the OpenCL runtime) unless you set the environment variable
OMP_TARGET_OFFLOAD=mandatory.
To decide which GPU hardware and what parts of the code to offload, refer to the GPU optimization workflow
guide.
To find opportunities to offload your code to GPU, use the Intel Advisor for Offload Modeling.
The following list has some basic debugging tips for offloaded code.
• Run on the CPU or host target, or switch the runtime to OpenCL, to verify the correctness of your code.
• You could use printf to debug your application. Both SYCL and OpenMP offload support printf in kernel
code.
• Use environment variables to control verbose log information.
For SYCL, the following debug environment variables are recommended. A full list is available from GitHub.
For OpenMP, the following debug environment variables are recommended. A full list is available from the
LLVM/OpenMP documentation.
Use Ahead-of-Time (AOT) compilation to surface Just-in-Time (JIT) compilation issues at build time.
CL_OUT_OF_RESOURCES Error
The CL_OUT_OF_RESOURCES error can occur when a kernel uses more __private or __local memory
than the emulator supports by default. When this occurs, you will see an error message similar to this:
$ ./myapp
:
Problem size: c(150,600) = a(150,300) * b(300,600)
Or if using onetrace:
$ onetrace -c ./myapp
:
>>>> [6254070891] zeKernelSuggestGroupSize: hKernel = 0x263b7a0 globalSizeX = 163850 globalSizeY = 1 globalSizeZ = 1 groupSizeX = 0x7fff94e239f0 groupSizeY = 0x7fff94e239f4 groupSizeZ = 0x7fff94e239f8
<<<< [6254082074] zeKernelSuggestGroupSize [922 ns] -> ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY(0x1879048195)
terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what(): Native API failed. Native API returns: -5 (CL_OUT_OF_RESOURCES) -5 (CL_OUT_OF_RESOURCES)
To see how much memory was being copied to shared local memory and the actual hardware limit, set debug
keys:
export PrintDebugMessages=1
export NEOReadDebugKeys=1
$ ./myapp
:
Size of SLM (656384) larger than available (131072)
terminate called after throwing an instance of 'cl::sycl::runtime_error'
  what(): Native API failed. Native API returns: -5 (CL_OUT_OF_RESOURCES) -5 (CL_OUT_OF_RESOURCES)
Aborted (core dumped)
$
$ onetrace -c ./myapp
:
>>>> [317651739] zeKernelSuggestGroupSize: hKernel = 0x2175ae0 globalSizeX = 163850 globalSizeY = 1 globalSizeZ = 1 groupSizeX = 0x7ffd9caf0950 groupSizeY = 0x7ffd9caf0954 groupSizeZ = 0x7ffd9caf0958
Size of SLM (656384) larger than available (131072)
<<<< [317672417] zeKernelSuggestGroupSize [10325 ns] -> ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY(0x1879048195)
See Debugging the DPC++ and OpenMP Offload Process for more information on debug techniques and de-
bugging tools available with oneAPI.
There are multiple ways to optimize offloaded code. The following list provides some starting points. Review the
oneAPI GPU Optimization Guide for additional information.
• Reduce overhead of memory transfers between host and device.
• Have enough work to keep the cores busy and reduce the data transfer overhead cost.
• Use GPU memory hierarchy like GPU caches, shared local memory for faster memory accesses.
• Use AOT compilation (offline compilation) instead of JIT compilation. With offline compilation, you can
target your code to a specific GPU architecture. Refer to Offline Compilation for GPU for details.
• The Intel® GPU Occupancy Calculator allows you to compute the occupancy of an Intel® GPU for a given
kernel and work group parameters.
Additional recommendations are available from Optimize Offload Performance.
The examples below illustrate how to create and use static libraries with device code on Linux.
ar cr libstlib.a static_lib.o
The following example command produces app.out for a specific GPU target:
For DPC++ and for OpenMP* offload, respectively:
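The commands themselves did not survive in the source; the following are representative sketches, assuming the spir64_gen AOT target and a placeholder <device name>:
icpx -fsycl -fsycl-targets=spir64_gen -Xs "-device <device name>" a.cpp b.cpp -o app.out
icpx -fiopenmp -fopenmp-targets=spir64_gen -Xopenmp-target-backend "-device <device name>" a.cpp b.cpp -o app.out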
A list of allowed values for the device name is available from the Intel® oneAPI DPC++/C++ Compiler Developer
Guide and Reference.
Tip: You can also learn about programming for FPGA devices in detail from the Data Parallel C++ book available
at https://link.springer.com/chapter/10.1007/978-1-4842-5574-2_17.
FPGAs differ from CPUs and GPUs in some ways. A significant difference compared to CPU or GPU is gener-
ating a device binary for FPGA hardware, which is a computationally intensive and time-consuming process. It
is normal for an FPGA compile to take several hours to complete. For this reason, only ahead-of-time (or offline)
kernel compilation mode is supported for FPGA. The long compile time for FPGA hardware makes just-in-time
(or online) compilation impractical.
Longer compile times are detrimental to developer productivity. The Intel® oneAPI DPC++/C++ Compiler pro-
vides several mechanisms that enable you to target FPGA and iterate quickly on your designs. By circumventing
the time-consuming process of full FPGA compilation wherever possible, you can benefit from the faster com-
pile times that you are familiar with for CPU and GPU development.
SYCL supports accelerators in general. The Intel® oneAPI DPC++/C++ Compiler implements additional FPGA-
specific support to assist FPGA code development. This article highlights FPGA development using the com-
piler and related tools for SYCL code development targeting FPGAs.
The following table summarizes the types of FPGA compilation:
A typical FPGA development workflow is to iterate in the emulation, simulation, and optimization report stages,
refining your code using the feedback provided by each stage. Intel® recommends relying on emulation and the
FPGA optimization report for as much of this iteration as possible.
Tip: To compile for FPGA emulation, compile for FPGA simulation, or generate the FPGA optimization report,
you require only the Intel® oneAPI DPC++/C++ Compiler that is part of the Intel® oneAPI Base Toolkit.
An FPGA hardware compile requires installing the Intel® Quartus® Prime software separately. Targeting a board
also requires that you install the BSP for the board.
For more information, refer to the Intel® oneAPI Toolkits Installation Guide and Intel® FPGA development flow
webpage.
Also, generating RTL code for an IP component requires only the Intel® oneAPI DPC++/C++ Compiler that is part
of the Intel® oneAPI Base Toolkit. However, integrating that IP component into your hardware design requires
installing Intel® Quartus® Prime software.
FPGA Emulator
The FPGA emulator (Intel® FPGA Emulation Platform for OpenCL™ software) is the fastest method to verify the
correctness of your code. It executes the SYCL device code on the CPU. The emulator is similar to the SYCL
host device, but unlike the host device, the FPGA emulator device supports FPGA extensions such as FPGA
pipes and fpga_reg. For more information, refer to Pipes Extension and Kernel Variables topics in the FPGA
Optimization Guide for Intel® oneAPI Toolkits.
The following are some important caveats to remember when using the FPGA emulator:
• Performance is not representative.
Never draw inferences about FPGA performance from the FPGA emulator. The FPGA emulator’s timing
behavior is not correlated to that of the physical FPGA hardware. For example, an optimization that yields
a 100x performance improvement on the FPGA may not impact the emulator performance. The emulator
might show an unrelated increase or decrease.
• Undefined behavior may differ.
If your code produces different results when compiled for the FPGA emulator versus FPGA hardware,
your code most likely exercises undefined behavior. By definition, undefined behavior is not specified by
the language specification and might manifest differently on different targets.
For detailed information about emulation for full-stack acceleration kernels, refer to Emulate Your Kernel.
For information about emulation of IP components, refer to Emulate and Debug Your IP Component.
FPGA Simulator
The simulation flow allows you to use the Questa*-Intel® FPGA Edition simulator software to simulate the exact
behavior of the synthesized kernel. Like emulation, you can run simulation on a system that does not have a
target FPGA board installed. The simulator models a kernel much more accurately than the emulator, but it is
much slower than the emulator.
The simulation flow is cycle-accurate and bit-accurate. It exactly models the behavior of a kernel’s datapath and
the results of operations on floating-point data types. However, simulation cannot accurately model variable-
latency memories or other external interfaces. Intel recommends that you simulate your design with a small input
dataset because simulation is much slower than running on FPGA hardware or emulator.
You can use the simulation flow in conjunction with profiling to collect additional information about your design.
For more information about profiling, refer to Intel® FPGA Dynamic Profiler for DPC++ in the FPGA Optimization
Guide for Intel® oneAPI Toolkits.
Note: You cannot debug kernel code compiled for simulation using the GNU Project Debugger (GDB)*, Mi-
crosoft* Visual Studio*, or any standard software debugger.
For more information about the simulation flow, refer to one of the following topics:
• Evaluate Your Kernel Through Simulation
• Evaluate Your IP Component Through Simulation
A full FPGA compilation occurs in the following stages, and optimization reports are generated after both stages:
When your compilation targets an FPGA device or part number, this stage gives you RTL files for the IP com-
ponent in your code. You can then use Intel® Quartus® Prime software to integrate your IP components into a
larger design.
FPGA Hardware
An FPGA hardware compile requires the Intel® Quartus® Prime software (installed separately). This is a full com-
pilation stage through to the FPGA hardware image where you can target one of the following:
• Intel® FPGA device family
• Specific Intel® FPGA device part number
• Custom board
• Intel® Programmable Acceleration Card (PAC) (deprecated)
For more information about the targets, refer to the Intel® oneAPI DPC++/C++ Compiler System Requirements.
For more information about using Intel® PAC or custom boards, refer to the FPGA BSPs and Boards section
and the Intel® oneAPI Toolkits Installation Guide for Linux* OS.
FPGA compilation flags control the FPGA image type the Intel® oneAPI DPC++/C++ Compiler targets.
The following are examples of Intel® oneAPI DPC++/C++ Compiler commands that target the FPGA image types:
The following table explains the compiler flags used in the above example commands:
Note: Using the prefix -Xs causes an argument to be passed to the FPGA backend.
-fsycl-link=early   Instructs the compiler to stop after creating the FPGA early image (and associated
optimization report).
Warning: The output of an icpx compile command overwrites the output of previous compiles that used the
same output name. Therefore, Intel® recommends using unique output names (specified with -o). This is
especially important for FPGA compilation since a lost hardware image may take hours to regenerate.
In addition to the compiler flags demonstrated by the commands above, there are flags to control the verbosity of
the icpx command’s output, the number of parallel threads to use during compilation, and so on. The following
section briefly describes those flags.
The Intel® oneAPI DPC++/C++ Compiler offers several options that allow you to customize the kernel compilation
process. The following table summarizes other options supported by the compiler:
For more information about FPGA optimization flags, refer to the Optimization Flags section in the FPGA Opti-
mization Guide for Intel® oneAPI Toolkits.
The Intel® FPGA Emulation Platform for OpenCL™ software (also referred to as the emulator or the FPGA em-
ulator) assesses the functionality of your kernel. The emulator supports 64-bit Windows and Linux operating
systems. On Linux systems, the GNU C Library (glibc) version 2.15 or later is required.
Note:
• You cannot use the execution time of an emulated design to estimate its execution time on an FPGA.
Furthermore, running an emulated design is not a substitute for natively running a functionally equivalent
C/C++ implementation on an x86-64 host.
• Emulation does not support cross-compilation to an ARM® processor. To run emulation on a design that
targets an ARM SoC device, emulate on a non-SoC board (for example, intel_a10gx_pac or
intel_s10sx_pac). When satisfied with the emulation results, you can target your design on an SoC board for
subsequent optimization steps.
• For information about debugging with Intel® Distribution for GDB*, refer to the following:
– Debugging with Intel® Distribution for GDB* on Linux* OS Host
– Get Started with Intel® Distribution for GDB* on Linux* OS Host
– Get Started with Intel® Distribution for GDB* on Windows* OS Host
Emulator Installation
The Intel FPGA Emulation Platform for OpenCL software is installed as part of the Intel® oneAPI Base Toolkit.
For information about how to install this base kit, refer to the Intel® oneAPI Toolkits Installation Guides.
Refer to the following topics for additional information:
• Emulator Environment Variables
• Emulate Pipe Depth
• Emulate Applications with a Pipe That Reads or Writes to an I/O Pipe
• Compile and Emulate Your Design
• Limitations of the Emulator
• Discrepancies in Hardware and Emulator Results
• Emulator Known Issues
The following table lists environment variables that you can use to modify the behavior of the emulator:
Note: On Windows, the FPGA emulator can silently fail by running out of
memory. As a workaround to catch this error, write your kernel code using
the try-catch syntax.
CL_CONFIG_CHANNEL_DEPTH_EMULATION_MODE   When you compile your kernel for emulation, the pipe
depth is different from the pipe depth generated when your kernel is compiled for hardware. You can change
this behavior with the CL_CONFIG_CHANNEL_DEPTH_EMULATION_MODE environment variable. For details, see
Emulate Pipe Depth.
When you compile your kernel for emulation, the default pipe depth is different from the default pipe depth gen-
erated when your kernel is compiled for hardware. You can change this behavior when you compile your kernel
for emulation with the CL_CONFIG_CHANNEL_DEPTH_EMULATION_MODE environment variable.
Important: For pipes, you must set the CL_CONFIG_CHANNEL_DEPTH_EMULATION_MODE environment variable
before running the host program.
Table 12: CL_CONFIG_CHANNEL_DEPTH_EMULATION_MODE values

ignoredepth   All pipes are given a pipe depth chosen to provide the fastest execution time for your kernel
emulation. Any explicitly set pipe depth attribute is ignored.
default       Pipes with an explicit depth attribute have their specified depth. Pipes without a specified depth
are given a default pipe depth chosen to provide the fastest execution time for your kernel emulation.
strict        All pipe depths in the emulation are given a depth that matches the depth given for the FPGA
compilation. If no depth is specified, a depth of 1 is used. This value is the default if the
CL_CONFIG_CHANNEL_DEPTH_EMULATION_MODE environment variable is not set.
The Intel® FPGA Emulation Platform for OpenCL™ software emulates kernel-to-kernel pipes. However, it does
not support interacting directly with the hardware I/O pipes on your target board. Nevertheless, you can emulate
the behavior of I/O pipes using the following procedures:
1. Store input data to be transferred to the pipe in a file with a name matching the id specialization of the pipe.
2. Output data is automatically written to a file with a name matching the id specialization of the output pipe.
To compile and emulate your FPGA kernel design, perform the following steps:
1. Modify the host part of your program to declare the ext::intel::fpga_emulator_selector device se-
lector. Use this device_selector when instantiating a device queue for enqueuing your FPGA device
kernel.
2. Compile your design by including the -fintelfpga option in your icpx command to generate an exe-
cutable.
3. Run the resulting executable:
• For Windows:
a. Define the number of emulated devices by invoking the following command:
set CL_CONFIG_CPU_EMULATE_DEVICES=<number_of_devices>
This command specifies the number of identical emulation devices that the emulator must provide. To unset
the variable after emulation, run:
set CL_CONFIG_CPU_EMULATE_DEVICES=
Tip: If you want to use only one emulator device, you need not set the CL_CONFIG_CPU_EMULATE_-
DEVICES environment variable.
Note:
• The Intel® FPGA Emulation Platform for OpenCL™ does not provide access to physical boards. Only the
emulated devices are available.
• The emulator is built with GCC 7.4.0 as part of the Intel® oneAPI DPC++/C++ Compiler. When running
the executable for an emulated FPGA device, the version of libstdc++.so must be at least that of GCC
7.4.0. In other words, the LD_LIBRARY_PATH environment variable must ensure that the correct version of
libstdc++.so is found.
If the correct version of libstdc++.so is not found, the call to clGetPlatformIDs function fails to load
the FPGA emulator platform and returns CL_PLATFORM_NOT_FOUND_KHR (error code -1001). Depending
on which version of libstdc++.so is found, the call to clGetPlatformIDs may succeed, but a later call to
the clCreateContext function may fail with CL_DEVICE_NOT_AVAILABLE (error code -2).
If the LD_LIBRARY_PATH does not point to a compatible libstdc++.so, use the following syntax to invoke
the host program:
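The invocation syntax itself was lost in the source; a representative sketch, assuming <path> points to a directory containing a compatible libstdc++.so and myapp is the host executable, is:
env LD_LIBRARY_PATH=<path>:$LD_LIBRARY_PATH ./myapp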
• To enable debugging of kernel code, optimizations are disabled by default for the FPGA emulator. This
can lead to sub-optimal execution speed when emulating kernel code. You can pass the -g0 flag to the
icpx compile command to disable debugging and enable optimizations. This enables faster emulator
execution.
• When targeting the FPGA emulator device, use the -O2 compiler flag to turn on optimizations and speed
up the emulation. To turn off optimizations (for example, to facilitate debugging), pass -O0.
The Intel® FPGA Emulation Platform for OpenCL™ software has the following limitations:
• Concurrent execution
Modeling of concurrent kernel executions has limitations. During execution, the emulator is not guaran-
teed to run interacting work items in parallel. Therefore, some concurrent execution behaviors, such as
different kernels accessing global memory without a barrier for synchronization, might generate inconsis-
tent emulation results between executions.
• Same address space execution
The emulator executes the host runtime and kernels in the same address space. Certain pointer or array
use in your host application might cause the kernel program to fail and vice versa. Example uses include
indexing externally allocated memory and writing to random pointers. To analyze your program, you may
use memory leak detection tools, such as Valgrind. However, the host might encounter a fatal error caused
by out-of-bounds write operations in your kernel and vice versa.
• Conditional pipe operations
Emulation of pipe behavior has limitations, especially for conditional pipe operations where the kernel
does not call the pipe operation in every loop iteration. In these cases, the emulator might execute pipe
operations in a different order than on the hardware.
• GCC version
You must run the emulator host programs on Linux with a version of libstdc++.so from GCC 7.4.0 or later.
You can achieve this either by installing GCC 7.4.0 or later on your system or setting the LD_LIBRARY_PATH
environment variable such that a compatible libstdc++.so is identified.
When you emulate a kernel, your kernel might produce results different from the kernel compiled for hardware.
You can further debug your kernel before you compile for hardware by running your kernel through simulation.
Warning: These discrepancies usually occur when the Intel® FPGA Emulation Platform for OpenCL™ is
unable to model some aspects of the hardware computation accurately or when your program relies on un-
defined behavior.
The most common reasons for differences in emulator and hardware results are as follows:
• Your kernel code is using the ivdep attribute. The emulator does not model your kernel when the ivdep
attribute breaks a true dependence. During a full hardware compilation, you observe this as an incorrect
result.
• Your kernel code relies on uninitialized data. Examples of uninitialized data include uninitialized variables
and uninitialized or partially initialized global buffers, local arrays, and private arrays.
• Your kernel code behavior depends on the precise results of floating-point operations. The emulator uses
floating-point computation hardware of the CPU, whereas the hardware run uses floating-point cores im-
plemented as FPGA cores.
Note: The SYCL* standard allows one or more least significant bits of floating-point computations to differ
between platforms while still being considered correct on both such platforms.
• Your kernel code behavior depends on the order of pipe accesses in different kernels. The emulation of
channel behavior has limitations, especially for conditional channel operations where the kernel does not
call the channel operation in every loop iteration. In such cases, the emulator might execute channel op-
erations in an order different from that of the hardware.
• Your kernel or host code is accessing global memory buffers out-of-bounds.
Note:
– Uninitialized memory read and write behaviors are platform-dependent. Verify the sizes of your
global memory buffers when using all addresses within kernels.
– You can use software memory leak detection tools, such as Valgrind, on the emulated version of
your kernel to analyze memory-related problems. The absence of warnings from such tools does
not mean the absence of issues. It only means that the tool could not detect any problem. In such a
scenario, Intel recommends manual verification of your kernel or host code.
• Your kernel code is accessing local variables out-of-bounds. For example, accessing a local array out-of-
bounds or accessing a variable after it has gone out of scope.
Note: In software terms, these issues are stack corruption issues because accessing variables out of
bounds usually affects unrelated variables located close to the variable being accessed on a software
stack. Emulated kernels are implemented as regular CPU functions and have an actual stack that can be
corrupted. When targeting hardware, no stack exists. Hence, the stack corruption issues are guaranteed
to manifest differently. When you suspect a stack corruption, use memory leak analyzer tools, such as Val-
grind. However, stack-related issues are usually difficult to identify. Intel recommends manual verification
of your kernel code to debug a stack-related issue.
• Your kernel code uses shifts that are larger than the type being shifted. For example, shifting a 64-bit inte-
ger by 65 bits. According to the SYCL specification version 1.0, the behavior of such shifts is undefined.
• When you compile your kernel for emulation, the default pipe depth is different from the default pipe depth
generated when your kernel is compiled for hardware. This difference in pipe depths might lead to sce-
narios where execution on the hardware hangs while kernel emulation works without any issue. Refer to
Emulate Pipe Depth for information about fixing the channel depth difference.
• In terms of ordering the printed lines, the output of the cout stream function might be ordered differently
on the emulator and hardware. This is because, in the hardware, cout stream data is stored in a global
memory buffer and flushed from the buffer only when the kernel execution is complete or when the buffer
is full. In the emulator, the cout stream function uses the x86 stdout.
• The hardware and emulator might produce different results if you perform an unaligned load/store
through upcasting of types. A load/store of this type is undefined in the C99 specification. For example,
the following operation might produce unexpected results:
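The original example was lost in extraction; the following is an illustrative C++ sketch of an unaligned load through an upcast pointer:
// buf + 1 is not suitably aligned for int; reading through the cast is
// undefined behavior and may differ between the emulator and hardware:
char buf[16] = {0};
int v = *reinterpret_cast<int *>(buf + 1);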
A few known issues might affect your use of the emulator. Review these issues to avoid possible problems when
using the emulator.
Compiler Diagnostics
Some compiler diagnostics are not yet implemented for the emulator.
CL_OUT_OF_RESOURCES Error
This error can occur when a kernel uses more __private or __local memory than the emulator supports by default.
Once you have determined the amount of memory needed, try setting larger values for the CL_CONFIG_-
CPU_FORCE_PRIVATE_MEM_SIZE or the CL_CONFIG_CPU_FORCE_LOCAL_MEM_SIZE environment variable, as de-
scribed in Emulator Environment Variables.
Note: On Windows, the FPGA emulator can silently fail by running out of memory. As a workaround to catch
this error, write your kernel code using the try-catch syntax.
The oneAPI FPGA runtime does not support emulation binaries built using an earlier version of oneAPI. You
must recompile emulation binaries with the current oneAPI release.
When debugging unknown behaviors that differ between emulation and simulation/hardware, Intel recommends
using the -Weverything diagnostic option for emulation. The -Weverything option turns on all warnings,
allowing you to utilize all available diagnostics and expose risky coding patterns that you might be using
inadvertently in your design.
The Questa*-Intel® FPGA Edition simulator software assesses the functionality of your kernel.
The simulator flow generates a simulation binary file that runs on the host. The hardware portion of your code
is evaluated in an RTL simulator, and the host portion is executed natively on the processor. This feature allows
you to simulate the functionality of your kernel and iterate on your design without needing to compile your kernel
to hardware and run it on the FPGA each time.
Note: The performance of the simulator is very slow when compared to that of hardware. So, Intel recommends
using a smaller data set for testing.
Use the simulator when you want an insight into the dynamic performance of your kernel and more information
about the functional correctness of your kernel than emulation or the reporting tools provide.
The simulator is cycle-accurate and bit-accurate. It has a netlist identical to the generated hardware and can
provide full waveforms for debugging. View the waveforms with Siemens* EDA (formerly Mentor Graphics)
Questa* software.
Simulation Prerequisites
To use the FPGA simulation flow, you must download the following prerequisite software:
• Intel Quartus Prime Pro Edition software: Download this package from the FPGA Software Download
Center download page.
• Compatible simulation software (Questa*-Intel® FPGA Edition and Questa*-Intel® FPGA Starter
Edition): Obtain them from the FPGA Software Download Center.
Note:
– The Questa*-Intel® FPGA Edition requires a license. However, Questa*-Intel® FPGA Starter Edition
is free but requires a zero-cost license. For additional details, refer to the Licensing chapter of the
Intel FPGA Software Installation and Licensing.
– You can also use your licensed version of Siemens* EDA ModelSim* SE or Siemens* EDA Questa
Advanced Simulator software. For information about all ModelSim* and Questa* software versions
that your Intel® Quartus® Prime Pro Edition software supports, refer to the EDA Interface Information
section of the Intel® Quartus® Prime Pro Edition: <version_number> Software and Device Support
Release Notes.
– On Linux systems, you must install Red Hat* development tools to work with Questa*-Intel® FPGA
Edition and Questa*-Intel® FPGA Starter Edition software.
Note: You must download both the Questa - Intel FPGA Edition (includes Starter Edition) package and the
Questa - Intel FPGA Edition (includes Starter Edition) Part 2 package.
6. Accept the Software License Agreement by clicking the Accept button. File download starts automati-
cally.
7. Obtain and set up the license for the simulation software.
For comprehensive information about installing the Intel Quartus Prime software, including system re-
quirements, prerequisites, and licensing requirements, refer to Intel FPGA Software Installation and Li-
censing.
8. Run the Questa*-Intel FPGA Edition installer. The installer prompts you to select between the Questa*-
Intel® FPGA Starter Edition (free) and the Questa*-Intel FPGA Edition software.
9. Select the simulation software for which you have obtained a license. The installer prompts you to choose
where to install the Questa* simulation software.
10. Select the directory in which to install the Questa* simulation software. Although not mandatory, Intel
recommends installing the Questa* software in the same location as the Intel Quartus Prime Pro Edition
software directory.
Note: From within the oneAPI environment, you can determine the Intel® Quartus® Prime software installation
location by inspecting the QUARTUS_ROOTDIR_OVERRIDE environment variable.
You must add directories containing the Intel® Quartus® Prime and Questa* simulation software binaries to your
PATH environment variable.
Note: Commands listed in this topic assume that you have installed the Questa* simulation software alongside
the Intel® Quartus® Prime Pro Edition software, as mentioned in the Simulation Prerequisites. If you installed
the Questa* simulation software elsewhere, you must modify the PATH environment variable appropriately.
For the FPGA simulation flow only, you must explicitly add the Intel® Quartus® Prime software binary directory
to your PATH environment variable using the following command:
• Linux
$ export PATH=$PATH:<quartus_installdir>/quartus/bin
• Windows
set "PATH=%PATH%;<quartus_installdir>\quartus\bin64"
Additionally, you must also set the OCL_ICD_FILENAMES variable to specify the Installable Client Driver
(ICD) to load.
set "OCL_ICD_FILENAMES=%OCL_ICD_FILENAMES%;alteracl_icd.dll"
For the free Questa*-Intel® FPGA Starter Edition software, run the following command:
• Linux
$ export PATH=$PATH:<quartus_installdir>/questa_fse/bin
• Windows
set "PATH=%PATH%;<quartus_installdir>\questa_fse\win64"
For the licensed Questa*-Intel® FPGA Edition software, run the following command:
• Linux
$ export PATH=$PATH:<quartus_installdir>/questa_fe/bin
• Windows
set "PATH=%PATH%;<quartus_installdir>\questa_fe\win64"
Before performing simulation, you must ensure that you have installed the Intel® Quartus Prime Pro Edition soft-
ware on your system. For more information, refer to the Intel® oneAPI Toolkits Installation Guide and Intel® FPGA
development flow webpage.
To compile a kernel for simulation, include the -Xssimulation option in your icpx command as shown in the
following:
To enable collecting the waveform during the simulation, include the -Xsghdl[=<depth>] option in your icpx
command, where the optional <depth> attribute specifies how many levels of hierarchy are logged. If you do not
specify a value for the <depth> attribute, a depth of 1 is used by default.
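The example commands did not survive extraction; a representative sketch combining both options (source and output file names illustrative) is:
icpx -fsycl -fintelfpga -Xssimulation -Xsghdl=2 kernel.cpp -o kernel_sim.out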
When simulating on Windows systems, you need the Microsoft linker and additional compilation time libraries.
Verify the following settings:
• The PATH environment variable setting must include the path to the LINK.EXE file in Microsoft Visual Stu-
dio.
• The LIB environment variable setting must include the path to the Microsoft compile-time libraries. The
compile-time libraries are available with Microsoft Visual Studio.
If you want to use the simulation flow and view the waveforms generated during simulation, you must have either
the Siemens EDA* Questa Simulator or ModelSim SE installed and available.
To run your SYCL library through the simulator:
1. Set the CL_CONTEXT_MPSIM_DEVICE_INTELFPGA environment variable to enable the simulation device:
• Linux
export CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1
• Windows
set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1
Note: When the environment variable CL_CONTEXT_MPSIM_DEVICE_INTELFPGA is set, only the simula-
tion devices are available. That is, access to physical boards is disabled.
To unset the environment variable, run the following command:
• Linux
unset CL_CONTEXT_MPSIM_DEVICE_INTELFPGA
• Windows
set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=
You might need to set CL_CONTEXT_COMPILER_MODE_INTELFPGA=3 if the host program cannot find the
simulator device.
2. Run your host program. On Linux systems, you can use GDB or Eclipse to debug your host. If necessary,
you can inspect the simulation waveforms for your kernel code to verify the functionality of the generated
hardware. If you compiled with the -Xsghdl flag, running your compiled program produces a waveform
file (vsim.wlf) that you can view in the Questa*-Intel FPGA Edition software as your host code executes.
The vsim.wlf file is written to the same directory from where you ran your host program.
By default, the Intel oneAPI DPC++/C++ Compiler instructs the simulator not to log any signals because logging
signals slows the simulation, and the waveform files are enormous. However, you can configure the compiler to
save these waveforms for debugging purposes.
To enable signal logging in the simulator, invoke the icpx command with the -Xsghdl option, as follows:
Specify the <depth> attribute to indicate the number of hierarchy levels logged. A depth value of 1 logs only the
top-level signals. A depth of 1 is used as the default if you do not specify the <depth> attribute.
After running the simulation, you can view the generated waveform files by invoking the appropriate script as
follows:
• Linux
bash <project_directory>/view_waveforms.sh
• Windows
<project_directory>\view_waveforms.cmd
Review this section to troubleshoot simulator problems you might have when attempting to run a simulation.
On Windows, simulation might fail at compilation time or run time if you are running from a directory with a very
long path. Use the -o compiler option to output your compilation results to a shorter path.
If you receive the following error message, you might be mixing resources from multiple simulators, such as
Questa*-Intel FPGA Edition and ModelSim* SE:
An example of mixing simulator resources is compiling a device with ModelSim* SE and running the host pro-
gram in Questa*-Intel FPGA Starter Edition.
Questa*-Intel FPGA Starter Edition software has limitations on design size that prevent it from simulating large
designs. When trying to launch a simulation using Questa*-Intel FPGA Starter Edition software, you may en-
counter the following error message:
Instead, simulate the designs with Questa*-Intel FPGA Edition or ModelSim* SE software.
Depending on whether you are targeting the FPGA emulator or FPGA hardware, you must use the correct
SYCL* device selector in the host code. You can use the FPGA hardware device selector for simulation also.
The following host code snippet demonstrates how you can use a selector to specify the target device at com-
pile time:
// FPGA device selectors are defined in this utility header, along with
// all FPGA extensions such as pipes and fpga_reg
#include <sycl/ext/intel/fpga_extensions.hpp>
using namespace sycl; // the selector types below live under the sycl::ext namespace
int main() {
// Select either:
// - the FPGA emulator device (CPU emulation of the FPGA)
// - the FPGA device (a real FPGA, can be used for simulation too)
#if defined(FPGA_EMULATOR)
ext::intel::fpga_emulator_selector device_selector;
#elif defined(FPGA_SIMULATOR)
ext::intel::fpga_simulator_selector device_selector;
#else
ext::intel::fpga_selector device_selector;
#endif
queue q(device_selector);
...
}
Note:
• The FPGA emulator and the FPGA are different target devices. Intel® recommends using a preprocessor
define to choose between the emulator and FPGA selectors. This makes it easy to switch between tar-
gets using only command-line flags. For example, you can compile the above code snippet for the FPGA
emulator by passing the flag -DFPGA_EMULATOR to the icpx command.
• Since FPGAs support only the ahead-of-time compilation method, dynamic selectors (such as the
default_selector) are less useful than explicit selectors when targeting FPGAs.
Caution: When targeting the FPGA emulator or FPGA hardware, you must pass correct compiler flags and
use the correct device selector in the host code. Otherwise, you might experience runtime failures. Refer to
the fpga_compile tutorial in the Intel® oneAPI Samples Browser to get started with compiling SYCL code for
FPGA.
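As an illustration of switching targets with command-line flags (file names illustrative; the -Xshardware flag, assumed here, selects the full hardware compile):
icpx -fsycl -fintelfpga -DFPGA_EMULATOR fpga_compile.cpp -o compile_emu.out
icpx -fsycl -fintelfpga -Xshardware fpga_compile.cpp -o compile_hw.out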
In the FPGA IP Authoring flow, you target your SYCL* code to generate IP components that you can integrate
into a custom Intel® Quartus® Prime project. You target your compilation to a supported Intel® FPGA device
family or part number instead of a specific acceleration platform.
Use this flow to help speed your IP development by letting you compile your SYCL* code to standalone IPs on
different targets that you can take and deploy into your systems.
For details about getting started with the IP component development flow, refer to Getting Started with Intel®
oneAPI Toolkits and Intel® Quartus® Prime Software.
The typical design flow when you author IP components consists of the following stages:
After synthesis, the FPGA optimization report estimates the maximum clock rate at which your component can
cleanly close timing.
After the Intel® Quartus® Prime compilation is completed, the summary section of the FPGA optimization report
shows the area and performance data for your components.
These estimates are more accurate than estimates generated when you compile your IP component for simu-
lation only.
Typically, Intel® Quartus® Prime compilation times can take minutes to hours depending on the size and com-
plexity of your IP components.
To synthesize your component IP and generate quality of results (QoR) data, instruct the compiler to run
the Intel® Quartus® Prime compilation flow automatically after synthesizing the components. Include the
-Xstarget=<FPGA device family> or -Xstarget=<FPGA part number> options in your icpx command:
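The command itself was lost in the source; a representative sketch (assuming the -Xshardware flag for the full Quartus compile and an illustrative device family) is:
icpx -fsycl -fintelfpga -Xshardware -Xstarget=Agilex kernel.cpp -o kernel.out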
When you write IP components in SYCL, consider these additional requirements and techniques.
The compiler generates a component interface for integrating your RTL component into a larger system. An IP
component has two basic interface types: the component invocation interface and the data interface.
IPs are generated by default using a control-and-status register (CSR) agent interface for consuming inputs.
The Streaming IP Component Kernels section demonstrates how to use a streaming interface instead.
You can pass data into a kernel using the default arguments, host pipes, or through memory (using accessors
or USM). You can pass items by value in the capture list of the lambda expression (often called a lambda) or
by using an accessor or a Unified Shared Memory (USM) pointer to create an Avalon memory mapped host
interface on your IP.
Your IP can produce output only through an accessor, USM pointer, or pipe. The CSR interface cannot capture
output from an IP component generated by the Intel® oneAPI DPC++/C++ Compiler.
For creating your IP, use one of the following recommended general coding styles:
• Lambda Coding Style Example: The lambda coding style is typically used in most full-system SYCL pro-
grams.
• Functor Coding Style Example: You can write your IP component (kernel) code out-of-line from the host
code with the functor coding style.
#include <sycl/sycl.hpp>
#include <iostream>
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <vector>
#define VECT_SIZE 4
// Forward declaration of the kernel name used in single_task below.
class SimpleVAdd;
int main() {
using namespace sycl;
#ifdef FPGA_EMULATOR
sycl::ext::intel::fpga_emulator_selector my_selector;
#else
sycl::ext::intel::fpga_selector my_selector;
#endif
queue q(my_selector);
int count = VECT_SIZE;
std::cout << "add two vectors of size " << count << std::endl;
// Copy the input arrays into the USM so the kernel can see them
int *A = malloc_shared<int>(count, q);
int *B = malloc_shared<int>(count, q);
int *C = malloc_shared<int>(count, q);
// The code inside the lambda expression describes your IP. Inputs
// and outputs are inferred from the lambda capture list.
q.single_task<SimpleVAdd>([=]() [[intel::kernel_args_restrict]] {
[[intel::speculated_iterations(0)]]
[[intel::initiation_interval(1)]]
for (int i = 0; i < count; i++) {
C[i] = A[i] + B[i];
}
})
.wait();
// (result verification, sycl::free calls, and the return statement are elided in the source)
}
With this style, you can specify all the interfaces in one location and make a call to your IP component from your
SYCL* host program.
#include <sycl/sycl.hpp>
#include <iostream>
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <vector>
// The members of the functor serve as inputs and outputs to your IP.
// The code inside the operator()() function describes your IP.
class SimpleVAddKernel {
int *A, *B, *C;
int count;
public:
SimpleVAddKernel(int *A_in, int *B_in, int *C_out, int count_in)
: A(A_in),
B(B_in),
C(C_out),
count(count_in) {}
// The code inside operator()() describes the IP component's computation.
void operator()() const {
for (int i = 0; i < count; i++)
C[i] = A[i] + B[i];
}
};
#define VECT_SIZE 4
// Forward declaration of the kernel name used in single_task below.
class SimpleVAdd;
int main() {
using namespace sycl;
#ifdef FPGA_EMULATOR
sycl::ext::intel::fpga_emulator_selector my_selector;
#else
sycl::ext::intel::fpga_selector my_selector;
#endif
queue q(my_selector);
int count = VECT_SIZE;
std::cout << "add two vectors of size " << count << std::endl;
// Copy the input arrays into the USM so the kernel can see them
int *A = malloc_shared<int>(count, q);
int *B = malloc_shared<int>(count, q);
int *C = malloc_shared<int>(count, q);
q.single_task<SimpleVAdd>(SimpleVAddKernel{A, B, C, count}).wait();
// (result verification, sycl::free calls, and the return statement are elided in the source)
}
Memory-Mapped interfaces
You can instantiate Memory-mapped (MM) interfaces in one of the following ways:
• Memory-Mapped Interface Using Accessors: Using accessors allows the compiler to manage the copy-
ing of the memory between the host and device.
• Memory-Mapped Interface Using Unified Shared Memory: Using unified shared memory (USM) lets
you take full control and manage copying data from the host to the device and vice versa. However, USM
pointers allow you to customize the memory-mapped host interface further.
The following example shows how to create multiple memory-mapped (mm_host) interfaces using the buffer_-
location property in SYCL*:
#include <sycl/sycl.hpp>
#include <iostream>
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <vector>
// The members of the functor serve as inputs and outputs to your IP.
// The code inside the operator()() function describes your IP.
template <class AccA, class AccB, class AccC>
class SimpleVAddKernel {
AccA A;
AccB B;
AccC C;
int count;
public:
SimpleVAddKernel(AccA A_in, AccB B_in, AccC C_out, int count_in)
: A(A_in),
B(B_in),
C(C_out),
count(count_in) {}
// The code inside operator()() describes the IP component's computation.
void operator()() const {
for (int i = 0; i < count; i++)
C[i] = A[i] + B[i];
}
};
// Forward declaration of the kernel name used in single_task below.
class SimpleVAdd;
int main() {
using namespace sycl;
#ifdef FPGA_EMULATOR
sycl::ext::intel::fpga_emulator_selector my_selector;
#elif FPGA_SIMULATOR
sycl::ext::intel::fpga_simulator_selector my_selector;
#else
sycl::ext::intel::fpga_selector my_selector;
#endif
queue q(my_selector);
constexpr int count = 4;
// Host data backing the buffers (initialization elided in the source).
std::vector<int> VA(count), VB(count), VC(count);
std::cout << "add two vectors of size " << count << std::endl;
buffer bufferA{VA};
buffer bufferB{VB};
buffer bufferC{VC};
q.submit([&](handler &h) {
accessor accessorA{bufferA, h, read_only};
accessor accessorB{bufferB, h, read_only};
accessor accessorC{bufferC, h, read_write, no_init};
h.single_task<SimpleVAdd>(SimpleVAddKernel<decltype(accessorA), decltype(accessorB), decltype(accessorC)>{accessorA, accessorB, accessorC, count});
});
// (buffer destruction at end of scope copies results back to VC; verification elided in the source)
}
You can use unified shared memory (USM) to fully customize a memory-mapped interface when compiling for
an IP-only flow.
To customize the interface, use a functor to specify the component and use one of the two compiler-defined
macros.
The following macro creates a memory-mapped host interface with the specified parameters. The base pointer
is passed in through the register map.
register_map_mmhost(
BL1, // buffer_location or aspace
28, // address width
64, // data width
16, // ! latency, must be at least 16
0, // read_write_mode, 0: ReadWrite, 1: Read, 2: Write
1, // maxburst
0, // align, 0 defaults to alignment of the type
1 // waitrequest, 0: false, 1: true
) int *x;
You can also use the following macro instead to have the base pointer passed in through a conduit interface.
conduit_mmhost(
BL1, // buffer_location or aspace
28, // address width
64, // data width
16, // ! latency, must be at least 16
0, // read_write_mode, 0: ReadWrite, 1: Read, 2: Write
1, // maxburst
0, // align, 0 defaults to alignment of the type
1 // waitrequest, 0: false, 1: true
) int *x;
When you specify the macro properties, the order of the properties must be preserved. The compiler exits with
an error when you provide an unsupported combination. You can customize the following properties:
To include this macro in your program, create a kernel as a functor, allocate the memory on the host, and copy the data between the host and the kernel.
The following example creates two memory-mapped interfaces. The host program must allocate the memory using malloc_shared, and this allocation requires the buffer location as a property. The host initializes two values to 0, the kernel sets them to 5 and 6, respectively, and the host then copies the results back, frees the allocated memory, and verifies that the output is as expected.
#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <sycl/ext/intel/prototype/interfaces.hpp>
struct MyIP {
register_map_mmhost(
BL1, // buffer_location or aspace
28, // address width
64, // data width
16, // ! latency, must be at least 16
0, // read_write_mode, 0: ReadWrite, 1: Read, 2: Write
1, // maxburst
0, // align, 0 defaults to alignment of the type
1 // waitrequest, 0: false, 1: true
) int *x;
register_map_mmhost(
BL2, // buffer_location or aspace
28, // address width
64, // data width
16, // ! latency, must be at least 16
0, // read_write_mode, 0: ReadWrite, 1: Read, 2: Write
1, // maxburst
0, // align, 0 defaults to alignment of the type
1 // waitrequest, 0: false, 1: true
) int *y;
register_map_interface
void operator()() const {
*x = 5;
*y = 6;
}
};
// (Not shown in this excerpt: Test() creates the queue and allocates HostA
// and HostB with sycl::malloc_shared, passing the buffer_location property
// required by each interface, then initializes both values to 0.)
void Test(int *first, int *second) {
  // ... queue q and USM allocations HostA/HostB elided ...
  q.single_task(MyIP{HostA, HostB}).wait();
*first = *HostB;
*second = *HostA;
sycl::free(HostA, q);
sycl::free(HostB, q);
}
int main() {
int first = 0;
int second = 0;
Test(&first, &second);
return 0;
}
Host Pipes
The host pipe implementation is a prototype implementation that relies on prototype features that are not incorporated into the standard interkernel pipes.
To separate this host pipe implementation from the existing interkernel pipe implementation, host pipes are declared in a different namespace than interkernel pipes.
This namespace is as follows:
sycl::ext::intel::prototype
The templated pipe class is declared in the following header file:
$INTELFPGAOCLSDKROOT/include/sycl/ext/intel/prototype/host_pipes.hpp
Additionally, this prototype implementation of host pipes relies on USM for simulation. When simulating your IP
for verification in a SYCL* program, you can use only boards and devices that support USM with host pipes.
Each individual host pipe is a function scope class declaration of the templated pipe class.
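The declarations themselves are not included in this excerpt. A minimal sketch of their shape, assuming the prototype pipe template takes an ID type, the data type, and a min_capacity parameter (the ID and alias names below are hypothetical):
// Hypothetical pipe IDs and aliases matching the description that follows.
class HostToDevicePipeID;
class DeviceToHostPipeID;
using H2DPipe = sycl::ext::intel::prototype::pipe<HostToDevicePipeID, int, 8>;
using D2HPipe = sycl::ext::intel::prototype::pipe<DeviceToHostPipeID, float, 4>;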
Both pipe declarations use an alias for the full templated pipe class name for convenience. The first carries int
type and has a min_capacity of 8. The second carries float type data and has a min_capacity of 4. By not
specifying parameters after the min_capacity parameter, the default values from the earlier table are used for
both pipes.
Host pipes expose read and write interfaces that allow a single element to be read or written in FIFO order to the
pipe. These read and write interfaces are static class methods on the templated classes described later in this
section and in Declare a Host Pipe.
Blocking Write
The host pipe write interface writes a single element of the given data type (int in the examples that follow)
to the host pipe. On the host side, this class method accepts a reference to a SYCL* device queue as its first
argument and the element being written as its second argument.
queue q(...);
...
int data_element = ...;
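A sketch of the call that completes this fragment, using the H2DPipe alias assumed earlier:
// Blocking host-side write: the queue is the first argument and the
// element being written is the second.
H2DPipe::write(q, data_element);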
In the FPGA kernel, writes to a host pipe accept a single argument, which is the element being written.
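A corresponding device-side sketch (pipe alias assumed):
// Inside the kernel: a blocking write takes only the element being written.
D2HPipe::write(result);  // result: a float computed by the kernel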
Non-blocking Write
Non-blocking writes add a bool argument in both host and device APIs that is passed by reference and returns
true in this argument if the write was successful, and false if it was unsuccessful.
On the host:
queue q(...);
...
int data_element = ...;
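A sketch completing this fragment with the non-blocking call (alias assumed):
bool success = false;
// Non-blocking host-side write; success reports whether the write occurred.
H2DPipe::write(q, data_element, success);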
On the device:
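A device-side sketch (alias assumed):
bool success = false;
// Non-blocking device-side write of a single element.
D2HPipe::write(result, success);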
Blocking Read
The host pipe read interface reads a single element of a given data type from the host pipe. Like the write in-
terface, the read interface on the host takes a SYCL* device queue as a parameter. The device read interface
consists of the class method read call with no arguments.
On the host:
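A host-side sketch (alias assumed):
// Blocking host-side read takes the SYCL device queue as its parameter.
float value = D2HPipe::read(q);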
On the device:
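A device-side sketch (alias assumed):
// Blocking device-side read takes no arguments.
int value = H2DPipe::read();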
Non-blocking Read
Like non-blocking writes, non-blocking reads add a bool argument in both host and device APIs that is passed
by reference and returns true in this argument if the read was successful and false if it was unsuccessful.
On the host:
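A host-side sketch (alias assumed):
bool success = false;
// Non-blocking host-side read; success reports whether a value was read.
float value = D2HPipe::read(q, success);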
On the device:
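A device-side sketch (alias assumed):
bool success = false;
// Non-blocking device-side read.
int value = H2DPipe::read(success);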
Host pipe connections for a particular host pipe are inferred by the compiler from the presence of read and write
calls to that host pipe in your code.
A host pipe can be connected from the host only to a single kernel. That is, host pipe calls for a particular host
pipe must be restricted to the same kernel.
Host pipes can also operate in only one direction. That is, host-to-kernel or kernel-to-host.
Host code for a particular host pipe can contain only writes or only reads to that pipe, and the corresponding kernel code for the same host pipe must consist only of the opposite transaction.
The prototype implementation of host pipes is intended to use a two-part compilation flow to generate your IP.
To simulate your IP using a SYCL* program testbench, compile your full SYCL* program as follows:
icpx -fsycl -fintelfpga -Xssimulation -Xstarget=<FPGA device family or part number> <source.cpp>
The simulation flow uses additional “helper” kernels to connect the host pipes from each kernel to the host part
of the program. In the reports generated by the compiler, you can identify your IP by the name you have given it
in your SYCL* program.
When you have verified the functionality of your IP authoring kernel, you can generate RTL for your IP with the
following compile command:
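The command itself is omitted here. Combining the hardware flag used elsewhere in this guide with the per-kernel split option described in the note below gives a sketch along these lines (the exact option set is an assumption, not a confirmed command line):
icpx -fsycl -fintelfpga -Xshardware -fsycl-device-code-split=per_kernel -Xstarget=<FPGA device family or part number> <source.cpp>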
This command generates a separate project directory in your current working directory for each of your IPs, and
directories for the “helper” kernels that you can ignore.
Note: You cannot simulate your full program when using the -fsycl-device-code-split=per_kernel option. It is primarily used to generate RTL for each of your kernels.
You can identify these directories by the extension .prj, with each subsequent kernel appending _<#> to the
project directory, where <#> is an incrementing integer. For example, when compiling a source program named
main, project directories are named main.prj, main_1.prj, main_2.prj, and so on.
This section provides a summary of interfacing with host pipes in your IP based on the choice of protocol.
Host pipes support Avalon streaming and memory-mapped interfaces. Refer to the Intel® Avalon Interface
Specifications for details about these protocols.
For AVALON_MM protocols, register addresses in the CRA are specified in the generated kernel header file in the
project directory. Refer to Example Register Map File for further details on CRA agent registers.
AVALON_STREAMING_USES_READY Protocol
This protocol allows the sink to backpressure by deasserting the ready signal. The sink signifies that it is ready to consume data by asserting the ready signal. After the cycle in which the sink asserts the ready signal, the source must wait ready_latency cycles before responding with the valid and data signals, where ready_latency is specified by a template parameter in the host pipe declaration.
Host-to-Device Pipe
When the uses_valid template parameter is set to false and the ready signal is asserted by the kernel and
sampled by the host, the host must wait ready_latency cycles before the value on the data interface is sampled
by the kernel and consumed.
When the uses_valid template parameter is set to true and the ready signal is asserted by the kernel and sampled by the host, the host must wait ready_latency cycles before the valid and data interfaces are sampled by the kernel and consumed.
Device-to-Host Pipe
When the uses_valid template parameter is set to true and the host asserts the ready signal, the kernel replies
with valid=1 and qualified data (if available) ready_latency cycles after the corresponding ready was first as-
serted.
When the uses_valid template parameter is set to false and the host asserts the ready signal, the kernel
replies with qualified data ready_latency cycles after the corresponding ready was first asserted.
AVALON_STREAMING_ALWAYS_READY Protocol
With this choice of protocol, no ready signal is exposed by the host pipe, and the sink cannot backpressure.
The valid signal qualifies data transfer from source to sink per cycle when the uses_valid template parameter
is set to true. When the uses_valid template parameter is set to false, the source implicitly provides a valid
output on every cycle, and the sink assumes a valid input on every cycle.
Host-to-Device Pipe
When the uses_valid template parameter is set to false, the kernel samples and processes the value on the host pipe data interface on each cycle.
When the uses_valid template parameter is set to true, the kernel samples and processes the value on the
host pipe data interface on each cycle that the valid signal is asserted.
Device-to-Host Pipe
When the uses_valid template parameter is set to false, the host must sample and process values on the host
pipe data interface every clock cycle. Failure to do so causes the data to be dropped.
When the uses_valid template parameter is set to true, the host must sample and process values on the host
pipe data interface every clock cycle that the valid signal is asserted. Failure to do so causes the data to be
dropped.
AVALON_MM Protocol
With this protocol, an implicit ready signal is held high, and the sink cannot backpressure.
Intel does not recommend using this protocol with device-to-host pipes. The uses_valid template parameter
must also be set to true. Both the valid and data signals for the pipe are stored in registers implemented in the
CRA agent.
Host-to-Device Pipe
The host writes a 1 to the valid register to indicate that the value in the data register is qualified. When the
kernel has consumed this data, the kernel automatically clears the value in the valid register. A cleared valid
register signifies that the host is free to write a new value into the data register.
AVALON_MM_USES_READY Protocol
With this protocol, an additional register in the CRA is created to hold the ready signal. You must set the uses_valid template parameter to true.
Host-to-Device Pipe
The kernel writes a 1 to the ready register when it is available to receive data. The host writes a 1 to the valid
register to indicate that the value in the data register is qualified.
Device-to-Host Pipe
The kernel writes a 1 to the valid register to indicate that the value in the data register is qualified. This value is held in the data register until the host writes a 1 to the ready register, which signifies that the host has consumed the valid data from the data register. The kernel clears the ready register when it has written subsequent qualified data and set the valid register.
Avalon packet sideband signal support is enabled by including the host_pipes.hpp header and defining host
pipes using the AvalonPacket struct defined in the following header file:
$INTELFPGAOCLSDKROOT/include/sycl/ext/intel/prototype/pipes_ext.hpp
Using the AvalonPacket struct with the uses_packets template parameter set to true adds two additional 1-bit
signals to the Avalon interface, start_of_packet (sop), and end_of_packet (eop).
Assert the sop signal when you send the first packet along with a valid signal assertion. Assert the eop signal
when you send the last packet, along with a valid signal assertion. You can assert sop and eop signals in the
same cycle for a single packet transfer transaction. The sop signal can also be asserted on the cycle immediately
after the eop signal was asserted for the previous packet.
The third template parameter for the AvalonPacket struct signifies uses_empty. When uses_empty is set to true, it adds an extra empty signal that is ceil(log2(data_width / bits_per_symbol)) bits wide. The empty signal indicates the number of symbols that are empty during the eop cycle.
Empty symbols are always the last symbols in the data. That is, the symbols carried by low-order bits when
first_symbol_In_high_order_bits is true, or the high-order bits if first_symbol_In_high_order_bits
is set to false.
Setting uses_empty is required for all packet interfaces carrying more than one symbol of data that have a vari-
able length packet format.
The following example uses the AvalonPacket struct with the uses_packets and uses_empty template parameters both set to true. The size of the PipeData type should be a multiple of the number of bits per symbol.
When you define the host pipe, set the data type to the AvalonPacket struct:
The following code example instantiates a packet struct and writes to the pipe (from the host):
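Neither code example survives in this excerpt. The following is a rough sketch only; the AvalonPacket template parameters and the member names (data, sop, eop) are assumptions based on the descriptions above, not confirmed API:
// Hypothetical pipe whose payload is an AvalonPacket carrying int data.
using PipeData = AvalonPacket<int>;  // uses_packets/uses_empty set via template parameters in the real header
class PacketPipeID;
using PacketPipe = sycl::ext::intel::prototype::pipe<PacketPipeID, PipeData, 8>;

// Host side: build a single-beat packet and write it to the pipe.
PipeData packet;
packet.data = 42;
packet.sop = true;  // first beat of the packet
packet.eop = true;  // last beat of the packet
PacketPipe::write(q, packet);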
The following code example reads from the pipe and extracts the packet signals (from device):
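Continuing the sketch above with the same assumed member names:
// Device side: read a packet and extract the sideband signals.
PipeData packet = PacketPipe::read();
bool sop = packet.sop;
bool eop = packet.eop;
int payload = packet.data;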
SYCL* kernels generate an interface that can control the kernel and pass in the default arguments to the IP component.
By default, the Intel® oneAPI DPC++/C++ Compiler generates an Avalon agent interface to control the kernel and
pass in the default arguments. The compiler also generates a header file that provides the addresses of various
registers in the agent memory map. A top-level header named register_map_offsets.hpp is created for each
device image that you can include if you are interfacing with the SYCL* device image.
An additional header is generated for each of your kernels within the .prj directory. The register_map_offsets.hpp header file includes these per-kernel files, which contain the addresses and offsets for each of the kernels.
/************************************************************************/
/* Memory Map Summary */
/************************************************************************/
/*
Address | Access | Register | Argument
------------|---------|-----------------------|-----------------------------
0x0 | R/W | register0[31:0] | Status[31:0]
------------|---------|-----------------------|-----------------------------
0x28 | R/W | register5[31:0] | FinishCounter[31:0]
| | register5[63:32] | FinishCounter[31:0]
------------|---------|-----------------------|-----------------------------
0x78 | R/W | register15[63:0] | arg_AccRes[63:0]
------------|---------|-----------------------|-----------------------------
0x80 | R/W | register16[63:0] | arg_AccRes1[63:0]
------------|---------|-----------------------|-----------------------------
0x88 | R/W | register17[63:0] | arg_AccRes2[63:0]
------------|---------|-----------------------|-----------------------------
While agent kernels are the default for kernels, there is a register_map_interface macro to explicitly mark a function as an agent kernel. This is shown in the following example:
#include <sycl/ext/intel/prototype/interfaces.hpp>
using namespace sycl;
struct MyIP {
int *input_a, *input_b, *input_c;
int n;
MyIP(int *a, int *b, int *c, int N_)
: input_a(a), input_b(b), input_c(c), n(N_) {}
register_map_interface void operator()() const {
for (int i = 0; i < n; i++) {
input_c[i] = input_a[i] + input_b[i];
}
}
};
You can also choose to have the Intel® oneAPI DPC++/C++ Compiler implement the IP component kernel inter-
face as a streaming interface.
To have the compiler implement the IP kernel interface as a streaming interface:
1. Implement the IP kernel as a functor.
2. Include the following header file:
sycl/ext/intel/prototype/interfaces.hpp
3. Add one of the following options to the compiler command (icpx -fsycl):
• Linux: -I $INTELFPGAOCLSDKROOT/include
• Windows: /I %INTELFPGAOCLSDKROOT%\include
4. Add the streaming_interface macro to the functor operator().
The following code shows an example of implementing a streaming interface:
#include <sycl/ext/intel/prototype/interfaces.hpp>
using namespace sycl;
struct MyIP {
int *input_a, *input_b, *input_c;
int n;
MyIP(int *a, int *b, int *c, int N_)
: input_a(a), input_b(b), input_c(c), n(N_) {}
streaming_interface void operator()() const {
for (int i = 0; i < n; i++) {
input_c[i] = input_a[i] + input_b[i];
}
}
};
The resulting IP component kernel is invoked as a streaming kernel. Compiling the example code generates the
start signal, the done signal, the ready_in signal, and ready_out signals as conduits. The compilation of the
example code also generates conduits for the base addresses of the three pointers, as well as the value of n.
The streaming handshaking follows the Avalon Streaming (ST) protocol. The IP kernel consumes the argu-
ments on the clock cycle that the start and ready_out signals are asserted. The IP component kernel invoca-
tion is finished on the clock cycle that the done and ready_in signals are asserted.
Note: In the SYCL* device image generated for a streaming-controlled kernel, the top-level RTL still contains
Avalon Agent interface ports. You can safely ignore these ports if the user kernel does not contain any agent
interfaces.
The following actions are not supported when using a streaming IP component kernel:
• Using streaming kernels as SYCL NDRange kernels.
• Profiling of streaming kernels.
• Using agent kernel arguments in streaming kernels.
Streaming Arguments
When you generate a kernel, you might want one or more arguments to use the opposite type of interface, for example, a streaming argument with an agent kernel.
By default, the arguments follow the same type of interface as the kernel.
To override a specific interface to use conduits with an agent kernel, use the conduit macro, like in the following
example:
#include <sycl/ext/intel/prototype/interfaces.hpp>
using namespace sycl;
struct MyIP {
conduit int *input_a, *input_b, *input_c;
conduit int n;
MyIP(int *a, int *b, int *c, int N_)
: input_a(a), input_b(b), input_c(c), n(N_) {}
register_map_interface void operator()() const {
for (int i = 0; i < n; i++) {
input_c[i] = input_a[i] + input_b[i];
}
}
};
Pipelined Kernels
By default, SYCL* task kernels are not pipelined; they execute in a back-to-back manner. You must wait for the previous invocation to finish before invoking the kernel again.
However, streaming kernels can be optionally pipelined by using the streaming_pipelined_interface
macro, as shown in the following example:
struct MyIP {
conduit int *input;
MyIP(int *inp_a_) : input(inp_a_) {}
streaming_pipelined_interface void operator()() const {
int temp = *input;
*input = something_complicated(temp);
}
};
Stable Arguments
By default, the Intel® oneAPI DPC++/C++ Compiler assumes that the values of kernel arguments change during
kernel executions.
For pipelined kernels, if a kernel argument does not change while the kernel is executing, you can mark the cor-
responding kernel argument as stable.
Declare a streaming (conduit) kernel argument to be stable with the stable_conduit attribute.
Changing the value of a stable kernel argument results in undefined behavior.
You might save some FPGA area in your kernel design when you declare a streaming (conduit) kernel argument
as stable.
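A minimal sketch of such a declaration, assuming stable_conduit is used in place of the conduit macro on the argument it qualifies (the functor below is hypothetical):
struct MyIP {
  conduit int *input;
  // coef never changes during a kernel invocation, so declare it stable
  // to let the compiler potentially save FPGA area.
  stable_conduit int coef;
  MyIP(int *inp, int c) : input(inp), coef(c) {}
  streaming_pipelined_interface void operator()() const {
    *input = *input * coef;
  }
};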
If none of the kernel arguments change while the kernel is executing, you can include the -Xsno-hardware-kernel-invocation-queue option in your icpx command.
Verify the functionality of your design by compiling your component and testbench to an x86-64 FPGA emula-
tion executable that you can debug with a oneAPI debugger. This process is sometimes referred to as debug-
ging through emulation.
Compiling your design to an x86-64 executable is faster than generating and simulating RTL. Shorter compi-
lation time allows you to debug and refine your component quickly before verifying how your component is im-
plemented in hardware.
No additional software is required to emulate your IP component, and no modifications to your host code are
required.
You can compile your component and testbench to an x86-64 executable for functional verification using the
icpx -fsycl -fintelfpga <source>.cpp command.
To verify the design functionality from the x86-64 emulation of your testbench and component, use one of the
following debugging techniques:
• Running the program to see if it generates the expected output.
• Using print statements in your code (such as printf or std::cout) to output variable values at specific
points.
• Stepping through your code with a debugger.
If you want to step through your code with a debugger, ensure that you set the compiler command to include
debug information and to generate unoptimized binary files. Debug versions of your executables are generated
by default, so a command option such as -ggdb is unnecessary.
To disable debug information, add the -g0 option to your icpx compiler command.
On Linux systems, you can use the GDB provided with the Intel® oneAPI Base Toolkit to debug your component
and testbench.
You can automate the process by using a Makefile or batch script. Use the Makefiles and scripts provided in the
Intel® oneAPI Base Toolkit example designs and tutorials as guides for creating your Makefiles or batch scripts.
When you compile your component to an Intel® FPGA device family or part number with the -Xstarget compiler
option, the Intel oneAPI DPC++/C++ Compiler links your design C++ testbench with an RTL-compiled version of
your component that runs in an RTL simulator.
Use Siemens® EDA Questa® software to perform the simulation. You must have Questa® simulation software
installed when authoring IP components with the Intel oneAPI Base Toolkit. For a list of supported versions of
the Questa® software, refer to the EDA Interface Information section in the Intel® Quartus® Prime Software and
Device Support Release Notes.
Verifying the functionality of your design in this way is sometimes called debugging through simulation.
To verify the design functionality from your design simulation, use the following debugging techniques:
• Run the executable that the compiler generates by targeting the FPGA device. By default, the executable
name is a.out (Linux). For example, you might invoke a command like one of the following commands for
a simple single-file design:
– Linux:
icpx -fsycl -fintelfpga -Xssimulation -Xstarget="Arria10" […] design.cpp
• Write variable values to output pipes or mm_host interfaces at certain points in your code.
• Review the waveforms generated when running your design.
The compiler does not log signals by default when you compile your design. To enable signal logging in
simulation, refer to Debug During Verification.
By default, the compiler instructs the simulator not to log any signals because logging signals slows the simula-
tion, and waveform files can be extremely large. However, you can configure the compiler to save these wave-
forms for debugging purposes.
To enable signal logging in the simulator, invoke the icpx -fsycl command with the -Xsghdl option as follows:
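The command line is not shown here; combining the simulation flags shown earlier in this section with -Xsghdl gives a sketch of the following form:
icpx -fsycl -fintelfpga -Xssimulation -Xsghdl -Xstarget=<FPGA device family or part number> <source.cpp>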
Note: After you compile your component and testbench with the -Xsghdl option, run the resulting executable
to run the simulation and generate the waveform. By default, the name of the executable is a.out (Linux). You
can change the name of the output by using the -o <output_name> option.
When the simulation finishes, a vsim.wlf waveform file is generated.
To view the waveform after the simulation finishes:
1. In the Questa® simulator, open the vsim.wlf file inside the <project name>.prj directory.
2. Right-click the <IP_component_name>_inst block and select Add Wave.
You can now view the top-level component signals: start, done, ready_in, ready_out, parameters, and out-
puts. Use the waveform to see how the component interacts with its interfaces.
Tip: When you view the simulation waveform in the Questa® simulator, the simulation clock period is set
to a default value of 1000 picoseconds (ps). To synchronize the Time axis to show one cycle per tick mark,
change the time resolution from picoseconds (ps) to nanoseconds (ns):
1. Right-click the timeline and select Grid, Timeline & Cursor Control.
2. Under Timeline Configuration, set the Time units to ns.
The Intel® oneAPI DPC++/C++ Compiler provides tools that you can use to find areas for improvement and a
variety of flags, attributes, and extensions to control design and compiler behavior.
For more information about optimizing your design, refer to the FPGA Optimization Guide for Intel®
oneAPI Toolkits.
When you are satisfied with the predicted performance of your component, use Intel® Quartus® Prime software
to synthesize your component. Synthesis also generates accurate area and performance (fMAX ) estimates for
your design. However, your design is not expected to cleanly close timing in the Intel® Quartus® Prime reports.
You can expect to see timing closure warnings in the Intel® Quartus® Prime logs because the generated project
targets a clock speed of 1000 MHz to achieve the best possible placement for your design. The fMAX value
presented in the FPGA optimization report estimates the maximum clock rate your component can cleanly close
timing for.
After the Intel® Quartus® Prime compilation is completed, the summary section of the FPGA optimization report
shows the area and performance data for your components. These estimates are more accurate than estimates
generated when you compile your IP component for simulation only.
Typically, Intel® Quartus® Prime compilation times can take minutes to hours, depending on the size and com-
plexity of your IP components.
To synthesize your component IP and generate quality of results (QoR) data, instruct the compiler to run
the Intel® Quartus® Prime compilation flow automatically after synthesizing the components. Include the -Xshardware option in your icpx -fsycl command:
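The command is omitted here; following the pattern used elsewhere in this guide, it takes this form:
icpx -fsycl -fintelfpga -Xshardware <source.cpp>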
To integrate your IP component into a system with the Intel® Quartus® Prime software, you must be familiar with
Intel® Quartus® Prime software, including Platform Designer.
The Intel® oneAPI DPC++/C++ Compiler generates a project directory (<result>.prj/) and a set of IP
files per device image (a set of kernels that are part of the same system). You can control this with the
-fsycl-device-code-split=<off|per_source|per_kernel> option.
The <result>.prj/ directory generated by the compiler contains all the files that you need to include your IP
component in an Intel® Quartus® Prime project, including the following files:
• <project_name>_di.ip
An ip format file that you can add to your Intel Quartus Prime projects.
• <project_name>_di_hw.tcl
An ip format file that Platform Designer can read.
• <project_name>_di_inst.v
An example of how to instantiate the IP into other Verilog modules.
To use the IP component generated by the Intel® oneAPI DPC++/C++ compiler in an Intel® Quartus® Prime
project, you must first add the .ip file to the project.
The .ip file contains information to add to all the necessary HDL files for the component. It also applies to any
component-specific Intel® Quartus® Prime Settings File (.qsf) settings that are necessary for IP synthesis.
Follow these steps:
1. Create an Intel Quartus Prime Pro Edition project.
2. Open the Platform Designer and select your IP from the oneAPI folder.
For your IP to be in the oneAPI folder, either create the project in the same directory that contains the
generated IP project or add the file path.
3. Create the rest of your Intel Quartus Prime software project.
For an example of how to instantiate the IP component top-level module, examine the <result>.prj/
<project_name>_di_inst.v file.
To use the IP component generated by the Intel® oneAPI DPC++/C++ compiler in a Platform Designer system,
you must first add the directory to the IP search path or the IP Catalog.
In Platform Designer, if your compiler-generated IP does not appear in the IP Catalog, perform the following tasks:
1. In the Intel® Quartus® Prime software, click Tools > Options.
2. In the Options dialog box, under Category, expand IP Settings and click IP Catalog Search Locations.
3. In the IP Catalog Search Locations dialog box, add the path to the directory that contains the _hw.tcl
file to IP Search Paths as <result>.prj/<project_name>.
4. In the IP Catalog, add your IP to the Platform Designer system by selecting it from the oneAPI project
directory.
For more information about Platform Designer, refer to Creating a System with Platform Designer in Intel® Quar-
tus® Prime Pro Edition User Guide: Platform Designer.
If you are a member of the Intel® FPGA Design Solutions Network, you have access to tools to encrypt your IP
design files and generate a license for it. Your IP users can use the encrypted IP only in ways specified by the
generated license.
This license is compatible with the FlexLM licensing technology used by Intel® Quartus® Prime software.
If you have the Intel-provided IP encryption and licensing infrastructure installed, you can also generate encrypted IP with the Intel® oneAPI DPC++/C++ Compiler.
Your encrypted IP can then be used by your customers in Intel® Quartus® Prime software, licensed by the file that
your users added to their Intel® Quartus® Prime license search path. For more details, refer to the documentation
that Intel provided you when you joined the Intel® FPGA Design Solutions Network.
If you want to support simulation with your encrypted IP, you must create a separately-encrypted version of your
IP for simulation. For simulation, an IEEE 1735 compliant encryption scheme is used.
To generate encrypted IP for use in Intel® Quartus® Prime software, use the following command:
Important: Before you run this command, you must create a license file for the IP and add the license file to
your $LM_LICENSE_FILE environment variable.
If the device code and options affecting the device have not changed since the previous compilation, passing
the -reuse-exe=<exe_name> flag instructs the compiler to extract the compiled FPGA hardware image from
the existing executable and package it into the new executable, saving the device compilation time.
Sample use:
# Initial compilation
icpx -fsycl -fintelfpga -Xshardware <files.cpp> -o out.fpga
The initial compilation generates an FPGA device image, which takes several hours. Suppose you now make
some changes to the host code.
# Subsequent recompilation
icpx -fsycl <files.cpp> -o out.fpga -reuse-exe=out.fpga -Xshardware -fintelfpga
Suppose the program is separated into two files, main.cpp and kernel.cpp, where only the kernel.cpp file
contains the device code.
In the normal compilation process, FPGA device image generation happens at link time.
As a result, any change to either the main.cpp or kernel.cpp triggers the regeneration of an FPGA hardware
image.
The following graph depicts this compilation process:
If you want to iterate on the host code and avoid a long compile time for your FPGA device, consider using a device link to separate the device and host compilation (a command sketch follows these steps):
1. Compile the device code.
Input files must include all files that contain the device code. This step might take several hours to complete.
2. Compile the host code.
Input files should include all source files that contain only the host code. These files must not contain any
source code that executes on the device but may contain setup and tear-down code, for example, parsing
command-line options and reporting results. This step takes seconds to complete.
3. Create the device link.
This step takes seconds to complete. The input should include one or more host object files (.o) and
exactly one device image file (.a). When linking a static library (.a file), always include the static library
after its use. Otherwise, the library’s functions are discarded. For additional information about static library
linking, refer to Library order in static linking.
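A sketch of the three steps as commands, assuming kernel.cpp holds the device code and main.cpp the host code; the -fsycl-link=image option also appears later in this document, but the exact option set here is an assumption:
# 1. Compile the device code (may take hours).
icpx -fsycl -fintelfpga -fsycl-link=image -Xshardware kernel.cpp -o dev_image.a
# 2. Compile the host code (seconds).
icpx -fsycl -fintelfpga -c main.cpp -o host.o
# 3. Create the device link (seconds); place the device image after the objects.
icpx -fsycl -fintelfpga host.o dev_image.a -o out.fpga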
Note: You only need to perform steps 2 and 3 when modifying host-only files.
Refer to the fast_recompile tutorial in the Intel® oneAPI Samples Browser for an example using the device link
method.
When you use the -fsycl-device-code-split[=value] option, the compiler compiles each split partition as
if targeting its own device. This option supports the following modes:
• auto: This is the default mode and the same as the -fsycl-device-code-split option without any value.
The compiler uses a heuristic to select the best way of splitting device code.
• off: Creates a single module for all kernels.
• per_kernel: Creates a separate device code module for each kernel. Each device code module contains
a kernel and dependencies, such as called functions and user variables.
• per_source: Creates a separate device code module for each source (translation unit). Each device code
module contains the kernels grouped on a per-source basis and all their dependencies, such as all
used variables and called functions, including the SYCL_EXTERNAL macro-marked functions from other
translation units.
Attention: For FPGA, split partitions must not share device resources, such as memory. Furthermore, kernel pipes must have their source and sink within the same split.
For additional information about this option, refer to the fsycl-device-code-split topic in Intel® oneAPI
DPC++/C++ Compiler Developer Guide and Reference.
Of the mechanisms described above, the -reuse-exe flag mechanism is easier to use than the device link mech-
anism. The flag also allows you to keep your host and device code as a single source, which is preferred for small
programs. For larger and more complex projects, the device link method gives you more control over the com-
piler’s behavior.
However, there are some drawbacks of the -reuse-exe flag when compared to compiling separate files. Con-
sider the following when using the -reuse-exe flag:
• The compiler must spend time partially recompiling and then analyzing the device code to ensure that it
is unchanged. This takes several minutes for larger designs. Compiling separate files does not incur this
extra time.
• You might occasionally encounter a false positive where the compiler incorrectly believes it must recom-
pile your device code. In a single source file, the device and host code are coupled, so certain changes to
the host code can change the compiler’s view of the device code. The compiler always behaves conser-
vatively and triggers a full recompilation if it cannot prove that reusing the previous FPGA binary is safe.
Compiling separate files eliminates this possibility.
Use this feature of the Intel® oneAPI DPC++/C++ Compiler when you want to split your FPGA compilation into
different FPGA images. This feature is particularly useful when your design does not fit on a single FPGA. You
can use it to split your very large design into multiple smaller images, which you can use to partially reconfigure
your FPGA device.
You can split your design using one of the following approaches, each giving you different benefits:
• Dynamic Linking Flow
• Dynamic Loading Flow
Between the two flows, dynamic linking is easier to implement than dynamic loading. However, dynamic linking
can require more memory on the host device as all of the device images must be loaded into memory. Dynamic
loading addresses these limitations but introduces the need for some extra source-level changes. The following
comparison table highlights the differences between the flows:
This flow allows you to split your design into different source files and map them into a separate FPGA image.
Intel® recommends this flow for designs with a small number of FPGA images.
To use this flow, perform the following steps:
1. Split your source code such that for each FPGA image you want, you create a separate .cpp file that sub-
mits various kernels. Separate the host code into one or more .cpp files that can then interface with func-
tions in the kernel files.
Consider that you now have the following three files:
• main.cpp containing your host code. For example:
// main.cpp
int main() {
queue queueA;
add(queueA);
mul(queueA);
}
• vector_add.cpp containing a function that submits the vector_add kernel. For example:
// vector_add.cpp
extern "C"{
void add(queue queueA) {
queueA.submit(
// Kernel Code
);
}
}
• vector_mul.cpp containing a function that submits the vector_mul kernel. For example:
// vector_mul.cpp
extern "C"{
void mul(queue queueA) {
queueA.submit(
// Kernel Code
);
}
}
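The remaining compile and link commands for this flow are not included in this excerpt. A hedged sketch, borrowing the -fPIC and -shared options from the shared library flow described later in this document (the exact option set is an assumption):
# Compile each kernel file into its own FPGA image packaged as a shared library.
icpx -fsycl -fintelfpga -fPIC -shared -Xshardware vector_add.cpp -o vector_add.so
icpx -fsycl -fintelfpga -fPIC -shared -Xshardware vector_mul.cpp -o vector_mul.so
# Link the host program against both libraries.
icpx -fsycl -fintelfpga main.cpp vector_add.so vector_mul.so -o main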
With this flow, the long FPGA compile steps are split into separate commands that you can potentially run on
different systems or only when you change the files.
Use this flow to avoid loading all of the different FPGA images into memory at once. Similar to dynamic linking
flow, this flow also requires you to split your code. However, for this flow, you must load the .so (shared object)
files in the host program. The advantage of this flow is that you can load large FPGA image files dynamically as
necessary instead of linking all image files at compile time.
To use this flow, perform the following steps:
1. Split your source code in the same manner as done in step 1 of the dynamic linking flow.
2. Modify the main.cpp file to appear as follows:
// main.cpp
#include <dlfcn.h>
#include <sycl/sycl.hpp>
using namespace sycl;
int main() {
queue queueA;
bool runAdd, runMul;
// Assuming runAdd and runMul are set dynamically at runtime
if (runAdd) {
auto add_lib = dlopen("./vector_add.so", RTLD_NOW);
auto add = (void (*)(queue))dlsym(add_lib, "add");
add(queueA);
}
if (runMul) {
auto mul_lib = dlopen("./vector_mul.so", RTLD_NOW);
auto mul = (void (*)(queue))dlsym(mul_lib, "mul");
mul(queueA);
}
}
Note: You do not have to link the .so files at compile time since they are loaded dynamically at runtime.
With this approach, you can arbitrarily load many .so files at runtime. This is useful when you have a large library
of FPGA images, and you want to select a subset of files from it.
As mentioned earlier in Types of FPGA Compilation, generating an FPGA hardware image requires Intel® Quartus® Prime software to map your design from RTL to the FPGA's primitive hardware resources. For the BSPs necessary to compile to FPGA hardware, refer to the Intel® FPGA development flow webpage.
What is a Board?
Like a GPU, an FPGA is an integrated circuit that must be mounted onto a card or a board to interface with a
server or a desktop computer. In addition to the FPGA, the board provides memory, power, and thermal man-
agement, and physical interfaces to allow the FPGA to communicate with other devices.
What is a BSP?
A BSP consists of software layers and an FPGA hardware scaffold design that makes it possible to target the
FPGA through the Intel® oneAPI DPC++/C++ Compiler. The FPGA design generated by the compiler is stitched
into the framework provided by the BSP.
A BSP can provide multiple board variants that support different functionality. For example, the intel_s10sx_-
pac BSP contains two variants that differ in their support for Unified Shared Memory (USM). For additional in-
formation about USM, refer to the Unified Shared Memory and USM Interfaces topics in the SYCL Reference
Documentation.
Note: A board can be supported by more than one BSP and a BSP might support more than one board variant.
The Intel® FPGA Add-On for oneAPI Base Toolkit provides BSPs for two boards. You can select the board variants provided by these BSPs with the -Xstarget flag in your icpx -fsycl command.
Note:
• The Intel® oneAPI Base Toolkit provides partial BSPs sufficient for generating the FPGA early image and optimization report. In contrast, the Intel® FPGA Add-On for oneAPI Base Toolkit provides full BSPs, which are necessary for generating the FPGA hardware image.
• When running an executable on an FPGA board, you must ensure that you have initialized the FPGA board
for the board variant that the executable is targeting. For information about initializing an FPGA board,
refer to FPGA Board Initialization.
• For information about FPGA optimizations possible with Restricted USM, refer to Prepinning and Zero-
Copy Memory Access topics in the FPGA Optimization Guide for Intel® oneAPI Toolkits.
Before you run an executable containing an FPGA hardware image, you must initialize the FPGA board using
the following command:
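The command and its argument descriptions are missing from this excerpt. The board initialization utility takes the device and the board variant, roughly as follows (the argument names are placeholders):
aocl initialize <device_num> <board_variant>
# <device_num>: the FPGA device to initialize, for example, acl0
# <board_variant>: the board variant that your executable targets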
For example, consider that you have a single Intel® Programmable Acceleration Card (PAC) D5005 (previously
known as Intel® Programmable Acceleration Card (PAC) with Intel® Stratix® 10 SX) on your system, and you
compile the executable using the following compiler command:
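The command is not included in this excerpt; for the D5005 board it would name the BSP and its board variant, along the lines of this sketch (the BSP and variant names are assumptions):
icpx -fsycl -fintelfpga -Xshardware -Xstarget=intel_s10sx_pac:pac_s10 <source.cpp>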
In this case, you must initialize the board using the following command:
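Continuing the sketch above, the matching initialization command would be:
aocl initialize acl0 pac_s10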
Once this is complete, you can run the executable without initializing the board again, unless you are doing one
of the following:
• Running a SYCL*-compiled workload for the first time after power cycling the host.
• Running a SYCL-compiled workload after running a non-SYCL workload on the FPGA.
• Running a SYCL compiled workload compiled with a different board variant in -Xstarget flag.
The aocl binedit utility allows you to extract the following useful information about the compiled binary:
• Compilation environment details, such as:
– Compiler version
– Compile command used
– Intel® Quartus® Prime software version
• board_spec.xml from the BSP used for compiling
• Kernel fMAX (Quartus-compiled fMAX )
• BSP and board used for compiling
Syntax
You can also identify the BSP versions using the following command:
The Intel® oneAPI DPC++/C++ Compiler supports targeting multiple homogeneous FPGA devices from a single
host CPU. This allows you to improve your design's throughput by parallelizing the execution of your program on multiple FPGAs.
Intel® recommends creating a single context with multiple device queues because, with multiple contexts, buffers at the OpenCL layer must be copied between contexts, which introduces overhead and impacts overall performance.
However, you can use multi-context if your design is simple and the overhead does not affect the overall perfor-
mance.
Follow one of the following methods to target multiple FPGA devices:
Perform the following steps to target multiple FPGA devices with a single context:
1. Create a single SYCL* context to encapsulate a collection of FPGA devices of the same platform:
context ctxt(deviceList); // deviceList holds the FPGA devices of one platform (sketch; name assumed)
2. Create a command queue for each device in the context:
std::vector<queue> queueList;
for (unsigned int i = 0; i < ctxt.get_devices().size(); i++) {
queue newQueue(ctxt, ctxt.get_devices()[i], &m_exception_handler);
queueList.push_back(newQueue);
}
3. Submit either the same or different device codes to all available FPGA devices. If you want to target a
subset of all available devices, then you must first perform device selection to filter out unwanted devices.
Perform the following steps to target multiple FPGA devices with multiple contexts:
1. Obtain a list of all available FPGA devices. Optionally, you can select a device based on the device mem-
ber or device property. For device properties such as device name, use the member function get_-
info()const with the desired device property.
2. Create a command queue for each device:
std::vector<queue> queueList;
for (unsigned int i = 0; i < deviceList.size(); i++) {
queue newQueue(deviceList[i], &m_exception_handler);
queueList.push_back(newQueue);
}
3. Submit either the same or different device codes to all available FPGA devices. If you want to target a
subset of all available devices, then you must first perform device selection to filter out unwanted devices.
Limitations
To compile a design that targets multiple device types (using different device selectors), run the following commands:
Emulation Compile
For compiling your SYCL* code for the FPGA emulator target, execute the following commands:
# For Linux:
icpx -fsycl jit_kernel.cpp -c -o jit_kernel.o
# For Windows:
icx-cl -fsycl jit_kernel.cpp -c -o jit_kernel.o
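Only the CPU (JIT) kernel command survives in this excerpt; the FPGA (AOT) kernel is presumably compiled with the emulator macro, roughly as follows (a sketch, not a confirmed command line):
icpx -fsycl -fintelfpga -DFPGA_EMULATOR fpga_kernel.cpp -c -o fpga_kernel.o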
The design uses libraries and includes an FPGA kernel (AOT flow) and a CPU kernel (JIT flow). Specifically, there is a main function residing in the main.cpp file and one kernel each for the CPU (jit_kernel.cpp) and the FPGA (fpga_kernel.cpp).
sycl::cpu_selector device_selector;
queue deviceQueue(device_selector);
deviceQueue.submit([&](handler &cgh) {
// CPU Kernel function
});
#if defined(FPGA_EMULATOR)
ext::intel::fpga_emulator_selector device_selector;
#elif defined(FPGA_SIMULATOR)
ext::intel::fpga_simulator_selector device_selector;
#else
ext::intel::fpga_selector device_selector;
#endif
queue deviceQueue(device_selector);
deviceQueue.submit([&](handler &cgh) {
// FPGA Kernel Function
});
To compile for the FPGA hardware target, add the -Xshardware flag and remove the -DFPGA_EMULATOR flag, as
follows:
# For Linux:
icpx -fsycl jit_kernel.cpp -c -o jit_kernel.o
# For Windows:
icx-cl -fsycl jit_kernel.cpp -c -o jit_kernel.o
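Again, only the CPU kernel command survives; per the sentence above, the FPGA kernel compile would add -Xshardware and drop -DFPGA_EMULATOR, roughly as follows (a sketch):
icpx -fsycl -fintelfpga -Xshardware fpga_kernel.cpp -c -o fpga_kernel.o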
One of the main influences on the overall performance of an FPGA design is how kernels executing on the FPGA
interact with the host on the CPU.
FPGA devices typically communicate with the host (CPU) via PCIe.
This is an important factor influencing the performance of SYCL* programs targeting FPGAs. Furthermore, the
first time you run a particular SYCL program, you must configure the FPGA with its hardware bitstream, and this
may require several seconds.
Data Transfer
Typically, the FPGA board has its own private Double Data Rate (DDR) memory on which it primarily operates.
The CPU must bulk transfer, typically using direct memory access (DMA), all data that the kernel needs into the FPGA's local DDR memory. After the kernel completes its operations, results must be transferred over DMA
back to the CPU. The transfer speed is bound by the PCIe link itself and the efficiency of the DMA solution. For
example, the Intel® PAC with Intel® Arria® 10 GX FPGA has a PCIe Gen 3 x 8 link, and transfers are typically limited
to 6-7 GB/s.
The following are the techniques to manage these data transfer times:
• SYCL allows buffers to be tagged as read-only or write-only, which eliminates some unnecessary transfers.
• Improve the overall system efficiency by maximizing the number of concurrent operations. Since PCIe
supports simultaneous transfers in opposite directions and PCIe transfers do not interfere with kernel ex-
ecution, you can apply techniques such as double buffering. Refer to the Double Buffering Host Utilizing
Kernel Invocation Queue topic in the FPGA Optimization Guide for Intel® oneAPI Toolkits and the dou-
ble_buffering tutorial for additional information about these techniques.
• Improve data transfer throughput by prepinning system memory on board variants that support Restricted
USM. Refer to the Prepinning topic in the FPGA Optimization Guide for Intel® oneAPI Toolkits for addi-
tional information.
Configuration Time
You must program the hardware bitstream on the FPGA device in a process called configuration. Configuration
is a lengthy operation requiring several seconds of communication with the FPGA device. The SYCL runtime
manages configuration for you automatically. The runtime decides when the configuration occurs. For exam-
ple, the configuration might be triggered when a kernel is first launched, but subsequent launches of the same
kernel may not trigger configuration since the bitstream has not changed. Therefore, during development, Intel®
recommends timing the execution of the kernel after the FPGA has been configured, for example, by performing a warm-up execution of the kernel before the timed run. Remove this warm-up execution from production code.
If a SYCL program submits the same kernel to a SYCL queue multiple times (for example, by calling single_-
task within a loop), only one kernel invocation is active at a time. Each subsequent invocation of the kernel waits
for the previous run of the kernel to complete.
The preceding FPGA flow covered the basics of compiling for FPGA, but there is still much to learn about im-
proving the performance of your designs. The Intel® oneAPI DPC++/C++ Compiler provides tools that you can
use to find areas for improvement and a variety of flags, attributes, and extensions to control design and com-
piler behavior. You can find this information in the FPGA Optimization Guide for Intel® oneAPI Toolkits, which
should be your main reference if you want to understand how to optimize your design.
A static library is a single file that contains multiple functions. You can create a static library file from register transfer level (RTL) code. You can then include this library file and use the functions inside your SYCL* kernels.
To generate libraries that you can use with SYCL, you need to create the following files:
The format of the library files is determined by the operating system on which you compile your source code, with additional sections that carry extra library information.
• On Linux* platforms, a library is a .a archive file that contains .o object files.
• On Windows* platforms, a library is a .lib archive file that contains .obj object files.
You can call the functions in the library from your kernel without the need to know the hardware design or the
implementation details of the underlying functions in the library. Add the library to the icpx command line when
you compile your kernel.
Creating a library is a two-step process:
1. Each object file is created from an input source file using the fpga_crossgen command.
• An object file is effectively an intermediate representation of your source code with both a CPU rep-
resentation and an FPGA representation of your code.
• An object can be targeted for use with only one Intel® high-level design product. If you want to tar-
get more than one high-level design product, you must generate a separate object for each target
product.
2. Object files are combined into a library file using the fpga_libtool command. Objects created from dif-
ferent types of source code can be combined into a library, provided all objects target the same high-level
design product.
A library is automatically assigned a toolchain version number and can be used only with the targeted
high-level design product with the same version number.
You can create a library from object files generated from your source code. A SYCL-based object file includes both a CPU representation of your code, used for host execution and kernel emulation, and an FPGA representation for hardware.
Use the fpga_crossgen command to create library objects from your source code. An object created from your
source code contains information required both for emulating the functions in the object and synthesizing the
hardware for the object functions.
The fpga_crossgen command creates one object file from one input source file. The object created can be
used only in libraries that target the same high-level design tool. Also, objects are versioned: each object is assigned a compiler version number and can be used only with high-level design tools with the same version number.
Create a library object using the following command:
Example command:
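Both the command syntax and the example are missing from this excerpt. For an RTL source, the call takes a spec XML file, an emulation model, and a target, roughly as in this sketch (the file names are hypothetical and the option spellings are assumptions):
fpga_crossgen lib_rtl_spec.xml --emulation_model lib_rtl_model.cpp --target sycl -o lib_rtl.o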
Gather the object files into a library file so that others can incorporate the library into their projects and call the
functions that are contained in the objects in the library. To package object files into a library, use the fpga_-
libtool command.
Before you package object files into a library, ensure that you have the path information for all of the object files
that you want to include in the library.
All objects you want to package into a library must have the same version number. The fpga_libtool command
creates libraries encapsulated in operating-system-specific archive files (.a on Linux* and .lib on Windows*).
You cannot use libraries created on one operating system with an Intel® high-level design product running on a
different operating system.
Create a library file using the following command:
Example command:
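The command lines are missing from this excerpt; a sketch consistent with the description that follows (file names hypothetical):
fpga_libtool lib_rtl.o --target sycl --create lib.a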
The command packages objects created from RTL source code into a SYCL library called lib.a.
Note: For additional information, refer to the FPGA tutorial sample “Use Library” listed in the Intel® oneAPI
Samples Browser on Linux* or Windows*, or access the code sample on Github.
You can include static libraries in your compiler command along with your source files, as shown in the following
command:
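The command itself is omitted; it is an ordinary icpx invocation with the library listed after the sources, for example (names hypothetical):
icpx -fsycl -fintelfpga main.cpp lib.a -o main.out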
Note: For the functions you implemented in RTL to be usable, you must declare them in your source code so
that the compiler can dynamically link the functions. For example:
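The example declaration is missing from this excerpt; a sketch of the shape such a declaration takes (the function name and signature are hypothetical):
// Declare the RTL library function so that kernel code can call it.
SYCL_EXTERNAL extern "C" unsigned RtlByteswap(unsigned x);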
When creating your RTL module for use inside SYCL kernels, ensure that the RTL module operates within the
following restrictions:
• An RTL module must use a single input Avalon® streaming interface. A single pair of ready and valid logic
must control all the inputs. You have the option to provide the necessary Avalon® streaming interface ports
but declare the RTL module as stall-free. In this case, you do not have to implement proper stall behavior
because the Intel® oneAPI DPC++/C++ Compiler creates a wrapper for your module. Refer to Object Man-
ifest File Syntax of an RTL Module for additional information.
Note: You must handle ivalid signals properly if your RTL module has an internal state. Refer to Stall-
Free RTL for more information.
• The RTL module must work correctly regardless of the kernel clock frequency.
• RTL modules cannot connect to external I/O signals. All input and output signals must come from a SYCL
kernel.
• An RTL module must have a clock port, a resetn port, and Avalon® streaming interface input and output
ports (that is, ivalid, ovalid, iready, oready). Name the ports as specified here.
• RTL modules that communicate with external memory must have Avalon® memory-mapped interface port
parameters that match the corresponding Custom Platform parameters. The Intel® oneAPI DPC++/C++
Compiler does not perform any width or burst adaptation.
• RTL modules that communicate with external memory must behave as follows:
– They cannot burst across the burst boundary.
– They cannot make requests every clock cycle and stall the hardware by monopolizing the arbitration
logic. An RTL module must pause its requests regularly to allow other load or store units to execute
their operations.
• RTL modules cannot act as stand-alone SYCL kernels. RTL modules can only be helper functions and be
integrated into a SYCL kernel during kernel compilation.
• Every function call corresponding to RTL module instantiation is independent of other instantiations.
There is no hardware sharing.
• Do not incorporate kernel code into a SYCL library file. Incorporating kernel code into the library file causes
the offline compiler to issue an error message. You may incorporate helper functions into the library file.
• An RTL component must receive all its inputs at the same time. A single ivalid input signifies that all
inputs contain valid data.
• You can only set RTL module parameters in the <RTL module description file name>.xml specifi-
cation file and not in the SYCL kernel source file. To use the same RTL module with multiple parameters,
create a separate FUNCTION tag for each parameter combination.
• You can only pass data inputs to an RTL module by value via the SYCL kernel code. Do not pass data
inputs to an RTL module via pass-by-reference, structs, or channels. In the case of channel data, pass the
extracted scalar data.
Note: Passing data inputs to an RTL module via pass-by-reference or structs causes a fatal error in the
offline compiler.
• The debugger (for example, GDB for Linux) cannot step into a library function during emulation if the li-
brary is built without the debug information. However, irrespective of whether the library is built with or
without the debug data, optimization and area reports are not mapped to the individual code line num-
bers inside a library.
• Names of RTL module source files cannot conflict with the file names of Intel® oneAPI DPC++/C++ Com-
piler IP. Both the RTL module source files and the compiler IP files are stored in the <kernel file name>/
system/synthesis/submodules directory. Naming conflicts cause existing compiler IP files in the direc-
tory to be overwritten by the RTL module source files.
• The compiler does not support .qip files. You must manually parse nested .qip files to create a flat list of
RTL files.
Tip: It is challenging to debug an RTL module that works correctly on its own but works incorrectly as
part of a SYCL kernel. Double-check all parameters under the ATTRIBUTES element in the <RTL object
manifest file name>.xml file.
• All compiler area estimation tools assume that the RTL module area is 0. The compiler does not currently
support specifying an area model for RTL modules.
Use the Intel® oneAPI DPC++/C++ Compiler to compile your SYCL code to a C-standard shared library (.so file
on Linux and .dll file on Windows). You can then call this library from other third-party code to access a broad
base of accelerated functions from your preferred programming language.
Intel® recommends defining an interface between the C-standard shared library and your SYCL code. The inter-
face must include functions you want to export and how those functions interface with your SYCL code. Prefix
the functions that you want to include in the shared library with extern "C".
Note: If you do not prefix with extern "C", then the functions appear with mangled names in the shared library.
extern "C" int vector_add(int *a, int *b, int **c, size_t vector_len) {
// Create device selector for the device of your interest.
#if FPGA_EMULATOR
// SYCL extension: FPGA emulator selector on systems without an FPGA card.
ext::intel::fpga_emulator_selector d_selector;
#elif FPGA_SIMULATOR
// SYCL extension: FPGA simulator selector
ext::intel::fpga_simulator_selector d_selector;
#elif FPGA
// SYCL extension: FPGA selector on systems with an FPGA card.
ext::intel::fpga_selector d_selector;
#else
// The default device selector selects the most performant device.
default_selector d_selector;
#endif
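  // ... (the remainder of the function, which creates a sycl::queue on
  // d_selector and submits the vector-add kernel, is not shown here)
}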
If you are using a Linux system, then perform these steps to generate the shared library file:
1. Compile the device code separately.
icpx -fsycl -fPIC -fintelfpga -fsycl-link=image [kernel src files] -o <hw image name> -Xshardware
Where:
• fPIC: Determines whether the compiler generates position-independent code for the host portion
of the device image. Option -fPIC specifies full symbol preemption. Global symbol definitions and
global symbol references get default (preemptable) visibility unless explicitly specified otherwise.
You must use this option when building shared objects. You can also specify this option as -fpic.
Note: PIC is required so that pointers in the shared library reference global addresses and not local
addresses.
2. Compile the host code.
icpx -fsycl -fPIC -fintelfpga <host src files> -o <host image name> -c -DFPGA=1
Where:
• DFPGA=1: Sets a compiler macro, FPGA, equal to 1. It is used in the device selector to change between
target devices (requires corresponding host code to support this). This is optional as you can also
set your device selector to FPGA.
3. Link the host and device images and create the binary.
icpx -fsycl -fPIC -fintelfpga -shared <host image name> <hw image name> -o lib<library name>.so
Where:
• shared: Outputs a shared library (.so file).
• Output file name: Prefix with lib for GCC-type compilers. For additional information, see
Shared libraries with GCC on Linux.
Note: Instead of the above multi-step process, you can also perform a single-step compilation to generate the
shared library. However, you must perform a full compile if you want to build the executable for testing purposes
(for example, a.out) or if you make changes in the SYCL code or C interface.
If you are using a Windows system, then perform these steps to generate the library file:
Note:
• Intel® recommends creating a new configuration in the same project properties so that, if you later want
to build the application executable, you do not have to change the configuration type of your project.
• Creating a Windows library with the default Intel® oneAPI Base Toolkit and the Intel® Programmable Accel-
eration Card (PAC) with Intel® Arria® 10 GX FPGA or the Intel® FPGA PAC D5005 (previously known as
Intel® PAC with Intel® Stratix® 10 SX FPGA) is supported only for FPGA emulation. For custom platforms,
contact your board vendor for Windows support for FPGA hardware compiles.
1. In Microsoft Visual Studio*, navigate to Project > Properties. The Property Pages dialog is displayed for
your project.
2. Under the Configuration Properties > General > Project Defaults > Configuration Type option, select
Dynamic Library (.dll) from the drop-down list.
These steps may vary depending on the language or compiler you decide to use. Consult the specifications for
your desired language for more details. See Shared libraries with GCC on Linux for an example.
Generally, follow these steps to use the shared library:
1. Use the shared library function call in your third-party host code.
2. Link your host code with the shared library during the compilation.
3. Ensure that the library file is discoverable. For example:
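One common way to do this on Linux is to add the library's directory to the loader search path; the path below is a placeholder, not part of the original example:
export LD_LIBRARY_PATH=/path/to/library/dir:$LD_LIBRARY_PATH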
The oneAPI tools integrate with third-party integrated development environments (IDEs) on Linux (Eclipse*)
and Windows (Visual Studio*) to provide a seamless GUI experience for software development. See FPGA
Workflows on Third-Party IDEs for Intel® oneAPI Toolkits for more details.
For FPGA development with Visual Studio Code on Linux*, refer to FPGA Development for Intel® oneAPI Toolk-
its with Visual Studio Code on Linux.
– Range-Based API
• Tested Standard C++ APIs
• Random Number Generator
oneDPL sample code is available from the oneAPI GitHub repository https://github.com/oneapi-src/
oneAPI-samples/tree/master/Libraries/oneDPL. Each sample includes a readme with build instructions.
When using the SYCL* interfaces, there are a few changes to consider:
• oneMKL has a dependency on the Intel oneAPI DPC++/C++ Compiler and Intel oneAPI DPC++ Library.
Applications must be built with the Intel oneAPI DPC++/C++ Compiler, the SYCL headers made available,
and the application linked with oneMKL using the DPC++ linker.
• SYCL interfaces in oneMKL use device-accessible Unified Shared Memory (USM) pointers for input data
(vectors, matrices, etc.).
• Many SYCL interfaces in oneMKL also support the use of sycl::buffer objects in place of the device-
accessible USM pointers for input data.
• SYCL interfaces in oneMKL are overloaded based on the floating point types. For example, there are sev-
eral general matrix multiply APIs, accepting single precision real arguments (float), double precision real
arguments (double), half precision real arguments (half), and complex arguments of different precision
using the standard library types std::complex<float>, std::complex<double>.
• A two-level namespace structure for oneMKL is added for SYCL interfaces:
Namespace             Description
oneapi::mkl           Contains common elements between various domains in oneMKL
oneapi::mkl::blas     Contains dense vector-vector, matrix-vector, and matrix-matrix low-level operations
oneapi::mkl::lapack   Contains higher-level dense matrix operations like matrix factorizations and eigensolvers
oneapi::mkl::rng      Contains random number generators for various probability density functions
oneapi::mkl::stats    Contains basic statistical estimates for single and double precision multi-dimensional datasets
oneapi::mkl::vm       Contains vector math routines
oneapi::mkl::dft      Contains fast Fourier transform operations
oneapi::mkl::sparse   Contains sparse matrix operations like sparse matrix-vector multiplication and sparse triangular solver
To demonstrate a typical workflow for oneMKL with SYCL* interfaces, the following example source code
snippets perform a double precision matrix-matrix multiplication on a GPU device.
Note: The following code example requires additional code to compile and run, as indicated by the inline com-
ments.
Consider that double precision matrices A (of size m-by-k), B (of size k-by-n), and C (of size m-by-n) are stored
in arrays on the host machine with leading dimensions ldA, ldB, and ldC, respectively. Given double precision
scalars alpha and beta, compute the matrix-matrix multiplication (mkl::blas::gemm):
C = alpha * A * B + beta * C
Include the standard SYCL headers and the oneMKL SYCL/DPC++ specific header that declares the desired
mkl::blas::gemm API:
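A minimal version of those includes might look like the following (assuming the oneapi/mkl.hpp umbrella header shipped with recent oneMKL releases):

#include <CL/sycl.hpp>     // standard SYCL headers
#include "oneapi/mkl.hpp"  // declares oneapi::mkl::blas::gemm, among others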
Next, load or instantiate the matrix data on the host machine as usual and then create the GPU device, create
an asynchronous exception handler, and finally create the queue on the device with that exception handler. Ex-
ceptions that occur on the host can be caught using standard C++ exception handling mechanisms; however,
exceptions that occur on a device are considered asynchronous errors and stored in an exception list to be pro-
cessed later by this user-provided exception handler.
// Create GPU device
sycl::device my_device;
try {
    my_device = sycl::device(sycl::gpu_selector());
}
catch (...) {
    std::cout << "Warning: GPU device not found! Using default device instead." << std::endl;
}

// Create an asynchronous exception handler to be attached to the queue.
// Not required, but it can provide helpful information if the system is not correctly configured.
auto my_exception_handler = [](sycl::exception_list exceptions) {
    for (std::exception_ptr const& e : exceptions) {
        try {
            std::rethrow_exception(e);
        }
        catch (sycl::exception const& e) {
            std::cout << "Caught asynchronous SYCL exception:\n"
                      << e.what() << std::endl;
        }
        catch (std::exception const& e) {
            std::cout << "Caught asynchronous STL exception:\n"
                      << e.what() << std::endl;
        }
    }
};
The matrix data is now loaded into the SYCL buffers, which enables offloading to desired devices and then back
to host when complete. Finally, the mkl::blas::gemm API is called with all the buffers, sizes, and transpose
operations, which will enqueue the matrix multiply kernel and data onto the desired queue.
// create execution queue on my gpu device with exception handler attached
sycl::queue my_queue(my_device, my_exception_handler);

// create sycl buffers of matrix data for offloading between device and host
sycl::buffer<double, 1> A_buffer(A.data(), A.size());
sycl::buffer<double, 1> B_buffer(B.data(), B.size());
sycl::buffer<double, 1> C_buffer(C.data(), C.size());

// add oneapi::mkl::blas::gemm to execution queue and catch any synchronous exceptions
try {
    using oneapi::mkl::blas::gemm;
    using oneapi::mkl::transpose;
    gemm(my_queue, transpose::nontrans, transpose::nontrans, m, n, k, alpha,
         A_buffer, ldA, B_buffer, ldB, beta, C_buffer, ldC);
}
catch (sycl::exception const& e) {
    std::cout << "\t\tCaught synchronous SYCL exception during GEMM:\n"
              << e.what() << std::endl;
}
catch (std::exception const& e) {
    std::cout << "\t\tCaught synchronous STL exception during GEMM:\n"
              << e.what() << std::endl;
}
At some time after the gemm kernel has been enqueued, it will be executed. The queue is asked to wait for all
kernels to execute and then pass any caught asynchronous exceptions to the exception handler to be thrown.
The runtime will handle transfer of the buffer’s data between host and GPU device and back. By the time an
accessor is created for the C_buffer, the buffer data will have been silently transferred back to the host machine
if necessary. In this case, the accessor is used to print out a 2x2 submatrix of C_buffer.
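A sketch of that final step, assuming column-major storage with leading dimension ldC (consistent with the gemm call above):

// wait for all enqueued kernels and surface any asynchronous exceptions
my_queue.wait_and_throw();
// creating a host accessor transfers C_buffer's data back to the host if needed
auto C_acc = C_buffer.get_access<sycl::access::mode::read>();
// print a 2x2 submatrix of C (column-major, leading dimension ldC)
std::cout << C_acc[0 + 0 * ldC] << " " << C_acc[0 + 1 * ldC] << std::endl
          << C_acc[1 + 0 * ldC] << " " << C_acc[1 + 1 * ldC] << std::endl;
return 0;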
Note that the resulting data is still in the C_buffer object and, unless it is explicitly copied elsewhere (like back
to the original C container), it remains available only through accessors until C_buffer goes out of scope.
oneTBB can be used with the Intel oneAPI DPC++/C++ Compiler in the same way as with any other C++ compiler.
For more details, see the oneTBB documentation.
Currently, oneTBB does not directly use any accelerators. However, it can be combined with SYCL*, OpenMP*
offload, and other oneAPI libraries to build a program that efficiently uses all available hardware resources.
Three basic oneTBB code samples are available within the oneAPI GitHub repository https://github.com/
oneapi-src/oneAPI-samples/tree/master/Libraries/oneTBB. All samples are prepared for CPU and GPU.
• tbb-async-sycl: illustrates how a computational kernel can be split for execution between CPU and GPU
using oneTBB Flow Graph asynchronous and functional nodes. The asynchronous node uses SYCL* to
implement the GPU calculations, while the functional node does the CPU part of the calculations.
• tbb-task-sycl: illustrates how two oneTBB tasks can execute similar computational kernels, with one
task executing SYCL code and the other executing oneTBB code.
• tbb-resumable-tasks-sycl: illustrates how a computational kernel can be split for execution between
a CPU and GPU using a oneTBB resumable task and parallel_for. The resumable task uses SYCL to
implement the GPU calculations, while parallel_for does the CPU portion of the calculations.
Information about the dependencies needed to build and link your application with oneDAL is available from
the oneDAL System Requirements.
A oneDAL-based application can seamlessly execute algorithms on CPU or GPU by picking the proper device
selector. New capabilities also allow:
• extracting SYCL* buffers from numeric tables and passing them to a custom kernel
• creating numeric tables from SYCL buffers
Algorithms are optimized to reuse SYCL buffers, keeping data on the GPU and avoiding the overhead of
repeatedly copying data between GPU and CPU.
oneDAL code samples are available from the oneDAL GitHub. The following code sample is a recommended
starting point: https://github.com/oneapi-src/oneDAL/tree/master/examples/oneapi/dpc/source/svm
Refer to the Intel oneAPI Collective Communications Library System Requirements for a full list of hardware and
software dependencies, such as MPI and Intel oneAPI DPC++/C++ Compiler.
SYCL*-aware API is an optional feature of oneCCL. There is a choice between CPU and SYCL back ends when
creating the oneCCL stream object.
• For CPU backend: Specify ccl_stream_host as the first argument.
• For SYCL backend: Specify ccl_stream_cpu or ccl_stream_gpu depending on the device type.
• For collective operations that operate on the SYCL stream:
– For the C API, oneCCL expects communication buffers to be sycl::buffer objects cast to void*.
– For the C++ API, oneCCL expects communication buffers to be passed by reference.
Additional usage details are available from https://oneapi-src.github.io/oneCCL/.
oneCCL code samples are available from the oneAPI GitHub repository https://github.com/oneapi-src/
oneAPI-samples/tree/master/Libraries/oneCCL.
A Getting Started sample with instructions to build and run the code is available from within the same GitHub
repository.
oneDNN supports systems based on Intel 64 architecture or compatible processors. A full list of supported CPU
and graphics hardware is available from the Intel oneAPI Deep Neural Network Library System Requirements.
oneDNN detects the instruction set architecture (ISA) at run time and uses just-in-time code generation to
deploy code optimized for the latest supported ISA.
Several packages are available for each operating system to ensure interoperability with CPU or GPU runtime
libraries used by the application.
The packages do not include library dependencies; these need to be resolved in the application at build time
with oneAPI toolkits or third-party tools.
When used in the SYCL* environment, oneDNN relies on the DPC++ SYCL runtime to interact with CPU or GPU
hardware. oneDNN may be used with other code that uses SYCL. To do this, oneDNN provides API extensions
to interoperate with underlying SYCL objects.
One of the possible scenarios is executing a SYCL kernel for a custom operation not provided by oneDNN. In
this case, oneDNN provides all necessary APIs to seamlessly submit a kernel, sharing the execution context with
oneDNN: using the same device and queue.
The interoperability API is provided for two scenarios:
• Construction of oneDNN objects based on existing SYCL objects
• Accessing SYCL objects for existing oneDNN objects
The mapping between oneDNN and SYCL objects is summarized in the tables below.
Note: Internally, library memory objects use 1D uint8_t SYCL buffers; however, SYCL buffers of a different type
can be used to initialize and access memory. In this case, the buffers are reinterpreted to the underlying type
cl::sycl::buffer<uint8_t, 1>.
Note:
• Building applications with oneDNN requires a compiler. The Intel oneAPI DPC++/C++ Compiler is avail-
able as part of the Intel oneAPI Base Toolkit.
• You must include dnnl_sycl.hpp to enable the SYCL-interop API.
• Because OpenMP does not rely on the passing of runtime objects, it does not require an interoperability
API to work with oneDNN.
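For example, here is a rough sketch of the first interoperability scenario: constructing oneDNN objects on top of existing SYCL objects via the dnnl::sycl_interop namespace, assuming a oneDNN release that provides the dnnl_sycl.hpp header mentioned above.

#include <CL/sycl.hpp>
#include "dnnl_sycl.hpp"

int main() {
    // an existing SYCL queue, e.g. one already used by custom kernels
    sycl::queue q{sycl::gpu_selector{}};

    // build a oneDNN engine and stream that share that device, context, and queue
    dnnl::engine eng = dnnl::sycl_interop::make_engine(q.get_device(), q.get_context());
    dnnl::stream strm = dnnl::sycl_interop::make_stream(eng, q);

    // oneDNN primitives executed on strm now share the execution
    // context (device and queue) with custom SYCL kernels submitted to q
    return 0;
}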
Priority support is available as a paid option. For Intel community support, visit the oneVPL forum. For the
community-supported open-source version, visit the oneVPL GitHub* page.
Applications can use oneVPL to program video decoding, encoding, and image processing components.
oneVPL provides a default CPU implementation that can be used as a reference design before using other ac-
celerators.
oneVPL applications follow a basic sequence in the programming model:
1. The oneVPL dispatcher automatically finds all available accelerators during runtime.
2. The dispatcher uses the selected accelerator context to initialize a session.
3. oneVPL configures the video component at the start of the session.
4. oneVPL processing loop is launched. The processing loop handles work asynchronously.
5. If the application chooses to let oneVPL manage working memory, then memory allocation will be implicitly
managed by the video calls in the processing loop.
6. After work is done, oneVPL uses a clear call to clean up all resources.
The oneVPL API is defined using a classic C style interface and is compatible with C++ and SYCL*.
oneVPL provides rich code samples to show how to use the oneVPL API. The code samples are included in the
release package and are also available from the oneAPI-samples repository on GitHub*.
For example, the hello-decode sample shows a simple decode operation of HEVC input streams and demon-
strates the basic steps in the oneVPL programming model.
The sample can be broken down into the following key steps in the code:
Note: The snippets below may not reflect the latest version of the sample. Refer to the release package or
sample repository for the latest version of this example.
1. Initialize the dispatcher and create a session:
loader = MFXLoad();
cfg = MFXCreateConfig(loader);
ImplValue.Type = MFX_VARIANT_TYPE_U32;
ImplValue.Data.U32 = MFX_CODEC_HEVC;
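The listing continues in the sample. Based on the description below, the continuation looks roughly like this (the property name string follows the hello-decode sample; treat it as illustrative):

MFXSetConfigFilterProperty(cfg,
    (mfxU8 *)"mfxImplDescription.mfxDecoderDescription.decoder.CodecID", ImplValue);
MFXCreateSession(loader, 0, &session);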
Here, MFXCreateConfig() creates the dispatcher internal configuration. Once the dispatcher is config-
ured, the application uses MFXSetConfigFilterProperty() to set its requirements including codec ID
and accelerator preference. After the application sets the desired requirements, the session is created.
2. Start the decoding loop:
while(is_stillgoing) {
sts = MFXVideoDECODE_DecodeFrameAsync(session,
(isdraining) ? NULL : &bitstream,
NULL,
&pmfxOutSurface,
&syncp);
......
}
After the input stream is prepared, it has the required context and the decoding loop is started
immediately.
MFXVideoDECODE_DecodeFrameAsync() takes the bit stream as the second parameter. When the bit
stream becomes NULL, oneVPL drains the remaining frames from the input and completes the operation.
The third parameter is the working memory; the NULL input shown in the example means the application
wants oneVPL to manage working memory.
3. Evaluate results of a decoding call:
while (is_stillgoing) {
    sts = MFXVideoDECODE_DecodeFrameAsync(...);
    switch (sts) {
        case MFX_ERR_MORE_DATA:
            ......
            ReadEncodedStream(bitstream, codec_id, source);
            ......
            break;
        case MFX_ERR_NONE:
            do {
                sts = pmfxOutSurface->FrameInterface->Synchronize(pmfxOutSurface,
                                                                  WAIT_100_MILLISECONDS);
                if (MFX_ERR_NONE == sts) {
                    // map the output surface to host memory (see the note on Map() below)
                    sts = pmfxOutSurface->FrameInterface->Map(pmfxOutSurface, MFX_MAP_READ);
                    WriteRawFrame(pmfxOutSurface, sink);
                    sts = pmfxOutSurface->FrameInterface->Unmap(pmfxOutSurface);
                    sts = pmfxOutSurface->FrameInterface->Release(pmfxOutSurface);
                    framenum++;
                }
            } while (sts == MFX_WRN_IN_EXECUTION);
            break;
        default:
            break;
    }
}
For each MFXVideoDECODE_DecodeFrameAsync() call, the application continues to read the input bit
stream until oneVPL completes a new frame with MFX_ERR_NONE, indicating the function successfully
completed its operation. For each new frame, the application waits until the output memory (surface)
is ready and then outputs and releases the output frame.
The Map() call is used to map the memory from the discrete graphics memory space to the host memory
space.
4. Exit and do cleanup:
MFXUnload(loader);
free(bitstream.Data);
fclose(sink);
fclose(source);
Finally, MFXUnload() is called to reclaim the resources from oneVPL. This is the only call that the applica-
tion must execute to reclaim the oneVPL library resources.
Note: This example explains the key steps in the oneVPL programming model. It does not explain utility func-
tions for input and output.
SYCL is a single-source style programming model based on C++. It builds on features of C++17 and C++20 to
offer an open, multivendor, multiarchitecture solution for heterogeneous programming.
The DPC++ compiler project is bringing SYCL* to an LLVM C++ compiler, with high performance implementa-
tions for multiple vendors and architectures.
When accelerating an existing C++ application, SYCL provides seamless integration as most of the C++ code
remains intact. Refer to sections within oneAPI Programming Model for SYCL constructs to enable device side
compilation.
The Intel® DPC++ Compatibility Tool is part of the Intel® oneAPI Base Toolkit. The goal of this tool is to assist in
the migration of an existing program written in NVIDIA* CUDA* to a program written in SYCL* and compiled
with the DPC++ compiler. The tool generates as much SYCL code as it can; however, it will not migrate all code,
and manual changes may be required. To complete the migration, the tool provides help in the form of IDE
plug-ins, a user guide, and comments embedded in the code. After completing any manual changes, use a
DPC++ compiler to create executables.
• Additional details, including examples of migrated code and download instructions for the tool, are avail-
able from the Intel® DPC++ Compatibility Tool website.
• Full usage information is available from the Intel® DPC++ Compatibility Tool User Guide.
The SYCL runtime for the DPC++ project uses OpenCL and other means to enact the parallelism. SYCL typically
requires fewer lines of code to implement kernels and also fewer calls to essential API functions and methods. It
enables creation of OpenCL programs by embedding the device source code in line with the host source code.
OpenCL application developers are keenly aware of the somewhat verbose setup code that goes with offload-
ing kernels on devices. Using SYCL, it is possible to develop a clean, modern C++ based application without
most of the setup associated with OpenCL C code. This reduces the learning effort and allows for focus on
parallelization techniques.
However, OpenCL application features can continue to be used via the SYCL API. The updated code can use
as much or as little of the SYCL interface as desired.
When programming with SYCL* and the DPC++ compiler, a platform consists of a host connected to zero
or more devices, such as CPUs, GPUs, FPGAs, or other kinds of accelerators and processors.
When a platform has multiple devices, design the application to offload some or most of the work to the devices.
There are different ways to distribute work across devices in the oneAPI programming model:
1. Initialize device selector – SYCL provides a set of classes called selectors that allow manual selection of
devices in the platform or let oneAPI runtime heuristics choose a default device based on the compute
power available on the devices.
2. Splitting datasets – With a highly parallel application with no data dependency, explicitly divide the
datasets to employ different devices. The following code sample is an example of dispatching workloads
across multiple devices. Use icpx -fsycl snippet.cpp to compile the code.
#include <CL/sycl.hpp>
#include <iostream>
using namespace cl::sycl;

int main() {
  int data[1024];
  for (int i = 0; i < 1024; i++)
    data[i] = i;
  try {
    cpu_selector cpuSelector;
    queue cpuQueue(cpuSelector);
    gpu_selector gpuSelector;
    queue gpuQueue(gpuSelector);
    buffer<int, 1> buf(data, range<1>(1024));
    // the CPU queue updates the first half of the dataset
    cpuQueue.submit([&](handler& cgh) {
      auto ptr = buf.get_access<access::mode::read_write>(cgh);
      cgh.parallel_for<class divide>(range<1>(512), [=](id<1> index) {
        ptr[index] -= 1;
      });
    });
    // the GPU queue updates the second half, 512 items starting at offset 512
    gpuQueue.submit([&](handler& cgh1) {
      auto ptr = buf.get_access<access::mode::read_write>(cgh1);
      cgh1.parallel_for<class offset1>(range<1>(512), id<1>(512), [=](id<1> index) {
        ptr[index] += 1;
      });
    });
    cpuQueue.wait();
    gpuQueue.wait();
  }
  catch (exception const& e) {
    std::cout << "SYCL exception caught: " << e.what() << '\n';
    return 2;
  }
  return 0;
}
3. Target multiple kernels across devices – If the application has scope for parallelization across multiple inde-
pendent kernels, employ different queues to target devices. Obtain the list of supported SYCL platforms,
and the list of devices for each platform, by calling get_platforms() and platform.get_devices()
respectively. Once all the devices are identified, construct a queue per device and dispatch different
kernels to different queues. The following code sample demonstrates dispatching a kernel on multiple
SYCL devices.
#include <stdio.h>
#include <vector>
#include <CL/sycl.hpp>
using namespace cl::sycl;
using namespace std;
int main()
{
size_t N = 1024;
  // ... (the remainder of the listing is sketched below)
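The remainder of the listing is not reproduced in this copy. A minimal sketch of the idea it demonstrates, under the approach described above (enumerate every platform's devices, build a queue per device, and submit a kernel to each queue; the data and kernel names here are illustrative, not from the original sample):

  // continuation of main() above (sketch)
  vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);
  // build one queue per device across all platforms
  vector<queue> queues;
  for (auto& p : platform::get_platforms())
    for (auto& d : p.get_devices())
      queues.push_back(queue(d));
  {
    buffer<float, 1> bufA(a.data(), range<1>(N));
    buffer<float, 1> bufB(b.data(), range<1>(N));
    buffer<float, 1> bufC(c.data(), range<1>(N));
    // dispatch work to each device's queue; here every queue
    // runs the same vector addition for brevity
    for (auto& q : queues) {
      q.submit([&](handler& cgh) {
        auto A = bufA.get_access<access::mode::read>(cgh);
        auto B = bufB.get_access<access::mode::read>(cgh);
        auto C = bufC.get_access<access::mode::write>(cgh);
        cgh.parallel_for<class vadd>(range<1>(N),
                                     [=](id<1> i) { C[i] = A[i] + B[i]; });
      });
    }
  } // buffer destructors synchronize results back to the host
  printf("c[0] = %f\n", c[0]);
  return 0;
}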
6.2 Composability
The oneAPI programming model enables an ecosystem with support for the entire development toolchain. It
includes compilers and libraries, debuggers, and analysis tools to support multiple accelerators like CPU, GPUs,
FPGA, and more.
The oneAPI programming model provides a unified compiler based on LLVM/Clang with support for OpenMP*
offload. This enables the seamless use of OpenMP constructs to either parallelize host-side applications or
offload to a target device. Both the Intel® oneAPI DPC++/C++ Compiler, available with the Intel®
applications or offload to a target device. Both the Intel® oneAPI DPC++/C++ Compiler, available with the Intel®
oneAPI Base Toolkit, and Intel® C++ Compiler Classic, available with the Intel® oneAPI HPC Toolkit or the Intel®
oneAPI IoT Toolkit, support OpenMP and SYCL composability with a set of restrictions. A single application can
offload execution to available devices using OpenMP target regions or SYCL constructs in different parts of the
code, such as different functions or code segments.
OpenMP and SYCL offloading constructs may be used in separate files, in the same file, or in the same function
with some restrictions. OpenMP and SYCL offloading code can be bundled together in executable files, in static
libraries, in dynamic libraries, or in various combinations.
Note: The SYCL runtime for DPC++ uses the TBB runtime when executing device code on the CPU; hence,
using both OpenMP and SYCL on a CPU can lead to thread oversubscription. Performance analysis of workloads
executing on the system can help determine whether this is occurring.
Restrictions
There are some restrictions to be considered when mixing OpenMP and SYCL constructs in the same applica-
tion.
• OpenMP directives cannot be used inside SYCL kernels that run on the device. Similarly, SYCL code can-
not be used inside OpenMP target regions. However, it is possible to use SYCL constructs within the
OpenMP code that runs on the host CPU.
• OpenMP and SYCL device parts of the program cannot have cross dependencies. For example, a func-
tion defined in the SYCL part of the device code cannot be called from the OpenMP code that runs on the
device and vice versa. OpenMP and SYCL device parts are linked independently and they form separate
binaries that become a part of the resulting fat binary that is generated by the compiler.
• Direct interaction between the OpenMP and SYCL runtime libraries is not supported at this time. For
example, a device memory object created by the OpenMP API is not accessible by SYCL code; using a
device memory object created by OpenMP in SYCL code results in unspecified behavior.
Example
The following code snippet uses SYCL and OpenMP offloading constructs in the same application.
#include <CL/sycl.hpp>
#include <array>
#include <iostream>
float computePi(unsigned N) {
  float Pi = 0.0f;  // initialize so the reduction result is well defined
#pragma omp target map(tofrom : Pi)
#pragma omp parallel for reduction(+ : Pi)
  for (unsigned I = 0; I < N; ++I) {
    float T = (I + 0.5f) / N;
    Pi += 4.0f / (1.0f + T * T);
  }
  return Pi / N;
}
int main() {
std::array<float, 1024u> Vec;
float Pi;
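The body of main is truncated in this copy. The following is a sketch consistent with the program output shown below; the kernel name fill is illustrative, not from the original sample:

  // SYCL: fill Vec on a device
  {
    sycl::queue Q;
    sycl::buffer<float, 1> Buf(Vec.data(), sycl::range<1>(Vec.size()));
    Q.submit([&](sycl::handler& cgh) {
      auto Acc = Buf.get_access<sycl::access::mode::write>(cgh);
      cgh.parallel_for<class fill>(sycl::range<1>(Vec.size()),
                                   [=](sycl::id<1> I) { Acc[I] = I[0]; });
    });
  } // Buf's destructor copies the data back to Vec

  // OpenMP: compute Pi via offload
  Pi = computePi(1024u);

  std::cout << "Vec[512] = " << Vec[512] << std::endl;
  std::cout << "Pi = " << Pi << std::endl;
  return 0;
}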
The following command is used to compile the example code: icpx -fsycl -fiopenmp
-fopenmp-targets=spir64 offloadOmp_dpcpp.cpp
where
• -fsycl option enables SYCL
• -fiopenmp -fopenmp-targets=spir64 option enables OpenMP* offload
The following shows the program output from the example code.
./a.out
Vec[512] = 512
Pi = 3.14159
Note: If the code does not contain OpenMP offload, but only normal OpenMP code, use the following com-
mand, which omits -fopenmp-targets: icpx -fsycl -fiopenmp omp_dpcpp.cpp
The oneAPI programming model enables developers to continue using all OpenCL code features via different
parts of the SYCL* API. The OpenCL code interoperability mode provided by SYCL helps reuse the existing
OpenCL code while keeping the advantages of higher programming model interfaces provided by SYCL. There
are two main parts to the interoperability mode:
1. Creating SYCL objects from OpenCL code objects. For example, a SYCL buffer can be constructed from
an OpenCL cl_mem, or a SYCL queue from a cl_command_queue.
2. Getting OpenCL code objects from SYCL objects. For example, launching an OpenCL kernel that uses an
implicit cl_mem associated with a SYCL accessor.
– Debug what is happening in your code on the host using a standard debugger, such as Intel Distri-
bution for GDB*.
– Debug problems on the offload device using a device-specific debugger. Note, however, that the
device may have a different architecture, conventions for representing compute threads, or assem-
bly than the host.
– To debug problems that show up in the intermediate software stack only when kernels and data are
being exchanged with the device, you need to monitor the communication between device and host
and any errors that are reported during the process.
• Besides the usual performance issues that can occur on the host and offload devices, the patterns by
which the host and offload device work together can have a profound impact on application performance.
This is another case where you need to monitor the communications between the host and offload device.
This section discusses the various debugging and performance analysis tools and techniques available to you
for the entire lifecycle of the offload program.
The following tools are available to help with debugging the SYCL* and OpenMP* offload process.
Intercept Layer for OpenCL™ Applications: When using the OpenCL™ backend for SYCL and OpenMP Offload,
this library can be used to debug backend errors and for performance profiling on both the host and device
(it has wider functionality compared with onetrace).
Intel® Distribution for GDB*: Used for source-level debugging of the application, typically to inspect logical
bugs, on the host and any devices you are using (CPU, GPU, FPGA emulation).
Intel® Inspector: Helps locate and debug memory and threading problems, including those that can cause
offloading to fail.
Note: Intel Inspector is included in the Intel oneAPI HPC Toolkit or the Intel oneAPI IoT Toolkit.
In-application debugging: In addition to these tools and runtime-based approaches, the developer can locate
problems using other approaches. For example:
• Comparing kernel output to expected output
• Sending intermediate results back by variables created for debugging purposes
• Printing results from within kernels
Intel® Advisor: Use to ensure Fortran, C, C++, OpenCL™, and SYCL applications realize their full performance
potential on modern processors.
Intel® VTune™ Profiler: Use to gather performance data either on the native system or on a remote system.
Both the OpenMP* and SYCL offload runtimes, as well as Level Zero, OpenCL, and the Shader Compiler, pro-
vide environment variables that help you understand the communication between the host and offload device.
The variables also allow you to discover or control the runtime chosen for offload computations.
There are several environment variables that you can use to understand how OpenMP Offload works and con-
trol which backend it uses.
These variables include LIBOMPTARGET_PLUGIN, which selects the offload backend.
Note: The Level Zero backend is only supported for GPU devices.
Values:
• LEVEL0 or LEVEL_ZERO - uses the Level Zero backend
• OPENCL - uses the OpenCL™ backend
Default:
• For GPU offload devices: LEVEL0
• For CPU or FPGA offload devices: OPENCL
The DPC++ compiler supports all standard SYCL environment variables. The full list is available from GitHub. Of
interest for debugging are the following SYCL environment variables, plus an additional Level Zero environment
variable.
The Level Zero backend provides a few environment variables that can be used to control behavior and aid in
diagnosis.
• Level Zero Specification, core programming guide: https://spec.oneapi.com/level-zero/latest/core/
PROG.html#environment-variables
• Level Zero Specification, tool programming guide: https://spec.oneapi.com/level-zero/latest/tools/
PROG.html#environment-variables
An additional source of debug information comes from the Intel® Graphics Compiler, which is called by the
Level Zero or OpenCL backends (used by both the OpenMP Offload and SYCL/DPC++ Runtimes) at runtime
or during Ahead-of-Time (AOT) compilation. Intel Graphics Compiler creates the appropriate executable code
for the target offload device. The full list of these environment variables can be found at https://github.com/
intel/intel-graphics-compiler/blob/master/documentation/configuration_flags.md. The two that are most of-
ten needed to debug performance issues are:
• IGC_ShaderDumpEnable=1 (default=0) causes all LLVM, assembly, and ISA code generated by the
Intel® Graphics Compiler to be written to /tmp/IntelIGC/<application_name>
• IGC_DumpToCurrentDir=1 (default=0) writes all the files created by IGC_ShaderDumpEnable to your
current directory instead of /tmp/IntelIGC/<application_name>. Since this is potentially a lot of files,
it is recommended to create a temporary directory just for the purpose of holding these files.
If you have a performance issue with your OpenMP offload or SYCL offload application that arises between dif-
ferent versions of Intel® oneAPI, when using different compiler options, when using the debugger, and so on,
then you may be asked to enable IGC_ShaderDumpEnable and provide the resulting files. For more information
on compatibility, see oneAPI Library Compatibility.
In addition to debuggers and diagnostics built into the offload software itself, it can be quite useful to monitor
offload API calls and the data sent through the offload pipeline. For Level Zero, if your application is run as an
argument to the onetrace and ze_tracer tools, they will intercept and report on various aspects of the Level Zero
API calls made by your application. For OpenCL™, you can add a library to LD_LIBRARY_PATH that will intercept
and report on all OpenCL calls, and then use environment variables to control what diagnostic information to
report to a file. You can also use onetrace or cl_tracer to report on various aspects of the OpenCL API calls made
by your application. Once again, your application is run as an argument to the onetrace or cl_tracer tool.
This library collects debugging and performance data when OpenCL is used as the backend to your SYCL
or OpenMP offload program. When OpenCL is used as the backend to your SYCL or OpenMP offload pro-
gram, this tool can help you detect buffer overwrites, memory leaks, mismatched pointers, and can provide more
detailed information about runtime error messages (allowing you to diagnose these issues when either CPU,
FPGA, or GPU devices are used for computation). Note that you will get nothing useful if you use ze_tracer on
a program that uses the OpenCL backend, or the Intercept Layer for OpenCL Applications library and cl_tracer
on a program that uses the Level Zero backend.
Additional resources:
• Extensive information on building and using the Intercept Layer for OpenCL Applications is available from
https://github.com/intel/opencl-intercept-layer.
Note: For best results, run cmake with the following flags: -DENABLE_CLIPROF=TRUE -DENABLE_CLILOADER=TRUE
Like the Intercept Layer for OpenCL™ Applications, these tools collect debugging and performance data from
applications that use the OpenCL and Level Zero offload backends for offload via OpenMP* or SYCL. Note that
Level Zero can only be used as the backend for computations that happen on the GPU (there is no Level Zero
backend for the CPU or FPGA at this time). The onetrace tool is part of the Profiling Tools Interfaces for GPU
(PTI for GPU) project, found at https://github.com/intel/pti-gpu. This project also contains the ze_tracer and
cl_tracer tools, which trace just the activity from the Level Zero or OpenCL offload backends respectively. The
ze_tracer and cl_tracer tools will produce no output if they are used with an application using the other backend,
while onetrace will provide output no matter which offload backend you use.
The onetrace tool is distributed as source. Instructions for how to build the tool are available from https://github.
com/intel/pti-gpu/tree/master/tools/onetrace. The tool provides the following features:
• Call logging: This mode allows you to trace all standard Level Zero (L0) and OpenCL™ API calls along
with their arguments and return values annotated with time stamps. Among other things, this can give you
supplemental information on any failures that occur when a host program tries to make use of an attached
compute device.
• Host and device timing: These provide the duration of all API calls, the duration of each kernel, and appli-
cation runtime for the entire application.
• Device Timeline mode: Gives time stamps for each device activity. All the time stamps are in the same
(CPU) time scale.
• Chrome Call Logging mode: Dumps API calls to JSON format that can be opened in chrome://tracing
browser tool.
These data can help debug offload failures or performance issues.
Additional resources:
• Profiling Tools Interfaces for GPU (PTI for GPU) GitHub project
• Onetrace tool GitHub
The Intel Distribution for GDB* is an application debugger that allows you to inspect and modify the program
state. With the debugger, both the host part of your application and kernels that are offloaded to a device can be
debugged seamlessly in the same debug session. The debugger supports the CPU, GPU, and FPGA-emulation
devices. Major features of the tool include:
• Automatically attaching to the GPU device to listen to debug events
Intel® Inspector is a dynamic memory and threading error checking tool for users developing serial and multi-
threaded applications. It can be used to verify correctness of the native part of the application as well as dynam-
ically generated offload code.
Unlike the tools and techniques above, Intel Inspector cannot be used to catch errors in offload code that is com-
municating with a GPU or an FPGA. Instead, Intel Inspector requires that the SYCL or OpenMP runtime be
configured to execute kernels on a CPU target. In general, this requires defining the following environment
variables prior to an analysis run.
• To configure a SYCL application to run kernels on a CPU device
export SYCL_DEVICE_FILTER=opencl:cpu
• To configure an OpenMP application to run kernels on a CPU device
export OMP_TARGET_OFFLOAD=MANDATORY
export LIBOMPTARGET_DEVICETYPE=cpu
export CL_CONFIG_USE_VTUNE=True
export CL_CONFIG_USE_VECTORIZER=false
Use one of the following commands to start analysis from the command line. You can also start from the Intel
Inspector graphical user interface.
• Memory: inspxe-cl -c mi3 -- <app> [app_args]
Before offload code can run on the device, the machine-independent version of the kernel needs to be com-
piled for the target device, and the resulting code needs to be copied to the device. This can complicate/skew
benchmarking if this kernel setup time is not considered. Just-in-time compilation can also introduce a notice-
able delay when debugging an offload application.
If you have an OpenMP* offload program, setting LIBOMPTARGET_PLUGIN_PROFILE=T[,usec] explicitly reports
the amount of time required to build the offload code “ModuleBuild”, which you can compare to the overall exe-
cution time of your program.
Kernel setup time is more difficult to determine if you have a SYCL* offload program.
• If Level Zero or OpenCL™ is your backend, you can derive kernel setup time from the Device Timing and
Device Timeline returned by onetrace or ze_tracer.
• If OpenCL™ is your backend, you may also be able to derive the information by setting the
BuildLogging, KernelInfoLogging, CallLogging, CallLoggingElapsedTime, HostPerformanceTiming,
HostPerformanceTimeLogging, or ChromeCallLogging flags when using the Intercept Layer for OpenCL
Applications to get similar information. You can also derive kernel setup time from the Device Timing and
Device Timeline returned by onetrace or cl_tracer.
You can also use these tools to supplement the information returned by LIBOMPTARGET_PLUGIN_PROFILE=T.
Understanding when buffers are created, how many buffers are created, and whether they are reused or con-
stantly created and destroyed can be key to optimizing the performance of your offload application. This may
not always be obvious when using a high-level programming language like OpenMP or SYCL, which can hide a
lot of the buffer management from the user.
At a high level, you can track buffer-related activities using the LIBOMPTARGET_DEBUG and SYCL_PI_TRACE envi-
ronment variables when running your program. LIBOMPTARGET_DEBUG gives you more information than
SYCL_PI_TRACE: it reports the addresses and sizes of the buffers created. By contrast, SYCL_PI_TRACE just
reports the API calls, with no information you can easily tie to the location or size of individual buffers.
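For example, an illustrative run combining the two variables (values per the respective runtime documentation):

export SYCL_PI_TRACE=2       # trace all Plugin Interface calls (1 = basic, -1 = everything)
export LIBOMPTARGET_DEBUG=1  # verbose OpenMP offload runtime debug output
./a.out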
At a lower level, if you are using Level Zero or OpenCL™ as your backend, the Call Logging mode of onetrace
or ze_tracer will give you information on all API calls, including their arguments. This can be useful because,
for example, a call for buffer creation (such as zeMemAllocDevice) will give you the size of the resulting buffer
being passed to and from the device. onetrace and ze_tracer also allow you to dump all the Level Zero device-
side activities (including memory transfers) in Device Timeline mode. For each activity, you can get the append
(to command list), submit (to queue), start, and end times.
If you are using OpenCL as your backend, setting the CallLogging, CallLoggingElapsedTime, and
ChromeCallLogging flags when using the Intercept Layer for OpenCL Applications should give you similar in-
formation. The Call Logging mode of onetrace or cl_tracer will give you information on all OpenCL API calls, in-
cluding their arguments. As was the case above, onetrace and cl_tracer also allow you to dump all the OpenCL
device-side activities (including memory transfers) in Device Timeline mode.
Comparing total data transfer time to kernel execution time can be important for determining whether it is prof-
itable to offload a computation to a connected device.
If you have an OpenMP offload program, setting LIBOMPTARGET_PLUGIN_PROFILE=T[,usec] explicitly reports
the amount of time required to build (“DataAlloc”), read (“DataRead”), and write data (“DataWrite”) to the offload
device (although only in aggregate).
Data transfer times can be more difficult to determine if you have a C++ program using SYCL.
• If Level Zero or OpenCL™ is your backend, you can derive total data transfer time from the Device Timing
and Device Timeline returned by onetrace or ze_tracer.
• If OpenCL is your backend, you can use onetrace or cl_tracer, or alternatively you may also
be able to derive the information by setting the BuildLogging, KernelInfoLogging, CallLogging,
CallLoggingElapsedTime, HostPerformanceTiming, HostPerformanceTimeLogging, or
ChromeCallLogging flags when using the Intercept Layer for OpenCL Applications.
On occasion, offload kernels are created and transferred to the device a long time before they actually start ex-
ecuting (usually only after all data required by the kernel has also been transferred, along with control).
You can set a breakpoint in a device kernel using the Intel® Distribution for GDB* and a compatible GPU. From
there, you can query kernel arguments, monitor thread creation and destruction, list the current threads and their
current positions in the code (using “info thread”), and so on.
When an offload program fails to run correctly or produces incorrect results, a relatively quick sanity check is to
run the application on a different runtime (OpenCL™ vs. Level Zero) or compute device (CPU vs. GPU) using
LIBOMPTARGET_PLUGIN and OMP_TARGET_OFFLOAD for OpenMP* applications, and SYCL_DEVICE_FILTER for
SYCL* applications. Errors that reproduce across runtimes mostly eliminate the runtime as the problem.
Errors that reproduce on all available devices mostly eliminate bad hardware as the problem.
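For example, for a SYCL application binary a.out, runs like the following exercise different runtimes and devices:

SYCL_DEVICE_FILTER=level_zero:gpu ./a.out   # Level Zero backend on the GPU
SYCL_DEVICE_FILTER=opencl:gpu ./a.out       # OpenCL backend on the same GPU
SYCL_DEVICE_FILTER=opencl:cpu ./a.out       # OpenCL backend on the CPU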
Offload code has two options for CPU execution: either a “host” implementation, or the CPU version of OpenCL.
A “host” implementation is a truly native implementation of the offloaded code, meaning it can be debugged like
any of the non-offloaded code. The CPU version of OpenCL, while it goes through the OpenCL runtime and
code generation process, eventually ends up as normal parallel code running under a TBB runtime. Again, this
provides a familiar debugging environment with familiar assembly and parallelism mechanisms. Pointers have
meaning through the entire stack, and data can be directly inspected. There are also no memory limits beyond
the usual limits for any operating system process.
Finding and fixing errors in CPU offload execution may solve errors seen in GPU offload execution with less pain,
and without requiring use of a system with an attached GPU or other accelerator.
For OpenMP applications, to get a “host” implementation, remove the “target” or “device” constructs, replacing
them with normal host OpenMP code. If LIBOMPTARGET_PLUGIN=OPENCL and offload to the GPU is disabled,
then the offloaded code runs under the OpenMP runtime with TBB providing parallelism.
For SYCL applications, with SYCL_DEVICE_FILTER=host the “host” device is actually single-threaded, which
may help you determine if threading issues, such as data races and deadlocks, are the source of execution er-
rors. Setting SYCL_DEVICE_FILTER=opencl:cpu uses the CPU OpenCL runtime, which also uses TBB for par-
allelism.
Debug GPU Execution Using Intel® Distribution for GDB* on compatible GPUs
Intel® Distribution for GDB* is extensively documented in Get Started with Intel Distribution for GDB on Linux*
Host | Windows* Host. Useful commands are briefly described in the Intel Distribution for GDB Reference
Sheet. However, since debugging applications with GDB* on a GPU differs slightly from the process on a host
(some commands are used differently and you might see some unfamiliar output), some of those differences
are summarized here.
The Debugging with Intel Distribution for GDB on Linux OS Host Tutorial shows a sample debug session where
we start a debug session of a SYCL program, define a breakpoint inside the kernel, run the program to offload to
the GPU, print the value of a local variable, switch to the SIMD lane 5 of the current thread, and print the variable
again.
As in normal GDB*, for a command <CMD>, use the help <CMD> command of GDB to read the information text
for <CMD>. For example, help info threads lists the options of the info threads command, including:
Options:
-gid
    Show global thread IDs.
The threads of the application can be listed using the debugger. The printed information includes the thread ids
and the locations that the threads are currently stopped at. For the GPU threads, the debugger also prints the
active SIMD lanes.
In the example referenced above, you may see some unfamiliar formatting used when threads are displayed via
the GDB “info threads” command:
Id Target Id Frame
1.1 Thread <id omitted> <frame omitted>
1.2 Thread <id omitted> <frame omitted>
* 2.1:1 Thread 1073741824 <frame> at array-transform.cpp:61
2.1:[3 5 7] Thread 1073741824 <frame> at array-transform.cpp:61
2.2:[1 3 5 7] Thread 1073741888 <frame> at array-transform.cpp:61
2.3:[1 3 5 7] Thread 1073742080 <frame> at array-transform.cpp:61
Here, GDB is displaying the threads with the following format: <inferior_number>.<thread_number>:<SIMD
Lane/s>
So, for example, the thread id “2.3:[1 3 5 7]” refers to SIMD lanes 1, 3, 5, and 7 of thread 3 running on inferior
2.
An “inferior” in the GDB terminology is the process that is being debugged. In the debug session of a program
that offloads to the GPU, there will typically be two inferiors; one “native” inferior representing a host part of the
program (inferior 1 above), and another “remote” inferior representing the GPU device (inferior 2 above). Intel
Distribution for GDB automatically creates the GPU inferior - no extra steps are required.
When you print the value of an expression, the expression is evaluated in the context of the current thread’s
current SIMD lane. You can switch the thread as well as the SIMD lane to change the context using the “thread”
command, such as “thread 3:4”, “thread :6”, or “thread 7”. The first command switches to thread 3 and
SIMD lane 4. The second command switches to SIMD lane 6 within the current thread. The third command
switches to thread 7. The default lane selected will either be the previously selected lane, if it is active, or
the first active lane within the thread.
The “thread apply” command may be similarly broad or focused (which can make it easier to limit the output
from, for example, a command to inspect a variable). For more details and examples about debugging with SIMD
lanes, see the Debugging with Intel Distribution for GDB on Linux OS Host Tutorial.
More information about threads and inferiors in GDB can be found from https://sourceware.org/
gdb/current/onlinedocs/gdb/Threads.html and https://sourceware.org/gdb/current/onlinedocs/gdb/
Inferiors-Connections-and-Programs.html#Inferiors-Connections-and-Programs.
By default, when a thread hits a breakpoint, the debugger stops all the threads before displaying the breakpoint
hit event to the user. This is the all-stop mode of GDB. In the non-stop mode, the stop event of a thread is dis-
played while the other threads run freely.
In all-stop mode, when a thread is resumed (for example, to resume normally with the continue command, or
for stepping with the next command), all the other threads are also resumed. If you have some breakpoints set
in threaded applications, this can quickly get confusing, as the next thread that hits the breakpoint may not be
the thread you are following.
You can control this behavior using the set scheduler-locking command to prevent resuming other threads
when the current thread is resumed. This is useful to avoid intervention of other threads while only the current
thread executes instructions. Type help set scheduler-locking for the available options, and see https://
sourceware.org/gdb/current/onlinedocs/gdb/Thread-Stops.html for more information. Note that SIMD lanes
cannot be resumed individually; they are resumed together with their underlying thread.
In non-stop mode, by default, only the current thread is resumed. To resume all threads, pass the “-a” flag to the
continue command.
Commands for inspecting the program state are typically executed in the context of the current thread’s current
SIMD lane. Sometimes it is desired to inspect a value in multiple contexts. For such needs, the thread apply
command can be used. For instance, the following executes the print element command for the SIMD lanes
3-5 of Thread 2.5:
Similarly, the following runs the same command in the context of SIMD lane 3, 5, and 6 of the current thread:
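The commands themselves are not reproduced in this copy; based on the thread-ID format described above, they would look roughly like:

(gdb) thread apply 2.5:3-5 print element
(gdb) thread apply :3 :5 :6 print element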
To stop inside the kernel that is offloaded to the GPU, simply define a breakpoint at a source line inside the ker-
nel. When a GPU thread hits that source line, the debugger stops the execution and shows the breakpoint hit.
To single-step a thread over a source line, use the step or next commands. The step command steps into
functions, while next steps over calls. Before stepping, we recommend setting scheduler-locking step to
prevent intervention from other threads.
Building a SYCL Executable for Use with Intel® Distribution for GDB*
Much like when you want to debug a host application, you need to set some additional flags to create a binary
that can be debugged on the GPU. See Get Started with Intel Distribution for GDB on Linux* Host for details.
For a smooth debug experience when using the just-in-time (JIT) compilation flow, enable debug information
emission from the compiler via the -g flag, and disable optimizations via the -O0 flag for both a host and JIT-
compiled kernel of the application. The flags for the kernel are taken during link time. For example:
• Compile your program using: icpx -fsycl -g -O0 -c myprogram.cpp
• Link your program using: icpx -fsycl -g -O0 myprogram.o
If you are using CMake to configure the build of your program, use the Debug type for the CMAKE_BUILD_-
TYPE, and append -O0 to the CMAKE_CXX_FLAGS_DEBUG variable. For example: set (CMAKE_CXX_FLAGS_DEBUG
"${CMAKE_CXX_FLAGS_DEBUG} -O0")
Applications that are built for debugging may take a little longer to start up than when built with the usual “release”
level of optimization. Thus, your program may appear to run a little more slowly when started in the debugger.
If this causes problems, developers of larger applications may want to use ahead-of-time (AOT) compilation to
compile the offload code when their program is built, rather than JIT-compiling it when the program is run
(warning: this may also take longer to build when using -g -O0). For more information, see Compilation Flow
Overview.
When doing ahead-of-time compilation for GPU, you must use a device type that fits your target device. Run the
following command to see the available GPU device options on your current machine: ocloc compile --help
Additionally, the debug mode for the kernel must be enabled. The following example AoT compilation command
targets the KBL device:
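The command itself is not reproduced in this copy; an illustrative AOT compile for a kbl target (exact backend options may differ by compiler release) is:

icpx -fsycl -g -O0 -fsycl-targets=spir64_gen -Xsycl-target-backend "-device kbl" myprogram.cpp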
Building an OpenMP* Executable for use with Intel® Distribution for GDB*
Compile and link your program using the -g -O0 flags. For example:
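An illustrative command, mirroring the SYCL flags above (the file name is a placeholder):

icpx -fiopenmp -fopenmp-targets=spir64 -g -O0 myprogram.cpp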
Set the following environment variables to disable optimizations and enable debug info for the kernel:
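The variables themselves are not reproduced in this copy. To the best of our knowledge they are the backend compilation-option variables shown below; treat these as an assumption and check the current Get Started guide:

export LIBOMPTARGET_OPENCL_COMPILATION_OPTIONS="-g -cl-opt-disable"   # OpenCL backend (assumed)
export LIBOMPTARGET_LEVEL0_COMPILATION_OPTIONS="-g -cl-opt-disable"   # Level Zero backend (assumed)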
A common issue with offload programs is that they may fail to run at all, instead giving a generic OpenCL™ error
with little additional information. The Intercept Layer for OpenCL Applications, along with onetrace, ze_tracer,
and cl_tracer, can be used to get more information about these errors, often helping the developer identify the
source of the problem.
Using this library, in particular the BuildLogging, ErrorLogging, and USMChecking=1 options, you can often
find the source of the error.
1. Create a clintercept.conf file in the home directory with the following content:
SimpleDumpProgramSource=1
CallLogging=1
LogToFile=1
//KernelNameHashTracking=1
BuildLogging=1
ErrorLogging=1
USMChecking=1
//ContextCallbackLogging=1
// Profiling knobs
KernelInfoLogging=1
DevicePerformanceTiming=1
DevicePerformanceTimeLWSTracking=1
DevicePerformanceTimeGWSTracking=1
2. Run the application and review the generated log. It contains entries such as:
...
<<<< clSetKernelArgMemPointerINTEL -> CL_SUCCESS
>>>> clGetKernelInfo( _ZTSZZ10outer_coreiP5mesh_i16dpct_type_1c0e3516dpct_type_60257cS2_S2_S2_S2_S2_S2_S2_S2_fS2_S2_S2_S2_iENKUlRN2cl4sycl7handlerEE197->45clES6_EUlNS4_7nd_itemILi3EEEE225->13 ): param_name = CL_KERNEL_CONTEXT (1193)
In this example, the following values help with debugging the error:
• ZTSZZ10outer_coreiP5mesh
• index = 3, value = 0x41995e0
Using this data, you can identify which kernel had the problems, what argument was problematic, and why.
Similar to Intercept Layer for OpenCL Applications, the onetrace, ze_tracer and cl_tracer tools can help find the
source of errors detected by the Level Zero and OpenCL™ runtimes.
To use the onetrace or ze_tracer tools to root-cause Level Zero issues (cl_tracer would be used the same way
to root-cause OpenCL issues):
1. Use Call Logging mode to run the application. Redirecting the tool output to a file is optional, but recom-
mended.
The command for ze_tracer is the same - just substitute “ze_tracer” for “onetrace”.
2. Review the call trace to figure out the error (log.txt). For example:
Level Zero Matrix Multiplication (matrix size: 1024 x 1024, repeats 4 times)
Target device: Intel® Graphics [0x3ea5]
...
>>>> [104131109] zeKernelCreate: hModule = 0x55af5f39ca10 desc = 0x7ffe289c7f80 {29 0 0 GEMM} phKernel = 0x7ffe289c7e48 (hKernel = 0)
<<<< [104158819] zeKernelCreate [27710 ns] hKernel = 0x55af5f3ca600 -> ZE_RESULT_SUCCESS (0)
...
>>>> [104345820] zeKernelSetGroupSize: hKernel = 0x55af5f3ca600 groupSizeX = 256 groupSizeY = 1 groupSizeZ = 1
<<<< [104360082] zeKernelSetGroupSize [14262 ns] -> ZE_RESULT_SUCCESS (0)
>>>> [104373679] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 0 argSize = 8 pArgValue = 0x7ffe289c7e50
<<<< [104389443] zeKernelSetArgumentValue [15764 ns] -> ZE_RESULT_SUCCESS (0)
>>>> [104402448] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 1 argSize = 8 pArgValue = 0x7ffe289c7e68
<<<< [104415871] zeKernelSetArgumentValue [13423 ns] -> ZE_RESULT_ERROR_INVALID_ARGUMENT (2013265924)
>>>> [104428764] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 2 argSize = 8 pArgValue = 0x7ffe289c7e60
Correctness
Offload code is often used for kernels that can efficiently process large amounts of information on the attached
compute device, or that generate large amounts of information from some input parameters. If these kernels run
without crashing, you may not learn that they are producing incorrect results until much later in program
execution.
In these cases, it can be difficult to identify which kernel is producing incorrect results. One technique for finding
the offending kernel is to run the program twice, once using a purely host-based implementation and once using
an offload implementation, capturing the inputs and outputs of every kernel (often to individual files). Then
compare the results to see which kernel call produces unexpected output, allowing for a certain epsilon: the
offload hardware may have a different order of operations or native precision that causes the results to differ
from the host code in the last digit or two.
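For example, a minimal comparison helper along these lines (the epsilon value and the I/O around it are up to you) can flag the first kernel output that diverges:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Compare host-computed and device-computed outputs of one kernel call,
// allowing small differences in the last digit or two.
bool outputs_match(const std::vector<float> &host,
                   const std::vector<float> &device,
                   float epsilon = 1e-5f) {
  if (host.size() != device.size())
    return false;
  for (size_t i = 0; i < host.size(); ++i) {
    // Scale the tolerance so large and small magnitudes are treated fairly.
    float tol = epsilon * std::max(std::fabs(host[i]), std::fabs(device[i]));
    if (std::fabs(host[i] - device[i]) > std::max(tol, epsilon)) {
      std::printf("Mismatch at element %zu: host=%g device=%g\n",
                  i, host[i], device[i]);
      return false;
    }
  }
  return true;
}

int main() {
  std::vector<float> host{1.0f, 2.0f}, device{1.0f, 2.0000001f};
  return outputs_match(host, device) ? 0 : 1;
}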
Once you know which kernel is producing incorrect results, and you are working with a compatible GPU, use
Intel Distribution for GDB to determine the reason. See the Debugging with Intel Distribution for GDB on Linux
OS Host Tutorial for basic information and links to more detailed documentation.
Both SYCL and OpenMP* also allow the use of standard language print mechanisms (printf for SYCL and
C/C++ OpenMP offload; print *, ... for Fortran OpenMP offload) within offloaded kernels, which you can use
to verify correct operation while they run. Print the thread and SIMD lane the output is coming from, and con-
sider adding synchronization mechanisms to ensure the printed information is in a consistent state when printed.
Examples of how to do this in SYCL using the stream class can be found in the Intel oneAPI GPU Optimization
Guide. A similar approach to the one described for SYCL can be used for OpenMP offload.
Tip: Using printf can be verbose in SYCL kernels. To simplify, add the following macro:

#ifdef __SYCL_DEVICE_ONLY__
#define CL_CONSTANT __attribute__((opencl_constant))
#else
#define CL_CONSTANT
#endif

#define PRINTF(format, ...) { \
  static const CL_CONSTANT char _format[] = format; \
  sycl::ext::oneapi::experimental::printf(_format, ##__VA_ARGS__); }
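A short usage sketch (assuming the macro above and a compiler that provides the <sycl/sycl.hpp> header; the range and format string are illustrative):

#include <sycl/sycl.hpp>
// ... CL_CONSTANT and PRINTF defined as in the tip above ...

int main() {
  sycl::queue q;
  q.parallel_for(sycl::range<1>{8}, [=](sycl::id<1> i) {
    PRINTF("hello from work-item %d\n", static_cast<int>(i[0]));
  }).wait();
  return 0;
}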
Failures
Just-in-time (JIT) compilation failures that occur at runtime due to incorrect use of the SYCL or OpenMP* of-
fload languages will cause your program to exit with an error.
In the case of SYCL, if you cannot find these errors using ahead-of-time compilation of your SYCL code, then
selecting the OpenCL backend, setting SimpleDumpProgramSource and BuildLogging, and using the Intercept
Layer for OpenCL Applications may help identify the kernel with the syntax error.
Logic errors can also result in crashes or error messages during execution. Such issues can include:
• Passing a buffer that belongs to the wrong context to a kernel
• Passing the “this” pointer to a kernel rather than a class element
• Passing a host buffer rather than a device buffer
• Passing an uninitialized pointer, even if it is not used in the kernel
Using the Intel® Distribution for GDB* (or even the native GDB), if you watch carefully, you can record the ad-
dresses of all contexts created and verify that the address being passed to an offload kernel belongs to the cor-
rect context. Likewise, you can verify that the address of a variable passed matches that of the variable itself,
and not its containing class.
It may be easier to track buffers and addresses using the Intercept Layer for OpenCL™ Applications or onetrace/
cl_tracer, choosing the appropriate backend. When using the OpenCL backend, setting CallLogging,
BuildLogging, ErrorLogging, and USMChecking and running your program should produce output that ex-
plains what error in your code caused the generic OpenCL error to be produced.
Using onetrace or ze_tracer’s Call Logging or Device Timeline should give additional enhanced error informa-
tion to help you better understand the source of generic errors from the Level Zero backend. This can help
locate many of the logic errors mentioned above.
If the code gives an error when offloading to a device using the Level Zero backend, try the OpenCL
backend. If the program then works, report a bug against the Level Zero backend. If the error reproduces with the
OpenCL backend on the device, try the OpenCL CPU backend. For OpenMP offload, select the CPU by setting
LIBOMPTARGET_DEVICETYPE to CPU; for SYCL, set SYCL_DEVICE_FILTER=opencl:cpu. Debugging with
everything on the CPU can be easier, and it removes complications caused by data copies and translation of the
program to a non-CPU device.
As an example of a logic issue that can get you in trouble, consider what is captured by the lambda function used
to implement the parallel_for in this SYCL code snippet.
class MyClass {
private:
  int *data;
  int factor;
  // ...
  void run() {
    // ...
    auto data2 = data;      // local copy of this->data
    auto factor2 = factor;  // local copy of this->factor
    {
      // Sketch of the truncated command group (q, range, b, LEN, and kernel
      // are assumed to be declared elsewhere): the [=] lambda below uses
      // "data" and "factor", which are really "this->data" and "this->factor".
      q.submit([&](sycl::handler &h) {
        h.parallel_for(range, [=](sycl::nd_item<3> item_ct1) {
          kernel(data, b, factor, LEN, item_ct1); // captures "this"
        });
      });
    }
  }
};
In the above code snippet, the program crashes because [=] captures by value all variables used inside the
lambda. It may not be obvious that “factor” is really “this->factor” and “data” is really “this->data”,
so it is “this” that gets captured for the uses of “data” and “factor” above. OpenCL or Level Zero will crash
with an illegal-arguments error in the “kernel(data, b, factor, LEN, item_ct1)” call.
The fix is the use of the local variables auto data2 and auto factor2. “auto factor2 = factor” becomes
“int factor2 = this->factor”, so using factor2 inside the lambda with [=] captures an “int”. The inner
section would be rewritten as “kernel(data2, b, factor2, LEN, item_ct1);”.
Note: This issue is commonly seen when migrating CUDA* kernels. You can also resolve the issue by keeping
the same CUDA kernel launch signature and placing the command group and lambda inside the kernel itself.
Using the Intercept Layer for OpenCL™ Applications, onetrace, or ze_tracer, you would see that the kernel was
called with two identical addresses, and the extended error information would tell you that you are trying to copy
a non-trivial data structure to the offload device.
Note that if you are using unified shared memory (USM) and “MyClass” itself is allocated in USM, the above code
will work. However, if only “data” is allocated in USM, the program will crash for the reason given above.
In this example, note that you can also re-declare the variables in local scope with the same names, so that you
do not need to change anything in the kernel call, as sketched below.
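A minimal sketch of that approach, reusing the hypothetical MyClass from above:

void MyClass::run() {
  // Same-named locals shadow the members, so [=] captures these locals
  // by value instead of capturing "this".
  int *data = this->data;
  int factor = this->factor;
  // ... submit the command group; the kernel call can remain
  // "kernel(data, b, factor, LEN, item_ct1);" unchanged ...
}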
Intel® Inspector can also help diagnose these sorts of failures. If you set the following environment variables and
then run a Memory Error Analysis on offload code using the CPU device, Intel Inspector will flag many of the above
issues:
• OpenMP*
– export LIBOMPTARGET_DEVICETYPE=CPU
– export OMP_TARGET_OFFLOAD=MANDATORY
– export LIBOMPTARGET_PLUGIN=OPENCL
• SYCL
– export SYCL_DEVICE_FILTER=opencl:cpu
– Or initialize your queue with a CPU selector to force use of the OpenCL CPU device: cl::sycl::queue Queue(cl::sycl::cpu_selector{});
• Both
– export CL_CONFIG_USE_VTUNE=True
– export CL_CONFIG_USE_VECTORIZER=false
Note: A crash can occur when optimizations are turned on during the compilation process. If turning off opti-
mizations makes your crash disappear, you can still debug with optimizations enabled by combining -g with the
desired optimization level (for example, -g -O2). For more information, see the Intel oneAPI DPC++/C++
Compiler Developer Guide and Reference.
Transferring any data to or from an offload device is relatively expensive, requiring memory allocations in user
space, system calls, and interfacing with hardware controllers. Unified shared memory (USM) adds to these
costs by requiring that some background process keep memory being modified on either the host or the offload
device in sync. Furthermore, kernels on the offload device cannot run until all the input and output buffers they
need are set up and ready to use.
Much of this overhead is fixed per transfer, no matter how much data you move in a single operation. Thus, it is
much more efficient to transfer 10 numbers in bulk rather than one at a time. Still, every data transfer is
expensive, so minimizing the total number of transfers is also very important.
If, for example, you have some constants that are needed by multiple kernels, or during multiple invocations of
the same kernel, transfer them to the offload device once and reuse them rather than sending them with every
kernel invocation (see the sketch after this paragraph). Finally, as might be expected, a single large data transfer
takes more time than a single small one.
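As an illustrative sketch (names and sizes are hypothetical), constants copied to the device once with USM can be reused across several kernel launches instead of being re-sent each time:

#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t N = 1024;
  sycl::queue q;

  // One bulk transfer of the constants, done once up front...
  std::vector<float> host_consts(N, 0.5f);
  float *dev_consts = sycl::malloc_device<float>(N, q);
  q.memcpy(dev_consts, host_consts.data(), N * sizeof(float)).wait();

  float *out = sycl::malloc_shared<float>(N, q);

  // ...then reused by every kernel invocation instead of being re-sent
  // with each launch.
  for (int iter = 1; iter <= 4; ++iter) {
    q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
      out[i] = dev_consts[i] * iter;
    }).wait();
  }

  sycl::free(dev_consts, q);
  sycl::free(out, q);
  return 0;
}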
The number and size of the buffers sent is only part of the equation. Once the data is at the offload device,
consider how long the resulting kernel executes. If the kernel runs for less time than it takes to transfer its data,
offloading may not be worthwhile at all, unless the same operation on the host would take longer than the
combined kernel execution and data transfer time.
Finally, consider how long the offload device sits idle between the execution of one kernel and the next. A long
wait could be due to data transfer, or just to the nature of the algorithm on the host. If the former, it may be
worthwhile to overlap data transfer and kernel execution, if possible.
In short, the interplay of code executing on the host, code executing on the offload device, and data transfers is
quite complex. The order and duration of these operations is not something you can determine by intuition, even
for the simplest code. Use tools like those described below to get a visual representation of these activities, and
use that information to optimize your offload code.
Intel® VTune™ Profiler
In addition to giving you detailed performance information on the host, VTune can also provide detailed infor-
mation about performance on a connected GPU. Setup information for GPUs is available in the Intel VTune
Profiler User Guide.
Intel VTune Profiler’s GPU Offload view gives you an overview of the hotspots on the GPU, including the amount
of time spent for data transfer to and from each kernel. The GPU Compute/Media Hotspots view allows you to
dive more deeply into what is happening to your kernels on the GPU, such as by using the Dynamic Instruction
Count to view a micro analysis of the GPU kernel performance. With these profiling modes, you can observe
how data transfer and compute occur over time, determine if there is enough work for a kernel to run effectively,
learn how your kernels use the GPU memory hierarchy, and so on.
Additional details about these analysis types are available in the Intel VTune Profiler User Guide. A detailed
look at optimizing for GPU using VTune Profiler is available from the Optimize Applications for Intel GPUs with
Intel VTune Profiler page.
You can also use Intel VTune Profiler to capture kernel execution time. The following commands provide light-
weight profiling results:
• Collect
– Level Zero backend: vtune -collect-with runss -knob enable-gpu-level-zero=true -finalization-mode=none -app-working-dir <app_working_dir> -- <app>
Intel® Advisor
Intel® Advisor provides two features that can help you get improved performance when offloading computation
to a GPU:
• Offload Modeling can watch your host OpenMP* program and recommend parts of it that would be prof-
itably offloaded to the GPU. It also allows you to model a variety of different target GPUs, so that you can
learn if offload will be profitable on some but not others. Offload Advisor gives detailed information on
what factors may bound offload performance.
• GPU Roofline analysis can watch your application when it runs on the GPU, and graphically show how well
each kernel is making use of the memory subsystem and compute units on the GPU. This can let you know
how well your kernel is optimized for the GPU.
To run these modes on an application that already does some offload, you need to set up your environment to
use the OpenCL™ device on the CPU for analysis. Instructions are available in the Intel Advisor User Guide.
Offload Modeling does not require that you have already modified your application to use a GPU; it can work
entirely on host code.
Resources:
• Intel Advisor Cookbook: GPU Offload
• Get Started with Offload Modeling
• Get Started with GPU Roofline
If you do not want to use Intel® VTune™ Profiler to understand when data is being copied to the GPU and when
kernels run, onetrace, ze_tracer, cl_tracer, and the Intercept Layer for OpenCL™ Applications give you a way to
observe this information (although, if you want a graphical timeline, you will need to write a script to visualize the
output). For more information, see oneAPI Debug Tools, Trace the Offload Process, and Debug the Offload
Process.
6.4 Performance Tuning Cycle
6.4.1 Establish a Baseline
Establish a baseline that includes a metric, such as elapsed time, time in a compute kernel, or floating-point op-
erations per second, that can be used to measure the performance improvement and that provides a means to
verify the correctness of the results.
A simple method is to employ the chrono library routines in C++, placing timer calls before and after the workload
executes.
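For instance, a minimal sketch using std::chrono (the workload function is a placeholder):

#include <chrono>
#include <iostream>

// Placeholder for the workload being measured.
void run_workload() { /* ... offload kernels, compute region ... */ }

int main() {
  auto start = std::chrono::steady_clock::now();
  run_workload();
  auto stop = std::chrono::steady_clock::now();
  std::chrono::duration<double> elapsed = stop - start;
  std::cout << "Elapsed time: " << elapsed.count() << " s\n";
  return 0;
}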
6.4.2 Identify Kernels to Offload
To best utilize the compute cycles available on the devices of a heterogeneous platform, it is important to identify
the tasks that are compute intensive and that can benefit from parallel execution. An application may execute
solely on a CPU even though some of its tasks are suitable for execution on a GPU. This can be determined
using the Offload Modeling perspective of Intel® Advisor.
Intel Advisor estimates performance characterizations of the workload as it may execute on an accelerator. It
consumes the information from profiling the workload and provides performance estimates, speedup, bottle-
neck characterization, and offload data transfer estimates and recommendations.
Typically, kernels with high compute, a large dataset, and limited memory transfers are best suited for offload to
a device.
See Get Started: Identify High-impact Opportunities to Offload to GPU for quick steps to ramp up with the Of-
fload Modeling perspective. For more resources about modeling performance of your application on GPU plat-
forms, see Offload Modeling Resources for Intel® Advisor Users.
6.4.3 Offload Kernels
After identifying kernels that are suitable for offload, employ SYCL* or OpenMP* to offload each kernel onto the
device. Consult the previous chapters as an information resource.
6.4.4 Optimize
oneAPI enables functional code that can execute on multiple accelerators; however, the code may not be the
most optimal across the accelerators. A three-step optimization strategy is recommended to meet performance
needs:
1. Pursue general optimizations that apply across accelerators.
2. Optimize aggressively for the prioritized accelerators.
3. Optimize the host code in conjunction with steps 1 and 2.
Optimization is a process of eliminating bottlenecks, i.e., the sections of code that take more execution time
relative to other sections. These sections may execute on the devices or on the host. During optimization,
employ a profiling tool, such as Intel® VTune™ Profiler, to find the bottlenecks in the code.
This section discusses the first step of the strategy - Pursue general optimizations that apply across accelera-
tors. Device specific optimizations and best practices for specific devices (step 2) and optimizations between
the host and devices (step 3) are detailed in device-specific optimization guides, such as the FPGA Optimiza-
tion Guide for Intel® oneAPI Toolkits. This section assumes that the kernel to offload to the accelerator is already
determined. It also assumes that work will be accomplished on one accelerator. This guide does not speak to
division of work between host and accelerator or between host and potentially multiple and/or different accel-
erators.
General optimizations that apply across accelerators can be classified into four categories:
1. High-level optimizations
2. Loop-related optimizations
3. Memory-related optimizations
4. SYCL-specific optimizations
The following sections summarize these optimizations only; specific details on how to code most of these opti-
mizations can be found online or in commonly available code optimization literature. More detail is provided for
the SYCL-specific optimizations.
High-level Optimizations
• Increase the amount of parallel work. More work than the number of processing elements is desired to
help keep the processing elements more fully utilized.
• Minimize the code size of kernels. This helps keep the kernels in the instruction cache of the accelerator,
if the accelerator contains one.
• Load balance kernels. Avoid significantly different execution times between kernels as the long-running
kernels may become bottlenecks and affect the throughput of the other kernels.
• Avoid expensive functions. Avoid calling functions that have high execution times as they may become
bottlenecks.
Loop-related Optimizations
• Prefer well-structured, well-formed, and simple exit condition loops – these are loops that have a single
exit and a single condition when comparing against an integer bound.
• Prefer loops with linear indexes and constant bounds – these are loops that employ an integer index into
an array, for example, and have bounds that are known at compile-time.
• Declare variables in deepest scope possible. Doing so can help reduce memory or stack usage.
• Minimize or relax loop-carried data dependencies. Loop-carried dependencies can limit parallelization.
Remove dependencies if possible. If not, pursue techniques to maximize the distance between the de-
pendency and/or keep the dependency in local memory.
• Unroll loops with pragma unroll, as in the sketch after this list.
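For instance, a sketch of a fixed-trip-count loop unrolled by an illustrative factor of 4:

// Dot product of two 16-element vectors; the compiler is asked to
// unroll the loop by a factor of 4.
float dot16(const float *a, const float *b) {
  float sum = 0.0f;
  #pragma unroll 4
  for (int i = 0; i < 16; ++i)
    sum += a[i] * b[i];
  return sum;
}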
Memory-related Optimizations
• When possible, favor greater computation over greater memory use. The latency and bandwidth of mem-
ory compared to computation can become a bottleneck.
• When possible, favor greater local and private memory use over global memory use.
• Avoid pointer aliasing.
• Coalesce memory accesses. Grouping memory accesses helps limit the number of individual memory
requests and increases utilization of individual cache lines.
• When possible, store variables and arrays in private memory for high-execution areas of code.
• Beware of loop unrolling effects on concurrent memory accesses.
• Avoid a write to a global that another kernel reads. Use a pipe instead.
• Consider applying the [[intel::kernel_args_restrict]] attribute to a kernel (see the sketch after this
list). The attribute allows the compiler to ignore dependencies between accessor arguments in the kernel.
In turn, ignoring accessor-argument dependencies allows the compiler to perform more aggressive
optimizations and potentially improve the performance of the kernel.
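A minimal sketch (buffer names and sizes are illustrative) applying the attribute to a kernel lambda:

#include <sycl/sycl.hpp>

// Copy "in" to "out"; the attribute promises the two accessors do not
// alias, allowing the compiler to optimize more aggressively.
void copy_data(sycl::queue &q, sycl::buffer<int, 1> &in, sycl::buffer<int, 1> &out) {
  q.submit([&](sycl::handler &h) {
    sycl::accessor src{in, h, sycl::read_only};
    sycl::accessor dst{out, h, sycl::write_only};
    h.parallel_for(in.get_range(), [=](sycl::id<1> i) [[intel::kernel_args_restrict]] {
      dst[i] = src[i];
    });
  });
}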
SYCL-specific Optimizations
6.4.5 Recompile, Run, and Verify
Once the code is optimized, it is important to measure the performance. The questions to be answered include:
• Did the metric improve?
• Is the performance goal met?
• Are there any more compute cycles left that can be used?
Confirm that the results are correct. If you are comparing numerical results, the numbers may vary depending on
how the compiler optimized the code or on the modifications made to the code. Are any differences acceptable?
If not, go back to the optimization step.
6.5 oneAPI Library Compatibility
• New Intel oneAPI device drivers, oneAPI dynamic libraries, and oneAPI compilers will not break previously
deployed applications built with oneAPI tools. Current APIs will not be removed or modified without notice
and an iteration of the major version.
• Developers of oneAPI applications should ensure that the header files and libraries have the same release
version. For example, an application should not use 2021.2 Intel® oneAPI Math Kernel Library header files
with 2021.1 Intel oneAPI Math Kernel Library.
• New dynamic libraries provided with the Intel compilers will work with applications built by older versions
of the compilers (this is commonly referred to as backward compatibility). However, the converse is not
true: newer versions of the oneAPI dynamic libraries may contain routines that are not available in earlier
versions of the library.
• Older dynamic libraries provided with the oneAPI Intel compilers will not work with newer versions of the
oneAPI compilers.
Developers of oneAPI applications should conduct thorough application testing to ensure that a oneAPI
application is deployed with compatible oneAPI libraries.
7.0 Glossary
7.1 Accelerator
Specialized component containing compute resources that can quickly execute a subset of operations. Exam-
ples include CPU, FPGA, GPU.
See also: Device
7.2 Accessor
Communicates the desired location (host, device) and mode (read, write) of access.
7.4 Buffers
Memory object that communicates the type and number of items of that type to be made available to the device
for computation.
7.8 Device
An accelerator or specialized component containing compute resources that can quickly execute a subset of
operations. A CPU can be employed as a device, but when it is, it is being employed as an accelerator. Examples
include CPU, FPGA, GPU.
See also: Accelerator
7.10 DPC++
An open source project that adds SYCL* support to the LLVM C++ compiler.
7.14 Host
A CPU-based system (computer) that executes the primary portion of a program, specifically the application
scope and command group scope.
7.16 Images
Formatted opaque memory object that is accessed via built-in functions. Typically pertains to pictures composed
of pixels stored in a format like RGB.
7.18 ND-range
Short for N-Dimensional Range, a group of kernel instances, or work-items, across one, two, or three dimensions.
7.21 SPIR-V
Binary intermediate language for representing graphical-shader stages and compute kernels.
7.22 SYCL
A standard for a cross-platform abstraction layer that enables code for heterogeneous processors to be written
using standard ISO C++ with the host and kernel code for an application contained in the same source file.
7.23 Work-groups
Collection of work-items that execute on a compute unit.
7.24 Work-item
Basic unit of computation in the oneAPI programming model. It is associated with a kernel which executes on
the processing element.
Unless stated otherwise, the code examples in this document are provided to you under an MIT license, the
terms of which are as follows:
Copyright 2022 Intel Corporation
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
documentation files (the “Software”), to deal in the Software without restriction, including without limitation the
rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to per-
mit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of
the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PAR-
TICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF
CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFT-
WARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Use this guide to learn about:
• Introduction to oneAPI Programming: A basic overview of oneAPI, Intel oneAPI Toolkits, and related re-
sources
• oneAPI Programming Model: An introduction to the oneAPI programming model for SYCL* and
OpenMP* offload for C, C++, and Fortran
• oneAPI Development Environment Setup: Instructions on how to set up the oneAPI application devel-
opment environment
• Compile and Run oneAPI Programs: Details about how to compile code for various accelerators (CPU,
FPGA, etc.)