Arm CPP Compiler Reference en
Version 20.0
Reference Guide
Copyright © 2018, 2019 Arm Limited or its affiliates. All rights reserved.
101458_2000_00_en
Arm® C/C++ Compiler
Document History
Your access to the information in this document is conditional upon your acceptance that you will not use or permit others to use
the information for the purposes of determining whether implementations infringe any third party patents.
THIS DOCUMENT IS PROVIDED “AS IS”. ARM PROVIDES NO REPRESENTATIONS AND NO WARRANTIES,
EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE
WITH RESPECT TO THE DOCUMENT. For the avoidance of doubt, Arm makes no representation with respect to, and has
undertaken no analysis to identify or understand the scope and content of, third party patents, copyrights, trade secrets, or other
rights.
TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL ARM BE LIABLE FOR ANY DAMAGES,
INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR
CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING
OUT OF ANY USE OF THIS DOCUMENT, EVEN IF ARM HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
DAMAGES.
This document consists solely of commercial items. You shall be responsible for ensuring that any use, duplication or disclosure of
this document complies fully with any relevant export laws and regulations to assure that this document or any portion thereof is
not exported, directly or indirectly, in violation of such export laws. Use of the word “partner” in reference to Arm’s customers is
not intended to create or refer to any partnership relationship with any other company. Arm may make changes to this document at
any time and without notice.
If any of the provisions contained in these terms conflict with any of the provisions of any click through or signed written
agreement covering this document with Arm, then the click through or signed written agreement prevails over and supersedes the
conflicting provisions of these terms. This document may be translated into other languages for convenience, and you agree that if
there is any conflict between the English version of this document and any translation, the terms of the English version of the
Agreement shall prevail.
The Arm corporate logo and words marked with ® or ™ are registered trademarks or trademarks of Arm Limited (or its
subsidiaries) in the US and/or elsewhere. All rights reserved. Other brands and names mentioned in this document may be the
trademarks of their respective owners. Please follow Arm’s trademark usage guidelines at http://www.arm.com/company/policies/
trademarks.
Copyright © 2018, 2019 Arm Limited (or its affiliates). All rights reserved.
101458_2000_00_en Copyright © 2018, 2019 Arm Limited or its affiliates. All rights 2
reserved.
Non-Confidential
Arm® C/C++ Compiler
LES-PRE-20349
Confidentiality Status
This document is Non-Confidential. The right to use, copy and disclose this document may be subject to license restrictions in
accordance with the terms of the agreement entered into by Arm and the party that Arm delivered this document to.
Contents
Arm® C/C++ Compiler Reference Guide
Preface
About this book
3.3 Optimizing C/C++ code with Arm SIMD (Neon™)
3.4 Optimizing C/C++ code with SVE
3.5 Writing inline SVE assembly
Preface
About this book
Glossary
The Arm® Glossary is a list of terms used in Arm documentation, together with definitions for those
terms. The Arm Glossary does not contain terms that are industry standard unless the Arm meaning
differs from the generally accepted meaning.
See the Arm® Glossary for more information.
Typographic conventions
italic
Introduces special terminology, denotes cross-references, and citations.
bold
Highlights interface elements, such as menu names. Denotes signal names. Also used for terms
in descriptive lists, where appropriate.
monospace
Denotes text that you can enter at the keyboard, such as commands, file and program names,
and source code.
monospace (underlined)
Denotes a permitted abbreviation for a command or option. You can enter the underlined text
instead of the full command or option name.
monospace italic
Denotes arguments to monospace text where the argument is to be replaced by a specific value.
monospace bold
Denotes language keywords when used outside example code.
< and >
Encloses replaceable terms for assembler syntax where they appear in code or code fragments.
For example:
MRC p15, 0, <Rd>, <CRn>, <CRm>, <Opcode_2>
SMALL CAPITALS
Used in body text for a few terms that have specific technical meanings, that are defined in the
Arm® Glossary. For example, IMPLEMENTATION DEFINED, IMPLEMENTATION SPECIFIC, UNKNOWN, and
UNPREDICTABLE.
Feedback
Feedback on content
If you have comments on content then send an e-mail to errata@arm.com. Give:
• The title Arm C/C++ Compiler Reference Guide.
• The number 101458_2000_00_en.
• If applicable, the page number(s) to which your comments refer.
• A concise explanation of your comments.
Arm also welcomes general suggestions for additions and improvements.
Note
Arm tests the PDF only in Adobe Acrobat and Acrobat Reader, and cannot guarantee the quality of the
represented document when used with any other PDF reader.
Other information
• Arm® Developer.
• Arm® Information Center.
• Arm® Technical Support Knowledge Articles.
• Technical Support.
• Arm® Glossary.
Chapter 1
Get started
Arm C/C++ Compiler is an auto-vectorizing compiler for the 64-bit Armv8-A architecture. This getting
started tutorial shows you how to install, compile C/C++ code, use different optimization levels, and
generate an executable.
The Arm C/C++ Compiler tool chain for the 64-bit Armv8-A architecture enables you to compile C/C++
code for Armv8-A compatible platforms, with an advanced auto-vectorizer capable of taking advantage
of SIMD features.
It contains the following sections:
• 1.1 Get started with Arm® C/C++ Compiler.
• 1.2 Using the compiler.
• 1.3 Generate annotated assembly code from C and C++ code.
• 1.4 Compile C/C++ code for Arm SVE and SVE2 architectures.
• 1.5 Get help.
1.1 Get started with Arm® C/C++ Compiler
Prerequisites
• Install Arm Compiler for Linux. For information about installing Arm Compiler for Linux, see Install
Arm Compiler for Linux.
Procedure
1. Load the environment module for Arm Compiler for Linux:
a. As part of the installation, your system administrator must make the Arm Compiler for Linux
environment modules available. To see which environment modules are available, run:
module avail
Note
Depending on the configuration of Environment Modules on your system, you might need to
configure the MODULEPATH environment variable to include the installation directory:
export MODULEPATH=$MODULEPATH:/opt/arm/modulefiles/
If you chose to install Arm Compiler for Linux to a custom location, replace /opt/arm/ with the
path to your installation.
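b. Load the appropriate module for your system:
module load <architecture>/<linux_variant>/<linux_version>/suites/arm-linux-compiler/<version>
For example: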
module load Generic-AArch64/SUSE/12/suites/arm-linux-compiler/20.0
c. Check your environment. Examine the PATH variable. PATH must contain the
appropriate bin directory from /opt/arm, as installed in the previous section:
echo $PATH
/opt/arm/arm-linux-compiler-20.0_Generic-AArch64_SUSE-12_aarch64-linux/bin:...
Note
To automatically load the Arm Compiler for Linux every time you log into your Linux terminal, add
the module load command for your system and product version to your .profile file.
2. Create a “Hello World” program and save it in a file, for example: hello.c.
/* Hello World */
#include <stdio.h>
int main()
{
    printf("Hello World");
    return 0;
}
3. To generate an executable binary, compile your program with Arm C/C++ Compiler, passing the input
file, hello.c, and using the -o option to specify the binary name, hello:
armclang -o hello hello.c
Next Steps
For more information about compiling and linking as separate steps, and how optimization levels affect
auto-vectorization, see Using the compiler.
Related references
Chapter 3 Coding best practice
Chapter 2 Compiler options
1.2 Using the compiler
You can also specify multiple source files on a single line. Each source file is compiled individually and
then linked into a single executable binary. For example:
armclang -o example1 example1a.c example1b.c
To compile each of your source files individually into an object file, specify the -c (compile-only)
option, and then pass the resulting object files into another invocation of armclang to link them into an
executable binary.
armclang -c -o example1a.o example1a.c
armclang -c -o example1b.o example1b.c
armclang -o example1 example1a.o example1b.o
The optimization option can also be specified when generating an object file:
armclang -O3 -c -o example1a.o example1a.c
armclang -O3 -c -o example1b.o example1b.c
The -mcpu=native option enables the compiler to automatically detect the architecture and processor
type of the CPU you are running the compiler on, and to optimize accordingly.
This option supports a range of Armv8-A based SoCs, including ThunderX2 and Neoverse N1.
Note
The optimization performed according to the auto-detected architecture and processor is independent of
the optimization level denoted by the -O<level> option.
-S
Outputs assembly code, rather than object code. Produces a text .s file containing annotated
assembly code.
-c
Performs the compilation step, but does not perform the link step. Produces an ELF object .o
file. To later link object files into an executable binary, run armclang again, passing in the
object files.
-o <file>
Specifies the name of the output file.
-march=<arg>
Targets an architecture profile, generating generic code that runs on any processor of that
architecture. For example -march=armv8-a, -march=armv8-a+sve, or -march=armv8-a+sve2.
-mcpu=native
Enables the compiler to automatically detect the CPU you are running the compiler on and
optimize accordingly. This supports a range of Armv8-A-based System-on-Chips (SoCs),
including ThunderX2 and Neoverse N1.
-O<level>
Specifies the level of optimization to use when compiling source files. The default is -O0.
--config /path/to/<config-file>.cfg
Passes the location of a configuration file to the compile command. Use a configuration file to
specify a set of compile options to be run at compile time. The configuration file can be passed
at compile time, or an environment variable can be set for it to be used for every invocation of
the compiler. For more information about creating and using a configuration file, see Configure
Arm Compiler for Linux.
--help
Describes the most common options supported by Arm C/C++ Compiler. To see more detailed
descriptions of all the options, use man armclang.
--version
Displays the version number and some other basic information about the compiler.
1.3 Generate annotated assembly code from C and C++ code
Prerequisites
• Install Arm Compiler for Linux. For information about installing Arm Compiler for Linux, see Install
Arm Compiler for Linux.
• Load the module for Arm Compiler for Linux. Run:
module load <architecture>/<linux_variant>/<linux_version>/suites/arm-linux-compiler/<version>
Procedure
1. Compile your source and specify an assembly code output:
armclang -O<level> -S -o <assembly-filename>.s <source-filename>.c
This example compiles an example application source into assembly code without auto-vectorization,
then re-compiles it with auto-vectorization enabled. You can compare the assembly code to see the effect
the auto-vectorization has.
The following C application subtracts corresponding elements in two arrays, writing the result to a third
array. The three pointer parameters of subtract_arrays are declared using the restrict keyword,
indicating to the compiler that the arrays they point to do not overlap in memory.
// example1.c
#define ARRAYSIZE 1024
int a[ARRAYSIZE];
int b[ARRAYSIZE];
int c[ARRAYSIZE];
void subtract_arrays(int *restrict a, int *restrict b, int *restrict c)
{
    for (int i = 0; i < ARRAYSIZE; i++)
    {
        a[i] = b[i] - c[i];
    }
}
int main()
{
    subtract_arrays(a, b, c);
}
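To see why the restrict qualifier matters for this example, consider the following sketch. It is not part of the original example: the function names subtract_may_alias and overlapping_result are hypothetical. Without restrict, the compiler must allow for calls in which the arrays overlap, and in that case each iteration can depend on the result of the previous one.

```c
#include <stddef.h>

/* Hypothetical variant of subtract_arrays without restrict: the compiler
 * must assume that a, b, and c might overlap in memory. */
void subtract_may_alias(int *a, const int *b, const int *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        a[i] = b[i] - c[i];
    }
}

/* Calls the function with a == b + 1, so a[i] aliases b[i + 1] and each
 * iteration consumes the value written by the previous iteration. */
int overlapping_result(void)
{
    int data[4] = {10, 0, 0, 0};
    const int ones[3] = {1, 1, 1};

    subtract_may_alias(data + 1, data, ones, 3);
    return data[3]; /* 10 - 1 - 1 - 1 */
}
```

Here overlapping_result returns 7, because the subtraction ripples through the array one element at a time. Making the same overlapping call to a function whose parameters are restrict-qualified is undefined behavior in C, which is what allows the compiler to treat the iterations as independent.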
1. Compile the example source without auto-vectorization (-O1) and specify an assembly code
output (-S):
armclang -O1 -S -o example1.s example1.c
The output assembly code is saved as example1.s. The section of the generated assembly language
file that contains the compiled subtract_arrays function is as follows:
subtract_arrays: // @subtract_arrays
// BB#0:
mov x8, xzr
.LBB0_1: // =>This Inner Loop Header: Depth=1
ldr w9, [x1, x8]
ldr w10, [x2, x8]
sub w9, w9, w10
str w9, [x0, x8]
add x8, x8, #4 // =4
cmp x8, #1, lsl #12 // =4096
b.ne .LBB0_1
// BB#2:
ret
This code shows that the compiler has not performed any vectorization, because we specified the -O1
(low optimization) option. Array elements are iterated over one at a time. Each array element is a
32-bit (4-byte) integer, so the byte offset in x8 increments by 4 on each iteration. The loop stops
when the offset reaches the end of the array (1024 iterations * 4 bytes = 4096 bytes).
2. Recompile the application with auto-vectorization enabled (-O2):
armclang -O2 -S -o example1.s example1.c
The output assembly code is saved as example1.s. The section of the generated assembly language
file that contains the compiled subtract_arrays function is as follows:
subtract_arrays: // @subtract_arrays
// BB#0:
mov x8, xzr
add x9, x0, #16 // =16
.LBB0_1: // =>This Inner Loop Header: Depth=1
add x10, x1, x8
add x11, x2, x8
ldp q0, q1, [x10]
ldp q2, q3, [x11]
add x10, x9, x8
add x8, x8, #32 // =32
cmp x8, #1, lsl #12 // =4096
sub v0.4s, v0.4s, v2.4s
sub v1.4s, v1.4s, v3.4s
stp q0, q1, [x10, #-16]
b.ne .LBB0_1
// BB#2:
ret
This time, the compiler has vectorized the loop using SIMD (Single Instruction Multiple Data)
instructions and registers. The LDP instructions load array values into the 128-bit wide Q
registers. Each vector instruction operates on four array elements at a time, and the code uses two
sets of Q registers to operate on eight array elements in each iteration. Consequently, each loop
iteration moves through the array by 32 bytes (2 sets * 4 elements * 4 bytes) at a time.
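The difference between the two loops can be modeled in portable C. The following is a sketch only: it does not use Neon instructions, the function names and the LANES and GROUPS constants are hypothetical, and int32_t is used so that the 4-byte element size is explicit. Each function returns the number of loop iterations it performed.

```c
#include <stddef.h>
#include <stdint.h>

#define ARRAYSIZE 1024
#define LANES 4  /* 32-bit elements held by one 128-bit Q register */
#define GROUPS 2 /* sets of Q registers processed per loop iteration */

/* Model of the -O1 loop: the byte offset (x8 in the listing) advances by
 * 4 bytes, that is one element, per iteration until it reaches 4096. */
size_t subtract_scalar_model(int32_t *a, const int32_t *b, const int32_t *c)
{
    size_t iterations = 0;
    for (size_t offset = 0; offset != ARRAYSIZE * sizeof(int32_t);
         offset += sizeof(int32_t))
    {
        size_t i = offset / sizeof(int32_t);
        a[i] = b[i] - c[i];
        iterations++;
    }
    return iterations;
}

/* Model of the -O2 loop: the byte offset advances by 32 bytes per
 * iteration, covering 2 groups of 4 elements each time. */
size_t subtract_simd_model(int32_t *a, const int32_t *b, const int32_t *c)
{
    const size_t step = GROUPS * LANES * sizeof(int32_t); /* 32 bytes */
    size_t iterations = 0;
    for (size_t offset = 0; offset != ARRAYSIZE * sizeof(int32_t);
         offset += step)
    {
        size_t base = offset / sizeof(int32_t);
        for (size_t g = 0; g < GROUPS; g++)
        {
            for (size_t lane = 0; lane < LANES; lane++)
            {
                size_t i = base + g * LANES + lane;
                a[i] = b[i] - c[i];
            }
        }
        iterations++;
    }
    return iterations;
}
```

For the 1024-element arrays, the scalar model runs 1024 iterations and the SIMD model runs 128, which matches the change in stride from 4 bytes to 32 bytes in the two assembly listings.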
1.4 Compile C/C++ code for Arm SVE and SVE2 architectures
Arm C/C++ Compiler supports compiling for Scalable Vector Extension (SVE) and Scalable Vector
Extension version two (SVE2)-enabled target architectures.
SVE and SVE2 support enables you to:
• Assemble source code containing SVE and SVE2 instructions.
• Disassemble ELF object files containing SVE and SVE2 instructions.
• Compile C and C++ code for SVE and SVE2-enabled targets, with an advanced auto-vectorizer that
is capable of taking advantage of the SVE and SVE2 features.
This tutorial shows you how to compile code to take advantage of SVE (or SVE2) functionality. The
executable that is generated during the tutorial can only be run on SVE-enabled (or SVE2-enabled)
hardware, or with Arm Instruction Emulator.
Prerequisites
• Install Arm Compiler for Linux. For information about installing Arm Compiler for Linux, see Install
Arm Compiler for Linux.
• Load the module for Arm Compiler for Linux. Run:
module load <architecture>/<linux_variant>/<linux_version>/suites/arm-linux-compiler/<version>
Procedure
1. Compile your SVE or SVE2 source and specify an SVE-enabled (or SVE2-enabled) architecture:
• To compile without linking to Arm Performance Libraries, set -march to the architecture and
feature set you want to target:
For SVE:
armclang -O<level> -march=armv8-a+sve -o <binary-filename> <source-filename>.c
For SVE2:
armclang -O<level> -march=armv8-a+sve2 -o <binary-filename> <source-filename>.c
• To compile and link to the SVE version of Arm Performance Libraries, set -march to the
architecture and feature set you want to target and add the -armpl=sve option to your command
line:
For SVE:
armclang -O<level> -march=armv8-a+sve -armpl=sve -o <binary-filename> <source-filename>.c
For SVE2:
armclang -O<level> -march=armv8-a+sve2 -armpl=sve -o <binary-filename> <source-filename>.c
For more information about the supported options for -armpl, see the -armpl description in
Linker options.
Several optional SVE2 extensions are available: sve2-aes, sve2-bitperm, sve2-sha3, and sve2-sm4.
Each extension is enabled using the -march compiler option. For a full list of supported -march
options, see Optimization options.
Note
sve2 also enables sve.
This example compiles an example application source into assembly with auto-vectorization enabled.
The following C program subtracts corresponding elements in two arrays and writes the result to a third
array. The three pointer parameters of subtract_arrays are declared using the restrict keyword,
telling the compiler that the arrays they point to do not overlap in memory.
// example1.c
#define ARRAYSIZE 1024
int a[ARRAYSIZE];
int b[ARRAYSIZE];
int c[ARRAYSIZE];
void subtract_arrays(int *restrict a, int *restrict b, int *restrict c)
{
    for (int i = 0; i < ARRAYSIZE; i++)
    {
        a[i] = b[i] - c[i];
    }
}
int main()
{
    subtract_arrays(a, b, c);
}
SVE instructions operate on the z and p register banks. In this example, the inner loop is almost
entirely composed of SVE instructions. The auto-vectorizer has converted the scalar loop from the
original C source code into a vector loop that is independent of the width of the SVE vector registers.
3. Run the executable:
./example1
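The idea of a vector loop that is independent of the vector register width can be modeled in portable C. The following is a sketch under stated assumptions: it does not use SVE instructions or intrinsics, the function name subtract_vla_model is hypothetical, and the lanes parameter stands in for the number of 32-bit elements in one vector, which on real hardware is fixed by the implementation rather than passed as an argument.

```c
#include <stddef.h>

/* Scalar model of a predicated, vector-length-agnostic loop.
 * Returns the number of vector iterations, or 0 if lanes is 0. */
size_t subtract_vla_model(int *a, const int *b, const int *c,
                          size_t n, size_t lanes)
{
    size_t iterations = 0;

    if (lanes == 0)
    {
        return 0;
    }
    for (size_t base = 0; base < n; base += lanes)
    {
        for (size_t lane = 0; lane < lanes; lane++)
        {
            /* Predicate: a lane is active only while its index is in
             * range, similar in spirit to the SVE WHILELT instruction. */
            if (base + lane < n)
            {
                a[base + lane] = b[base + lane] - c[base + lane];
            }
        }
        iterations++;
    }
    return iterations;
}
```

The same function gives the same results for any non-zero value of lanes; only the iteration count changes. For example, ten elements take three iterations with four lanes and one iteration with sixteen lanes, with the predicate masking off the lanes that fall past the end of the array.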
Related information
Porting and Optimizing HPC Applications for Arm SVE
1.5 Get help
To list the most common compiler options in your terminal, use:
armclang --help
or
man armclang
Chapter 2
Compiler options
Command-line options supported by armclang and armclang++ within Arm C/C++ Compiler.
The supported options are also available in the man pages in the tool. To view them, use:
man armclang
Note
For simplicity, we have shown usage with armclang. The options can also be used with armclang++,
unless otherwise stated.
2.1 Actions
Options that control what action to perform on the input.
Option Description
-E Only run the preprocessor.
Usage
armclang -E
-S Only run the preprocess and compile steps. The preprocess step is not run on files that do not need it.
Usage
armclang -S
-c Only run the preprocess, compile, and assemble steps. The preprocess step is not run on files that do not need it.
Usage
armclang -c
-fsyntax-only Only check the input for syntax and semantic errors. No output files are generated.
Usage
armclang -fsyntax-only
2.2 File options
Option Description
--config Passes the location of a configuration file to the compile command.
Use a configuration file to specify a set of compile options to be run at compile time. The configuration file can
be passed at compile time, or you can set an environment variable for it to be used for every invocation of the
compiler. For more information about creating and using a configuration file, see Configure Arm Compiler for
Linux.
Usage
armclang --config /path/to/this/<filename>.cfg
2.3 Basic driver options
Option Description
--gcc-toolchain=<arg> Use the gcc toolchain at the given directory.
Usage
armclang --gcc-toolchain=<arg>
--help-hidden Display hidden options. Only use these options if advised to do so by your Arm representative.
Usage
armclang --help-hidden
--version, --vsn Show the version number and some other basic information about the compiler.
Usage
armclang --version
armclang --vsn
2.4 Optimization options
Option Description
-O0 Minimum optimization for the performance of the compiled binary. Turns off most
optimizations. When debugging is enabled, this option generates code that directly
corresponds to the source code. Therefore, this might result in a significantly larger image.
This is the default optimization level.
Usage
armclang -O0
-O1 Restricted optimization. When debugging is enabled, this option gives the best debug view
for the trade-off between image size, performance, and debug.
Usage
armclang -O1
-O2 High optimization. When debugging is enabled, the debug view might be less satisfactory
because the mapping of object code to source code is not always clear. The compiler might
perform optimizations that cannot be described by debug information.
Usage
armclang -O2
-O3 Very high optimization. When debugging is enabled, this option typically gives a poor
debug view. Arm recommends debugging at lower optimization levels.
Usage
armclang -O3
-Ofast Enable all the optimizations from level 3, including those performed with the
-ffp-mode=fast armclang option.
This level also performs other aggressive optimizations that might violate strict compliance
with language standards.
Usage
armclang -Ofast
-ffinite-math-only Enable optimizations that ignore the possibility of NaN and +/-Inf.
Usage
armclang -ffinite-math-only
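As an illustration of the kind of code that -ffinite-math-only can affect, consider the following sketch. The function name sum_skipping_nan is hypothetical. The function relies on the IEEE 754 property that a NaN value never compares equal to itself.

```c
#include <stddef.h>

/* Sums the elements of x, skipping NaN entries. The comparison v == v is
 * false only when v is NaN. */
double sum_skipping_nan(const double *x, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
    {
        double v = x[i];
        if (v == v)
        {
            total += v;
        }
    }
    return total;
}
```

When compiled without -ffinite-math-only, the NaN entries are skipped. With the option enabled, the compiler is permitted to assume that v == v is always true and might remove the check, so code that depends on detecting NaN or infinity values should be reviewed before you enable this option.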
-ffp-contract={fast|on|off} Controls when the compiler is permitted to form fused floating-point operations (such as
FMAs).
fast: Always (default).
on: Only in the presence of the FP_CONTRACT pragma.
off: Never.
Usage
armclang -ffp-contract={fast|on|off}
-fsimdmath, -fno-simdmath Enable (-fsimdmath) or disable (-fno-simdmath) the vectorized libm library (libamath)
to help the vectorization of loops containing calls to libm.
For more information, see https://developer.arm.com/products/software-development-tools/hpc/documentation/vector-math-routines.
Default is -fno-simdmath.
Usage
armclang -fsimdmath
armclang -fno-simdmath
-fstrict-aliasing Tells the compiler to adhere to the aliasing rules defined in the source language.
In some circumstances, this flag allows the compiler to assume that pointers to different
types do not alias. Enabled by default when using -Ofast.
Usage
armclang -fstrict-aliasing
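The following sketch shows one way to write code that stays correct when the compiler assumes that pointers to different types do not alias. The function name float_bits is hypothetical, and the example assumes that float is a 32-bit IEEE 754 type, which the static assertion checks only in part (size, not format).

```c
#include <stdint.h>
#include <string.h>

/* Assumption: float and uint32_t have the same size on the target. */
_Static_assert(sizeof(float) == sizeof(uint32_t),
               "float is expected to be 32 bits wide");

/* Returns the bit pattern of a float. memcpy is used instead of
 * *(uint32_t *)&value, because reading a float object through a uint32_t
 * pointer breaks the aliasing rules that -fstrict-aliasing relies on. */
uint32_t float_bits(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

Compilers typically optimize a fixed-size memcpy like this one into a plain register move, so the well-defined form usually carries no run-time cost.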
-funsafe-math-optimizations, -fno-unsafe-math-optimizations Enables reassociation and reciprocal math optimizations, and honors neither trapping nor signed zeros.
Usage
armclang -funsafe-math-optimizations (enable)
armclang -fno-unsafe-math-optimizations (disable)
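Reassociation matters because floating-point addition is not associative. The following sketch, with hypothetical function names, evaluates the same three-term sum in two different orders.

```c
/* Adds the terms from left to right: (a + b) + c. */
double sum_left_to_right(double a, double b, double c)
{
    return (a + b) + c;
}

/* Adds the last two terms first: a + (b + c). */
double sum_right_to_left(double a, double b, double c)
{
    return a + (b + c);
}
```

With IEEE 754 double precision and reassociation disabled, calling both functions with the arguments 1e20, -1e20, and 1.0 returns 1.0 from sum_left_to_right and 0.0 from sum_right_to_left, because 1.0 is too small to change -1e20 when the two are added first. When reassociation is enabled, the compiler is free to choose either order, so results that depend on the order of evaluation can change.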
-mcpu=<arg> Select which CPU architecture to optimize for. Choose from:
• native: Auto-detect the CPU architecture from the build computer.
• neoverse-n1: Optimize for Neoverse N1-based computers.
• thunderx2t99: Optimize for Cavium ThunderX2-based computers.
• generic: Generates portable output suitable for any Armv8-A computer.
Usage
armclang -mcpu=<arg>
-march=<arg> Specifies the base architecture and extensions available on the target.
-march=<arg> where <arg> is constructed as name[+[no]feature+…]:
name
armv8-a: Armv8-A application architecture profile.
armv8.1-a: Armv8.1 application architecture profile.
armv8.2-a: Armv8.2 application architecture profile.
feature
Is the name of an optional architectural feature that can be explicitly enabled with
+feature and disabled with +nofeature.
For AArch64, the following features can be specified:
• crc - Enable CRC extension. On by default for -march=armv8.1-a or
higher.
• crypto - Enable Cryptographic extension.
• fullfp16 - Enable FP16 extension.
• lse - Enable Large System Extension instructions. On by default for
-march=armv8.1-a or higher.
• sve - Scalable Vector Extension (SVE). This feature also enables fullfp16.
See Scalable Vector Extension for more information.
• sve2 - Scalable Vector Extension version two (SVE2). This feature also
enables sve. See Arm A64 Instruction Set Architecture for SVE and SVE2
instructions.
• sve2-aes - SVE2 Cryptographic extension. This feature also enables sve2.
• sve2-bitperm - SVE2 bit permutation extension. This feature also enables
sve2.
• sve2-sha3 - SVE2 Cryptographic Extension. This feature also enables sve2.
• sve2-sm4 - SVE2 Cryptographic Extension. This feature also enables sve2.
Note
When enabling either the sve2 or sve features, to link to the SVE-enabled
version of Arm Performance Libraries, you must also include the -armpl=sve
option. For more information about the supported options for -armpl, see the
-armpl description.
Usage
armclang -march=<arg>
Examples
armclang -march=armv8-a
armclang -march=armv8-a+sve
armclang -march=armv8-a+sve2
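Source code can check which features the -march option enabled by testing predefined macros. The following sketch assumes that the compiler defines the feature-test macros from the Arm C Language Extensions (ACLE): __ARM_NEON, __ARM_FEATURE_SVE, and __ARM_FEATURE_SVE2. The function name target_vector_extension is hypothetical.

```c
/* Reports the most capable vector extension that the compiler was told to
 * target. Which macros are defined depends on the -march (or -mcpu)
 * option in effect; on a non-Arm target, none of them are defined. */
const char *target_vector_extension(void)
{
#if defined(__ARM_FEATURE_SVE2)
    return "SVE2";
#elif defined(__ARM_FEATURE_SVE)
    return "SVE";
#elif defined(__ARM_NEON)
    return "Neon";
#else
    return "none";
#endif
}
```

The checks run from most to least capable because enabling sve2 also enables sve. A similar pattern can be used to select between hand-written code paths at compile time.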
2.5 Workload compilation options
Option Description
-std=<arg>, --std=<arg> Language standard to compile for. The list of valid standards depends on the input language.
Passing an unrecognized value generates an error message that lists the valid choices.
Usage
armclang -std=<arg>
armclang --std=<arg>
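To confirm which C language standard a translation unit is compiled for, you can inspect the standard __STDC_VERSION__ macro. The following is a minimal sketch; the function name c_standard_version is hypothetical.

```c
/* Returns the value of __STDC_VERSION__, or 0 when the macro is not
 * defined (for example, when compiling in C90 mode). */
long c_standard_version(void)
{
#ifdef __STDC_VERSION__
    return __STDC_VERSION__;
#else
    return 0L;
#endif
}
```

The C standards define the value as 199901L for C99 and 201112L for C11, so compiling with -std=c99 or -std=c11 is expected to change the value that this function returns.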
2.6 Development options
Option Description
-g, -g0 (default), -gline-tables-only Control the generation of source-level debug information:
• -g enables debug generation.
• -g0 disables generation of debug and is the default setting.
• -gline-tables-only enables DWARF line information for location tracking only (not for
variable tracking).
Note
If more than one of these options is specified on the command line, the last option specified
overrides any before it.
Usage
armclang -g
Or
armclang -g0
Or
armclang -gline-tables-only
2.7 Warning options
Option Description
-Warm-extensions Enable warnings about the use of non-standard language features supported by Arm Compiler.
Usage
armclang -Warm-extensions
2.8 Pre-processor options
Option Description
-D <macro>=<value> Define <macro> to <value> (or 1 if <value> is omitted).
Usage
armclang -D<macro>=<value>
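A common use of -D is to override a default that the source code provides. The following sketch assumes two macros chosen for illustration: ARRAYSIZE, with a default of 1024, and a hypothetical DEBUG_CHECKS switch. The function names are also hypothetical.

```c
/* ARRAYSIZE can be overridden on the command line, for example with
 * -DARRAYSIZE=2048. Otherwise the default below applies. */
#ifndef ARRAYSIZE
#define ARRAYSIZE 1024
#endif

/* Returns the array size that was selected at compile time. */
int array_size(void)
{
    return ARRAYSIZE;
}

/* DEBUG_CHECKS is a hypothetical switch. Passing -DDEBUG_CHECKS with no
 * value defines it as 1. */
int debug_checks_enabled(void)
{
#ifdef DEBUG_CHECKS
    return DEBUG_CHECKS;
#else
    return 0;
#endif
}
```

When compiled without any -D options, array_size returns 1024 and debug_checks_enabled returns 0. The #ifndef guard matters: without it, defining ARRAYSIZE on the command line and again in the source would trigger a macro redefinition diagnostic.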
2.9 Linker options
Option Description
-Wl,<arg> Pass the comma-separated arguments in <arg> to the linker.
Usage
armclang -Wl,<arg>,<arg2>...
-armpl Instructs the compiler to load the optimum version of Arm Performance Libraries for your target architecture and
implementation. This option also enables optimized versions of the C mathematical functions declared in the
math.h library, tuned scalar and vector implementations of Fortran math intrinsics, and auto-vectorization of
mathematical functions (disable this using -fno-simdmath).
Supported arguments are:
• sve: Use the SVE library from Arm Performance Libraries.
Note
-armpl=sve,<arg2>,<arg3> should be used in combination with -march=armv8-a+sve.
-l<library> Search for the library named <library> when linking.
-larmflang At link time, include this option to use the default Fortran libarmflang runtime library for both serial and parallel
(OpenMP) Fortran workloads.
Note
• This option is set by default when linking using armflang.
• You need to explicitly include this option if you are linking with armclang instead of armflang at link
time.
• This option only applies to link time operations.
Usage
armclang -larmflang
See notes in description.
-larmflang-nomp At link time, use this option to avoid linking against the OpenMP Fortran runtime library.
Note
• Enabled by default when compiling and linking using armflang with the -fno-openmp option.
• You need to explicitly include this option if you are linking with armclang instead of armflang at link
time.
• Should not be used when your code has been compiled with the -lomp or -fopenmp options.
• Use this option with care. When using this option, do not link to any OpenMP-utilizing Fortran runtime
libraries in your code.
• This option only applies to link time operations.
Usage
armclang -larmflang-nomp
See notes in description.
To link serial or parallel Fortran workloads using armclang instead of armflang, include the
-larmflang option to link with the default Fortran runtime library for serial and parallel Fortran
workloads. You also need to pass any options required to link using the required mathematical routines
for your code.
To statically link, in addition to passing -larmflang and the mathematical routine options, you also need
to pass:
• -static
• -lomp
• -lrt
To link serial or parallel Fortran workloads using armclang instead of armflang, without linking against
the OpenMP runtime libraries, pass -larmflang-nomp instead at link time. For example, pass:
• -larmflang-nomp
• Any mathematical routine options, for example: -lm or -lamath.
Again, to statically link, in addition to -larmflang-nomp and the mathematical routine options, you also
need to pass:
• -static
• -lrt
Warning
• Do not link against any OpenMP-utilizing Fortran runtime libraries when using this option.
• All locking and thread-local storage will be disabled.
• Arm does not recommend using the -larmflang-nomp option for typical workloads. Use this option
with caution.
Note
The -lompstub option (for linking against libompstub) might still be needed if you have imported
omp_lib in your Fortran code but not compiled with -fopenmp.
Chapter 3
Coding best practice
Discusses best practices when writing C/C++ code for Arm C/C++ Compiler.
If you encounter a problem when developing your application and compiling with the Arm C/C++
Compiler, see the troubleshooting topics on the Arm Developer website.
If you have problems and would like to contact our support team, get in touch:
Contact Arm Support
It contains the following sections:
• 3.1 Coding best practice for auto-vectorization.
• 3.2 Use pragmas to control auto-vectorization.
• 3.3 Optimizing C/C++ code with Arm SIMD (Neon™).
• 3.4 Optimizing C/C++ code with SVE.
• 3.5 Writing inline SVE assembly.
3.1 Coding best practice for auto-vectorization
Use restrict
Use the restrict keyword if appropriate when using C/C++ code. The C99 restrict keyword (or the
non-standard C/C++ __restrict__ keyword) indicates to the compiler that a specified pointer does not
alias with any other pointers, for the lifetime of that pointer. restrict allows the compiler to vectorize
loops more aggressively because it becomes possible to prove that loop iterations are independent and
can be executed in parallel.
Note
C code might use either the restrict or __restrict__ keywords. C++ code must use the
__restrict__ keyword.
If the restrict keywords are used incorrectly (that is, if another pointer is used to access the same
memory) then the behavior is undefined. It is possible that the results of optimized code will differ from
that of its unoptimized equivalent.
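As a minimal sketch of the guidance above, the following function qualifies both pointers with __restrict__, which the note above says is accepted in both C and C++. The function name scale_add and its parameters are illustrative assumptions and do not come from the compiler documentation.

```c
#include <stddef.h>

/* Illustrative helper (hypothetical name): dst[i] += k * src[i].
 * __restrict__ is a promise by the caller that dst and src never overlap,
 * so the compiler may treat the loop iterations as independent when it
 * considers vectorizing the loop. */
void scale_add(float *__restrict__ dst, const float *__restrict__ src,
               float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        dst[i] += k * src[i];
    }
}
```

Calling scale_add with overlapping ranges, for example scale_add(buf + 1, buf, k, n), would break that promise and result in undefined behavior.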
Use pragmas
The compiler supports pragmas. Use pragmas to explicitly indicate that loop iterations are completely
independent from each other.
For more information, see Use pragmas to control auto-vectorization.
3.2 Use pragmas to control auto-vectorization
The #pragma clang loop vectorize(assume_safety) pragma indicates to the compiler that the following loop contains no data dependencies
between loop iterations that would prevent vectorization. The compiler might be able to use this
information to vectorize a loop, where it would not typically be possible.
Note
The vectorize pragma does not guarantee auto-vectorization. There might be other reasons why auto-
vectorization is not possible or worthwhile for a particular loop.
Warning
Ensure that you only use this pragma when it is safe to do so. Using the vectorize pragma when there
are data dependencies between loop iterations might result in incorrect behavior.
For example, consider the following loop, which processes an array named indices. Each element in indices
specifies the index into a larger histogram array. The referenced element in the histogram array is
incremented.
void update(int *restrict histogram, int *restrict indices, int count)
{
    for (int i = 0; i < count; i++)
    {
        histogram[ indices[i] ]++;
    }
}
The compiler is unable to vectorize this loop, because the same index could appear more than once in the
indices array. Therefore, a vectorized version of the algorithm would lose some of the increment
operations if two identical indices are processed in the same vector load/increment/store sequence.
However, if you know that the indices array only ever contains unique elements, then it is useful to be
able to force the compiler to vectorize this loop. This is accomplished by placing the vectorize pragma
before the loop:
void update_unique(int *restrict histogram, int *restrict indices, int count)
{
    #pragma clang loop vectorize(assume_safety)
    for (int i = 0; i < count; i++)
    {
        histogram[ indices[i] ]++;
    }
}
You can also suppress SVE instructions while allowing Arm Neon instructions by adding a
vectorize_style hint:
vectorize_style(fixed_width)
Prefer fixed-width vectorization, resulting in Arm Neon instructions. For a loop with
vectorize_style(fixed_width), the compiler prefers to generate Arm Neon instructions,
though SVE instructions might still be used with a fixed-width predicate (such as gather loads or
scatter stores).
vectorize_style(scaled_width) (default)
Unrolling
Unrolling a scalar loop, for example:
for (int i = 0; i < 64; i++) {
    data[i] = input[i] * other[i];
}
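Unrolled by a factor of two, that loop is equivalent to the hand-written sketch below. The function name multiply_unrolled and the int element type are assumptions made so that the fragment compiles on its own. The compiler performs this transformation internally, so the sketch only illustrates the resulting shape of the loop.

```c
/* Hand-unrolled by a factor of two (hypothetical function name): each trip
 * of the loop performs two iterations of the original scalar loop. The
 * trip count (64) is even, so no remainder iteration is needed. */
void multiply_unrolled(int *data, const int *input, const int *other)
{
    for (int i = 0; i < 64; i += 2)
    {
        data[i] = input[i] * other[i];
        data[i + 1] = input[i + 1] * other[i + 1];
    }
}
```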
For the example above, the unrolling factor (UF) is two. To unroll to the internal limit, the unroll
pragma is inserted before the loop:
#pragma clang loop unroll(enable)
Interleaving
To interleave, an Interleaving Factor (IF) is used instead of a UF. To generate interleaved code
accurately, the loop vectorizer models the cost in terms of register pressure and generated code size. When a
loop is vectorized, the interleaved code can perform better than unrolled code.
Like the UF, the IF can be the internal limit or a user-defined integer. To interleave to the internal limit,
the interleave pragma is inserted before the loop:
#pragma clang loop interleave(enable)
Note
Interleaving performed on a scalar loop will not unroll the loop correctly.
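A user-defined interleaving factor might be requested as in the following sketch. It assumes that Arm C/C++ Compiler accepts the interleave_count loop hint in the same way as upstream Clang. The function name multiply_interleaved and the factor of 2 are illustrative. The vectorize(enable) hint is included because, as the note above says, interleaving is only meaningful on a vectorized loop.

```c
/* Illustrative only (hypothetical function name): asks the compiler to
 * vectorize the loop with an interleaving factor of 2. A compiler that does
 * not recognize the pragma ignores it, possibly with a warning, and the
 * result of the loop is the same either way. */
void multiply_interleaved(int *__restrict__ data, const int *__restrict__ input,
                          const int *__restrict__ other, int count)
{
#pragma clang loop vectorize(enable) interleave_count(2)
    for (int i = 0; i < count; i++)
    {
        data[i] = input[i] * other[i];
    }
}
```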
3.3 Optimizing C/C++ code with Arm SIMD (Neon™)
The instruction ADD V0.2D, V1.2D, V2.2D specifies that an addition (ADD) operation is performed on two 64-bit data lanes (2D). D
specifies the width of the data lane (doubleword, or 64 bits) and 2 specifies that two lanes are used (that
is the full 128-bit register). Each lane in V1 is added to the corresponding lane in V2 and the result stored
in V0. Each lane is added separately. There are no carries between the lanes.
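The lane-wise behavior described above can be modeled in portable C. This is only a sketch of the semantics and is not Neon code: the type vec2d and the function add_2d are hypothetical names introduced for illustration.

```c
#include <stdint.h>

/* Portable model of a 128-bit register viewed as two 64-bit lanes (2D). */
typedef struct { uint64_t lane[2]; } vec2d;

/* Models a lane-wise 64-bit add: each lane wraps around on its own, and
 * nothing carries from lane 0 into lane 1. */
vec2d add_2d(vec2d v1, vec2d v2)
{
    vec2d v0;
    for (int i = 0; i < 2; i++)
    {
        v0.lane[i] = v1.lane[i] + v2.lane[i];
    }
    return v0;
}
```

In this model, adding 1 to a lane that holds the maximum 64-bit value wraps that lane to zero and leaves the neighboring lane unaffected.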
3.4 Optimizing C/C++ code with SVE
The instruction ADD Z0.D, P0/M, Z1.D, Z2.D specifies that an addition (ADD) operation is performed on an SVE vector register, split
into 64-bit data lanes. D specifies the width of the data lane (doubleword, or 64 bits). The width of each
vector register is some multiple of 128 bits, between 128 and 2048, but is not specified by the
architecture. The predicate register P0 specifies which lanes must be active. Each active lane in Z1 is
added to the corresponding lane in Z2 and the result stored in Z0. Each lane is added separately. There are
no carries between the lanes. The merge flag /M on the predicate specifies that inactive lanes retain their
prior value.
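The predicated, merging behavior described above can also be modeled in portable C. The sketch below is not SVE code: the function name add_d_merging and the lanes parameter, which stands in for the implementation-defined number of 64-bit lanes, are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Portable model of a predicated, merging lane-wise add over 64-bit lanes. */
void add_d_merging(uint64_t *z0, const bool *p0,
                   const uint64_t *z1, const uint64_t *z2, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++)
    {
        if (p0[i])
        {
            z0[i] = z1[i] + z2[i];  /* active lane: store the sum */
        }
        /* inactive lane: z0[i] keeps its prior value (merging, /M) */
    }
}
```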
3.5 Writing inline SVE assembly
Note
This information assumes that you are familiar with details of the SVE Architecture, including vector-
length agnostic registers, predication, and WHILE operations.
Using inline assembly rather than writing a separate .s file has the following advantages:
• Inline assembly code shifts the burden of handling the procedure call standard (PCS) from the
programmer to the compiler. This includes allocating the stack frame and preserving all necessary
callee-saved registers.
• Inline assembly code gives the compiler more information about what the assembly code does.
• The compiler can inline the function that contains the assembly code into its callers.
• Inline assembly code can take immediate operands that depend on C-level constructs, such as the size
of a structure or the byte offset of a particular structure field.
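An inline asm statement has the following general form:
asm ("instructions" : outputs : inputs : side-effects);
Where:
instructions
is a text string that contains AArch64 assembly instructions, with at least one newline sequence
\n between consecutive instructions.
outputs
is a comma-separated list of outputs from the assembly instructions.
inputs
is a comma-separated list of inputs to the assembly instructions.
side-effects
is a comma-separated list of effects that the assembly instructions have, besides reading from
inputs and writing to outputs.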
Additionally, the asm keyword might need to be followed by the volatile keyword.
Outputs
Each entry in outputs has one of the following forms:
[name] "=&register-class" (destination)
[name] "=register-class" (destination)
The first form has the register class preceded by =&. This specifies that the assembly instructions might
read from one of the inputs (specified in the asm statement’s inputs section) after writing to the output.
The second form has the register class preceded by =. This specifies that the assembly instructions never
read from inputs in this way. Using the second form is an optimization. It allows the compiler to allocate
the same register to the output as it allocates to one of the inputs.
Both forms specify that the assembly instructions produce an output that the compiler should store in the
C object specified by destination. This can be any scalar value that is valid for the left-hand side of a C
assignment. The register-class field specifies the type of register that the assembly instructions require. It
can be one of:
r
if the register for this output when used within the assembly instructions is a general-purpose
register (x0-x30)
w
if the register for this output when used within the assembly instructions is a SIMD and floating-
point register (v0-v31).
It is not possible at present for outputs to contain an SVE vector or predicate value. All uses of SVE
registers must be internal to the inline assembly block.
It is the responsibility of the compiler to allocate a suitable output register and to copy that register into
the destination after the asm statement is executed. The assembly instructions within the instructions
section of the asm statement can use one of the following forms to refer to the output value:
%[name]
In optimized output the compiler picks the return register (0) for res, resulting in the following assembly
code:
movz w0, #10
ret
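The output described above could come from a function such as the following sketch, which uses the second output form ("=r") and refers to the output as %w[res]. The #if guard and the plain C fallback are additions so that the fragment also builds on targets other than AArch64. They are not part of the technique itself.

```c
/* Returns 10. On AArch64 the value is produced by inline assembly; on any
 * other target a plain C fallback keeps the sketch buildable. */
int f(void)
{
    int result;
#if defined(__aarch64__)
    /* %w[res] names the 32-bit (w) view of the register chosen for res. */
    asm("movz %w[res], #10" : [res] "=r" (result));
#else
    result = 10;
#endif
    return result;
}
```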
Inputs
Within an asm statement, each entry in the inputs section has the form:
[name] "operand-type" (value)
This construct specifies that the asm statement uses the scalar C expression value as an input, referred to
within the assembly instructions as name. The operand-type field specifies how the input value is
handled within the assembly instructions. It can be one of the following:
r
if the input is to be placed in a general-purpose register (x0-x30).
w
if the input is to be placed in a SIMD and floating-point register (v0-v31).
[output-name]
if the input is to be placed in the same register as output output-name. In this case the [name]
part of the input specification is redundant and can be omitted. The assembly instructions can
use the forms described in the Outputs section above (%[name], %w[name], %s[name],
%d[name]) to refer to both the input and the output.
i
if the input is an integer constant and is used as an immediate operand. The assembly
instructions use %[name] in place of immediate operand #N, where N is the numerical value of
value.
In the first two cases, it is the responsibility of the compiler to allocate a suitable register and to ensure
that it contains value on entry to the assembly instructions. The assembly instructions must refer to these
registers using the same syntax as for the outputs (%[name], %w[name], %s[name], %d[name]).
It is not possible at present for inputs to contain an SVE vector or predicate value. All uses of SVE
registers must be internal to instructions.
This example shows an asm directive with the same effect as the previous example, except that an i-form
input is used to specify the constant to be assigned to the result.
int f()
{
    int result;
    asm("movz %w[res], %[value]" : [res] "=r" (result) : [value] "i" (10));
    return result;
}
Side effects
Many asm statements have effects other than reading from inputs and writing to outputs. This is
particularly true of asm statements that implement vectorized loops, since most such loops read from or
write to memory. The side-effects section of an asm statement tells the compiler what these additional
effects are. Each entry must be one of the following:
"memory"
if the asm statement reads from or writes to memory. This is necessary even if inputs contain
pointers to the affected memory.
"cc"
if the asm statement modifies the condition-code flags.
"zN"
if the asm statement modifies SVE vector register N. Since SVE vector registers extend the
SIMD and floating-point registers, this is equivalent to writing "vN".
"pN"
if the asm statement modifies SVE predicate register N.
Use of volatile
Sometimes an asm statement might have dependencies and side effects that cannot be captured by the
asm statement syntax. For example, suppose there are three separate asm statements (not three lines
within a single asm statement), that do the following:
• The first sets the floating-point rounding mode.
• The second executes on the assumption that the rounding mode set by the first statement is in effect.
• The third statement restores the original floating-point rounding mode.
It is important that these statements are executed in order, but the asm statement syntax provides no direct
method for representing the dependency between them. Instead, each statement must add the keyword
volatile after asm. This prevents the compiler from removing the asm statement as dead code, even if
the asm statement does not modify memory and if its results appear to be unused. The compiler always
executes asm volatile statements in their original order.
For example:
asm volatile ("msr fpcr, %[flags]" :: [flags] "r" (new_fpcr_value));
Note
An asm volatile statement must still have a valid side effects list. For example, an asm volatile
statement that modifies memory must still include "memory" in the side-effects section.
Labels
The compiler might output a given asm statement more than once, either as a result of optimizing the
function that contains the asm statement or as a result of inlining that function into some of its callers.
Therefore, asm statements must not define named labels like .loop, since if the asm statement is written
more than once, the output contains more than one definition of label .loop. Instead, the assembler
provides a concept of relative labels. Each relative label is simply a number and is defined in the same
way as a normal label. For example, relative label 1 is defined by:
1:
The assembly code can contain many definitions of the same relative label. Code that refers to a relative
label must add the letter f to refer to the next definition (f is for forward) or the letter b (backward) to refer
to the previous definition. A typical assembly loop with a pre-loop test would therefore have the
following structure. This allows the compiler output to contain many copies of this code without creating
any ambiguity.
...pre-loop test...
b.none 2f
1:
...loop...
b.any 1b
2:
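A complete, scalar instance of that structure might look like the following sketch. It is not taken from the compiler documentation: the function name sum_down is hypothetical, and the operands use the GNU "+r" read-write constraint, which this section does not describe. The #if guard and C fallback are additions so that the fragment builds on targets other than AArch64.

```c
/* Sums n + (n-1) + ... + 1 using a loop written with relative labels. */
unsigned long sum_down(unsigned long n)
{
    unsigned long sum = 0;
#if defined(__aarch64__)
    asm("cbz %[n], 2f \n"              /* pre-loop test: skip loop if n == 0 */
        "1: \n"
        "add %[sum], %[sum], %[n] \n"
        "subs %[n], %[n], #1 \n"       /* sets the condition flags ("cc") */
        "b.ne 1b \n"                   /* back to the previous label 1 */
        "2: \n"
        : [sum] "+r" (sum), [n] "+r" (n)
        :
        : "cc");
#else
    while (n != 0)
    {
        sum += n;
        n--;
    }
#endif
    return sum;
}
```

Because the labels are the relative labels 1 and 2, the function can be inlined into several callers without producing duplicate label definitions.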
Example
The following example shows a simple function that performs a fused multiply-add operation (x=a·b+c)
across four passed-in arrays of a size specified by n:
void f(double *restrict x, double *restrict a, double *restrict b, double *restrict c,
unsigned long n)
{
    for (unsigned long i = 0; i < n; ++i)
    {
        x[i] = fma(a[i], b[i], c[i]);
    }
}
An asm statement that exploited SVE instructions to achieve equivalent behavior might look like the
following:
void f(double *x, double *a, double *b, double *c, unsigned long n)
{
unsigned long i;
asm ("whilelo p0.d, %[i], %[n] \n\
1: \n\
ld1d z0.d, p0/z, [%[a], %[i], lsl #3] \n\
ld1d z1.d, p0/z, [%[b], %[i], lsl #3] \n\
ld1d z2.d, p0/z, [%[c], %[i], lsl #3] \n\
fmla z2.d, p0/m, z0.d, z1.d \n\
st1d z2.d, p0, [%[x], %[i], lsl #3] \n\
uqincd %[i] \n\
whilelo p0.d, %[i], %[n] \n\
b.any 1b"
: [i] "=&r" (i)
: "[i]" (0),
[x] "r" (x),
[a] "r" (a),
[b] "r" (b),
[c] "r" (c),
[n] "r" (n)
: "memory", "cc", "p0", "z0", "z1", "z2");
}
Note
Keeping the restrict qualifiers would be valid but would have no effect.
The input specifier "[i]" (0) indicates that the assembly statements take an input 0 in the same register
as output [i]. In other words, the initial value of [i] must be zero. The use of =& in the specification of
[i] indicates that [i] cannot be allocated to the same register as [x], [a], [b], [c], or [n] (because the
assembly instructions use those inputs after writing to [i]).
In this example, the C variable i is not used after the asm statement. In effect the asm statement is simply
reserving a register that it can use as scratch space. Including "memory" in the side effects list indicates
that the asm statement reads from and writes to memory. The compiler must therefore keep the asm
statement even though i is not used.
Chapter 4
Standards support
Describes the support status of Arm C/C++ Compiler for the OpenMP standards.
It contains the following sections:
• 4.1 OpenMP 4.0.
• 4.2 OpenMP 4.5.
4.1 OpenMP 4.0
Device constructs No
Cancellation Yes
OMP_DISPLAY_ENV Yes
4.2 OpenMP 4.5
Chapter 5
Arm Optimization Report
Arm Optimization Report is a new feature of Arm Compiler for Linux version 20.0 that builds upon the
llvm-opt-report tool available in open-source LLVM. Arm Optimization Report makes it easier to see
what optimization decisions the compiler is making, in-line with your source code.
Arm Optimization Report helps you answer questions regarding unrolling, vectorization, and
interleaving:
Unrolling
Example questions: Was a loop unrolled? If so, what was the unroll factor?
Unrolling is when a scalar loop is transformed to perform multiple iterations at once, but still as scalar
instructions.
The unroll factor is the number of iterations of the original loop that are performed at once. Sometimes,
loops with known small iteration counts are completely unrolled, such that no loop structure remains. In
completely unrolled cases, the unroll factor is the total scalar iteration count.
Vectorization
Example questions: Was a loop vectorized? If so, what was the vectorization factor?
Vectorization is when multiple iterations of a scalar loop are replaced by a single iteration of vector
instructions.
The vectorization factor is the number of lanes in the vector unit, and corresponds to the number of scalar
iterations performed by each vector instruction.
Note
The true vectorization factor is unknown at compile-time for SVE, because SVE supports scalable
vectors.
For this reason, when SVE is enabled, Arm Optimization Report reports a vectorization factor that
corresponds to a 128-bit SVE implementation.
If you are working with an SVE implementation with a larger vector width (for example, 256 or 512
bits), the number of scalar iterations performed by each vector instruction increases proportionally.
SVE scaling factor = <true SVE vector width> / 128
Interleaving
Example question: What was the interleave count?
Interleaving is a combination of vectorization followed by unrolling; multiple streams of vector
instructions are performed in each iteration of the loop.
This combination of vectorization and unrolling information lets you know how many iterations of the
original scalar loop are performed in each iteration of the generated code.
Number of scalar iterations = <unroll factor> x <vectorization factor> x <interleave count>
x <SVE scaling factor>
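The formula above can be worked through with a small helper. This is a sketch only: the function name scalar_iterations and its parameters are hypothetical, and the example values in the usage note are not measurements from any real loop.

```c
/* Scalar iterations of the original loop covered by one iteration of the
 * generated loop. sve_bits is the true SVE vector width in bits. Pass 128
 * for a 128-bit SVE implementation, which gives a scaling factor of 1. */
unsigned scalar_iterations(unsigned unroll_factor,
                           unsigned vectorization_factor,
                           unsigned interleave_count,
                           unsigned sve_bits)
{
    unsigned sve_scaling_factor = sve_bits / 128;
    return unroll_factor * vectorization_factor * interleave_count
           * sve_scaling_factor;
}
```

For example, an unroll factor of 1, a reported vectorization factor of 4, and an interleave count of 2 give 1 x 4 x 2 x 1 = 8 scalar iterations on a 128-bit implementation, and 1 x 4 x 2 x 4 = 32 on a 512-bit SVE implementation.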
Reference
The annotations Arm Optimization Report uses to annotate the source code, and the options that can be
passed to arm-opt-report are described in the Arm Optimization Report reference.
It contains the following sections:
• 5.1 How to use Arm Optimization Report.
• 5.2 arm-opt-report reference.
5.1 How to use Arm Optimization Report
Prerequisites
Download and install Arm Compiler for Linux version 20.0+. For more information, see Download Arm
Compiler for Linux and Installation.
Procedure
1. To generate a machine-readable .opt.yaml report, at compile time add
-fsave-optimization-record to your command line.
This generates a file, example.opt.yaml, in the same directory as the built object.
For compilations that create multiple object files, there is a report for each build object.
Note
This example compiles to a shared object; however, you could also compile to a static object or to a
binary.
10 | }
11 |
12 U16 | for (i = 0; i < 16; i++) {
13 | res[i] = (p[i] == 0) ? res[i] : res[i] + d[i];
14 | }
15 |
16 I | foo();
17 |
18 | foo(); bar(); foo();
I | ^
I | ^
19 | }
Related references
5.2 arm-opt-report reference
Related information
Arm Compiler for Linux and Arm Allinea Studio
Take a trial
Help and tutorials
5.2 arm-opt-report reference
Annotation Description
I A function was inlined.
U<N> A loop was unrolled <N> times.
Syntax
arm-opt-report [options] <input>
Options
Generic Options:
--help
Hides remarks about vectorization being forced despite the cost-model indicating that it is not
beneficial.
--hide-inline-hints
Hides remarks about the calls to library functions that are preventing vectorization.
--hide-vectorization-cost-info
Hides remarks about the cost of loops that are not beneficial for vectorization.
--no-demangle
Outputs
Annotated source code.
Related tasks
5.1 How to use Arm Optimization Report
Chapter 6
Optimization remarks
Optimization remarks provide you with information about the choices made by the compiler. You can use
them to see which code has been inlined, or to understand why a loop has not been
vectorized. By default, Arm C/C++ Compiler prints compilation information to stderr. Optimization
remarks print this optimization information to the terminal, or you can choose to redirect them to an output
file.
To enable optimization remarks, choose from the following -Rpass options:
• -Rpass=<regex>: Information about what the compiler has optimized.
• -Rpass-analysis=<regex>: Information about what the compiler has analyzed.
• -Rpass-missed=<regex>: Information about what the compiler failed to optimize.
For each option, replace <regex> with an expression for the type of remarks you wish to view.
Recommended <regex> queries are:
• -Rpass=\(loop-vectorize\|inline\|loop-unroll\)
• -Rpass-missed=\(loop-vectorize\|inline\|loop-unroll\)
• -Rpass-analysis=\(loop-vectorize\|inline\|loop-unroll\)
where loop-vectorize filters remarks regarding vectorized loops, inline for remarks regarding
inlining, and loop-unroll for remarks about unrolled loops.
Note
To search for all remarks, use the expression .*. Use this expression with caution; depending on the size
of code, and the level of optimization, a lot of information can print.
To compile with optimization remarks enabled and redirect the information to an output file, pass the
options selected above and debug information to armclang, and use 2> <output_filename>.txt. For
example:
armclang -O<level> -Rpass[-<option>]=<remark> <filename>.c 2> <output_filename>.txt
6.1 Enable Optimization remarks
Procedure
1. Compile your code. Use the -Rpass=<regex>, -Rpass-missed=<regex>, or
-Rpass-analysis=<regex> options:
Result:
example.c:8:18: remark: hoisting zext [-Rpass=licm]
for (int i=0;i<K; i++)
^
example.c:8:4: remark: vectorized loop (vectorization width: 4, interleaved count: 2) [-Rpass=loop-vectorize]
for (int i=0;i<K; i++)
^
example.c:7:1: remark: 28 instructions in function [-Rpass-analysis=asm-printer]
void foo(int K) {
^
2. Pipe the loop vectorization optimization remarks to a file. For example, to pipe to a file called
vecreport.txt, use:
armclang -O3 -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize
-Rpass-missed=loop-vectorize example.c 2> vecreport.txt
Alternatively, to enable optimization remarks and pipe the output information to a file, use:
armclang -O<level> -Rpass[-<option>]=<remark> <example>.c 2> <output_filename>.txt
Chapter 7
Vector math routines
Describes how to use the libamath library which contains the SIMD implementation of the routines
provided by libm.
It contains the following sections:
• 7.1 Vector math routines in Arm® C/C++ Compiler.
• 7.2 Interface user vector functions with serial code.
7.1 Vector math routines in Arm® C/C++ Compiler
Examples
The following examples show loops with math function calls that can be vectorized by invoking the
compiler with:
armclang -fsimdmath -c -O2 source.c
How it works
Arm C/C++ Compiler contains libamath, a library with SIMD implementations of the routines provided
by libm, along with a math.h file that declares the availability of these SIMD functions to the compiler.
During loop vectorization, the compiler is aware of these vectorized routines, and can replace a call to a
scalar function (for example, a double-precision call to sin) with a call to a libamath function that takes
a vector of double precision arguments, and returns a result vector of doubles.
The libamath library is built using code based on SLEEF, an open source math library available from
the SLEEF website.
Limitations
This is an experimental feature which can lead to performance degradations in some cases. Arm
encourages users to test the applicability of this feature on their non-production code, and will address
any possible inefficiency in a future release.
Contact Arm Support
Related information
SLEEF website
Vector function ABI specification for AArch64
Arm C/C++ Compiler
Help and tutorials
7.2 Interface user vector functions with serial code
To compile the code, invoke armclang with either the -fopenmp or the -fopenmp-simd options
(automatic loop vectorization is activated starting from optimization level -O2):
$> armclang -fopenmp -O2 -c user_code.c -o objfile.o
You must link the output object file against an object file or library that provides the symbol neon_foo.
The following example shows the basic functionality for SVE vectorization:
// declarations or definitions visible at compile time in myvecroutines.h
#include <arm_sve.h>
svint32_t sve_foo(svfloat64_t, svbool_t);
#pragma omp declare variant(sve_foo) \
    match(construct = {simd(notinbranch)}, \
          device = {isa("sve")}, \
          implementation = {extension("scalable")})
int foo(double);
// loop in the user code, in user_code.c
#include "path/to/myvecroutines.h"
void do_something(int * a, double * b, unsigned N) {
    for (unsigned i = 0; i < N; ++i)
        a[i] = foo(b[i]);
}
To compile the code, invoke armclang with either the -fopenmp or the -fopenmp-simd options
(automatic loop vectorization is activated starting from optimization level -O2):
armclang -march=armv8-a+sve -fopenmp -O2 -c user_code.c -o objfile.o
You must link the output object file against an object file or library that provides the symbol sve_foo.
The vector function that is associated with the scalar function must have a signature that obeys the rules
in the chapter on USER DEFINED VECTOR FUNCTIONS of the Vector Function Application Binary
Interface (VFABI) Specification for AArch64. The rules are summarized in section Mapping rules.
The linear clause in the simd trait is only supported for pointers with a linear step of 1. There is no
support for linear modifiers.
• To allow scalable vectorization when targeting SVE, you must omit the simdlen clause, and you
must specify the implementation trait extension extension("scalable").
• The supported scalar function signatures in C and C++ have the following forms:
1. void (Ty1, Ty2,..., TyN)
2. Ty1 (Ty2, Ty3,..., TyN)
where Ty#n are:
1. Any of the integral type values of size 1, 2, 4, or 8 (in bytes), signed and unsigned.
2. Floating-point type values of half, single or double precision.
3. Pointers to any of the previous types.
There is no support for variadic functions or C++ templates.
Mapping rules
Common mapping rules
1. Each parameter and the return value of the scalar function maps to a corresponding parameter and
return value in the vector signature, in the same order.
2. A parameter that is marked with linear is left unchanged in the vector signature.
3. The void return type is left unchanged in the vector signature.
Mapping rules for Advanced SIMD
1. Each parameter type Ty#n maps to the corresponding NEON ACLE type <Ty#n>x<N>_t, where N is the
value specified in the simdlen(N) clause. Values of N that do not correspond to NEON ACLE types
are unsupported.
2. If you specify inbranch, an additional mask parameter is added as the last parameter of the vector
signature. The type of the parameter is the NEON ACLE type uint<BITS>x<N>_t, where:
a. N is the value specified in the simdlen(N) clause.
b. BITS is the size (in bits) of the Narrowest Data Size (NDS) associated with the scalar function, as
defined in the VFABI.
c. To select active or inactive lanes, set all bits to 1 (active) or 0 (inactive) in the corresponding
uint<BITS>_t integer in the mask vector.
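The mask convention for Advanced SIMD can be modelled in portable C. The following is a minimal sketch, not ACLE code: the types vec_f64x2 and mask_u64x2 and the functions make_mask and masked_negate are hypothetical names that stand in for float64x2_t, uint64x2_t, and an inbranch vector function with simdlen(2) and an NDS of 8 bytes. The sketch assumes that inactive lanes are passed through unchanged, which is a choice made here for illustration only.

```c
#include <stdint.h>

/* Hypothetical stand-ins for float64x2_t and uint64x2_t (simdlen(2), BITS = 64). */
typedef struct { double lane[2]; } vec_f64x2;
typedef struct { uint64_t lane[2]; } mask_u64x2;

/* Build a mask: all bits set to 1 for an active lane, all bits 0 for an inactive lane. */
static mask_u64x2 make_mask(int active0, int active1) {
    mask_u64x2 m;
    m.lane[0] = active0 ? UINT64_MAX : 0;
    m.lane[1] = active1 ? UINT64_MAX : 0;
    return m;
}

/* Model of an inbranch vector function: the mask is the last parameter.
   Only active lanes are computed; inactive lanes are passed through
   unchanged (an assumption made for this illustration). */
static vec_f64x2 masked_negate(vec_f64x2 x, mask_u64x2 mask) {
    vec_f64x2 r = x;
    for (int i = 0; i < 2; ++i) {
        if (mask.lane[i] == UINT64_MAX)
            r.lane[i] = -x.lane[i];
    }
    return r;
}
```

For example, make_mask(1, 0) marks only the first lane as active, so masked_negate changes the first element and leaves the second untouched.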
Mapping rules for SVE
1. Each parameter type Ty#n is mapped to the corresponding SVE ACLE type sv<Ty#n>_t.
2. An extra mask parameter of type svbool_t is always added to the signature of the vector function,
whether inbranch or notinbranch is used. Active and inactive lanes of the mask are set as described
in the section SVE Masking of the VFABI:
“The logical lane subdivision of the predicate corresponds to the lane subdivision of the vector data
type generated for the Widest Data Type (WDS), with one bit in the predicate lane for each byte of
the data lane. Active logical lanes of the predicate have the least significant bit set to 1, and the rest
set to zero. The bits of the inactive logical lanes of the predicate are set to zero.”
For example, in the function svfloat64_t F(svfloat32_t vx, svbool_t), the WDS is 8 bytes, therefore
each logical lane of the mask spans 8 bits. Active lanes are set with the bit sequence 00000001, and
inactive lanes are set with 00000000.
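The predicate layout described above can be modelled in portable C. The following is a minimal sketch under stated assumptions: build_predicate is a hypothetical helper, the predicate is stored as a plain byte array with one bit for each byte of the data vector, and the vector length is passed explicitly because the model does not use real SVE registers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill pred with the mask for a data vector of vl_bytes bytes that is
   subdivided into logical lanes of wds bytes each. The predicate holds one
   bit per data byte, so it occupies vl_bytes / 8 bytes. For each i,
   active[i] != 0 marks logical lane i as active.
   Returns the number of logical lanes. */
static size_t build_predicate(uint8_t *pred, size_t vl_bytes, size_t wds,
                              const int *active) {
    size_t lanes = vl_bytes / wds;
    memset(pred, 0, vl_bytes / 8);
    for (size_t i = 0; i < lanes; ++i) {
        if (active[i]) {
            /* Only the least significant bit of the lane is set. */
            size_t bit = i * wds;
            pred[bit / 8] |= (uint8_t)(1u << (bit % 8));
        }
    }
    return lanes;
}
```

With a WDS of 8 bytes, each logical lane maps to exactly one byte of the modelled predicate, so an active lane appears as 0x01 and an inactive lane as 0x00, matching the bit sequences given above.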
Examples
The following examples show how to vectorize loops with a custom user vector function. The examples
use:
• -O2, the lowest optimization level at which loop auto-vectorization is enabled.
• -fopenmp to enable the parsing of the OpenMP directives.
Note
• The same functionality for declare variant can also be achieved with -fopenmp-simd.
• -mllvm -force-vector-interleave=1 simplifies the output and can be omitted for regular
compiler invocations.
The code in these examples has been produced by Arm Compiler for Linux 20.0.
For both Advanced SIMD and SVE, the linear clause can improve the vectorization of functions
accessing memory through contiguous pointers. For example, in the function double sincos(double,
double *, double *), the memory pointed to by the pointer parameters is contiguous across loop
iterations. To improve the vectorization of this function, use the linear clause:
#include <arm_sve.h>
svfloat64_t CustomSinCos(svfloat64_t, double *, double *, svbool_t);
#pragma omp declare variant(CustomSinCos) \
    match(construct = {simd(notinbranch, linear(sinp), linear(cosp))}, \
          device = {isa("sve")}, \
          implementation = {extension("scalable")})
double sincos(double in, double *sinp, double *cosp);
void f(double *in, double *sin, double *cos, unsigned N) {
  for (unsigned i = 0; i < N; ++i)
    sincos(in[i], &sin[i], &cos[i]);
}
To produce a vector loop that invokes user_vector_foo, compile the example code with
armclang -fopenmp -O2 -c -S -o - example01.c -mllvm -force-vector-interleave=1:
//...
.LBB0_4: // =>This Inner Loop Header: Depth=1
ldr q0, [x25], #16
bl user_vector_foo
subs x23, x23, #2 // =2
str q0, [x24], #16
b.ne .LBB0_4
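The loop above loads one 128-bit register, calls the vector function, stores the result, and decrements the counter by 2 on each iteration. The following portable C sketch models that behavior under stated assumptions: vec_f64x2, model_vector_foo, model_scalar_foo, and do_something_model are hypothetical names, the vector function is assumed to square each lane, and the scalar remainder loop is an assumption about how leftover elements are handled.

```c
#include <stddef.h>

/* Hypothetical stand-in for float64x2_t: two 64-bit lanes in one 128-bit register. */
typedef struct { double lane[2]; } vec_f64x2;

/* Hypothetical scalar function and its two-lane vector variant. */
static double model_scalar_foo(double x) { return x * x; }

static vec_f64x2 model_vector_foo(vec_f64x2 x) {
    vec_f64x2 r;
    r.lane[0] = x.lane[0] * x.lane[0];
    r.lane[1] = x.lane[1] * x.lane[1];
    return r;
}

/* Model of the vectorized loop: two elements per iteration through the
   vector variant, then a scalar loop for any leftover element (assumed). */
static void do_something_model(double *a, const double *b, size_t n) {
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        vec_f64x2 v = {{b[i], b[i + 1]}};          /* models: ldr q0, [x25], #16 */
        vec_f64x2 r = model_vector_foo(v);         /* models: bl user_vector_foo */
        a[i] = r.lane[0];                          /* models: str q0, [x24], #16 */
        a[i + 1] = r.lane[1];
    }
    for (; i < n; ++i)
        a[i] = model_scalar_foo(b[i]);
}
```

The model makes the main point of the mapping visible: one call to the vector variant replaces two calls to the scalar function, and both must compute the same per-element result.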
With linear:
// filename: example02.c
#include <arm_neon.h>
__attribute__((aarch64_vector_pcs)) float64x2_t user_vector_foo_linear(float64x2_t, float *);
#pragma omp declare variant(user_vector_foo_linear) \
    match(construct = {simd(simdlen(2), notinbranch, linear(b))}, \
          device = {isa("simd")})
To produce a vector loop that invokes user_vector_foo_linear, compile this code with
armclang -fopenmp -O2 -c -S -o - example02.c -mllvm -force-vector-interleave=1:
Examples: SVE
Simple:
// filename: example03.c
#include <arm_sve.h>
svfloat16_t user_vector_foo_sve(svfloat64_t a, svbool_t mask);
#pragma omp declare variant(user_vector_foo_sve) \
    match(construct = {simd(notinbranch)}, \
          device = {isa("sve")}, \
          implementation = {extension("scalable")})
float16_t foo(double);
void do_something(float16_t * restrict a, double * b, unsigned N) {
  for (unsigned i = 0; i < N; ++i)
    a[i] = foo(b[i]);
}
With linear:
// filename: example04.c
#include <arm_sve.h>
svfloat64_t user_vector_foo_linear_sve(svfloat64_t, float *, svbool_t);
#pragma omp declare variant(user_vector_foo_linear_sve) \
    match(construct = {simd(notinbranch, linear(b))}, \
          device = {isa("sve")}, \
          implementation = {extension("scalable")})
double foo_linear(double a, float* b);
void do_something_linear(double * restrict a, double * b, float * x, unsigned N) {
  for (unsigned i = 0; i < N; ++i)
    a[i] = foo_linear(b[i], &x[i]);
}
To generate an invocation of the user vector function user_vector_foo_linear_sve in the vector loop,
compile the code with armclang example04.c -march=armv8-a+sve -O2 -o - -S -fopenmp:
.LBB0_2: // %vector.body
// =>This Inner Loop Header: Depth=1
ld1d { z0.d }, p4/z, [x20, x22, lsl #3]
add x0, x19, x22, lsl #2
mov p0.b, p4.b
bl user_vector_foo_linear_sve
st1d { z0.d }, p4, [x21, x22, lsl #3]
incd x22
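In the loop above, the pointer parameter marked with linear is passed as a plain base address (add x0, x19, x22, lsl #2), while the double data is passed in a vector register together with a predicate. The following portable C sketch models that calling pattern under stated assumptions: model_vec_f64, model_pred, model_foo_linear_sve, and do_something_linear_model are hypothetical names, the operation (adding the float to the double) is invented for illustration, the lane count is passed explicitly because the model has no hardware vector length, and the handling of the final partial vector through the predicate is an assumption.

```c
#include <stddef.h>

#define MODEL_MAX_LANES 32 /* 2048-bit vector / 64-bit lanes */

/* Hypothetical stand-ins for svfloat64_t and svbool_t. */
typedef struct { double lane[MODEL_MAX_LANES]; } model_vec_f64;
typedef struct { int active[MODEL_MAX_LANES]; } model_pred;

/* Model of the vector variant: 'a' is a vector of values, 'b' is the
   unchanged linear pointer, so lane l reads b[l]. Inactive lanes are
   neither read through b nor modified. */
static model_vec_f64 model_foo_linear_sve(model_vec_f64 a, const float *b,
                                          model_pred mask, size_t lanes) {
    model_vec_f64 r = a;
    for (size_t l = 0; l < lanes; ++l) {
        if (mask.active[l])
            r.lane[l] = a.lane[l] + (double)b[l];
    }
    return r;
}

/* Model of the vector loop: each iteration handles 'lanes' elements. */
static void do_something_linear_model(double *a, const double *b,
                                      const float *x, size_t n, size_t lanes) {
    if (lanes == 0 || lanes > MODEL_MAX_LANES)
        return;
    for (size_t i = 0; i < n; i += lanes) {
        model_vec_f64 v = {{0}};
        model_pred p = {{0}};
        for (size_t l = 0; l < lanes && i + l < n; ++l) {
            v.lane[l] = b[i + l]; /* models the predicated ld1d */
            p.active[l] = 1;
        }
        /* &x[i] is the base address for this iteration (linear step of 1). */
        model_vec_f64 r = model_foo_linear_sve(v, &x[i], p, lanes);
        for (size_t l = 0; l < lanes && i + l < n; ++l)
            a[i + l] = r.lane[l]; /* models the predicated st1d */
    }
}
```

Because the pointer has a linear step of 1, the vector function can derive the address for every lane from the single base address it receives, which is why the parameter is left unchanged in the vector signature.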
Chapter 8
Further resources
Describes where to find more resources about Arm C/C++ Compiler (part of Arm Compiler for Linux).
It contains the following section:
• 8.1 Further resources for Arm® C/C++ Compiler.
8.1 Further resources for Arm® C/C++ Compiler
Note
An HTML version of this guide is available in the <install_location>/<package_name>/share
directory of your product installation.