Most Linux* distributions include the GNU* C library, assembler, linker, and others. The Intel C++
Compiler includes the Dinkumware* C++ library. See Libraries Overview.
Please look at the individual sections within each main section of this User's Guide to gain an overview
of the topics presented. For the latest information, visit the Intel Web site: http://developer.intel.com/.
Disclaimer
This Intel® C++ Compiler User's Guide as well as the software described in it, is furnished under
license and may only be used or copied in accordance with the terms of the license. The information in
this manual is furnished for informational use only, is subject to change without notice, and should not
be construed as a commitment by Intel Corporation. Intel Corporation assumes no responsibility or
liability for any errors or inaccuracies that may appear in this document or any software that may be
provided in association with this document.
Except as permitted by such license, no part of this document may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means without the express written consent of Intel
Corporation.
Information in this document is provided in connection with Intel products. No license, express or
implied, by estoppel or otherwise, to any intellectual property rights is granted by this document.
EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS,
INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR
IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING
LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE,
MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER
INTELLECTUAL PROPERTY RIGHT.
Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may
make changes to specifications and product descriptions at any time, without notice.
Designers must not rely on the absence or characteristics of any features or instructions marked
"reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility
whatsoever for conflicts or incompatibilities arising from future changes to them.
The Intel C++ Compiler may contain design defects or errors known as errata which may cause the
product to deviate from published specifications. Current characterized errata are available on request.
Celeron, Dialogic, i386, i486, iCOMP, Intel, Intel logo, Intel386, Intel486, Intel740, IntelDX2, IntelDX4,
IntelSX2, Intel Inside, Intel Inside logo, Intel NetBurst, Intel NetStructure, Intel Xeon, Intel XScale,
Itanium, MMX, MMX logo, Pentium, and VTune are trademarks or registered trademarks of Intel
Corporation or its subsidiaries in the United States and other countries.
Intel® C++ Compiler User's Guide
Feature: Processor dispatch
Benefit: Takes advantage of the latest Intel architecture features while maintaining object code compatibility with previous generations of Intel® Pentium® processors (IA-32-based systems only).
For specific details on the Itanium® architecture, visit the web site at
http://developer.intel.com/design/itanium/under_lnx.htm.
System Requirements
IA-32 Processor System Requirements
• A computer based on a Pentium® processor or subsequent IA-32 processor (Pentium 4 processor recommended).
• 128 MB of RAM (256 MB recommended).
• 100 MB of disk space.
Software Requirements
For a complete list of system requirements, see the Release Notes.
If you require a counted license, see "Using the Intel® License Manager for
FLEXlm*" (flex_ug.pdf).
This documentation assumes that you are familiar with the C and C++ programming languages and
with the Intel processor architecture. You should also be familiar with the host computer's operating
system.
Note
This document explains how information and instructions apply differently to each targeted
architecture. If there is no specific indication to either architecture, the description is applicable to both
architectures.
Conventions
This documentation uses the following conventions:
This type style: Indicates an element of syntax, a reserved word, a keyword, a filename, computer output, or part of a program example. The text appears in lowercase unless uppercase is significant.
This type style: Indicates the exact characters you type as input.
... (ellipses): Indicate that you can repeat the preceding item.
_mm_<intrin_op>_<suffix>
<intrin_op> Indicates the intrinsic's basic operation; for example, add for addition and sub for subtraction.
<suffix> Denotes the type of data operated on by the instruction. The first one or two letters of each suffix denote whether the data is packed (p), extended packed (ep), or scalar (s). The remaining letters denote the type:
A number appended to a variable name indicates the element of a packed object. For example, r0 is
the lowest word of r. Some intrinsics are "composites" because they require more than one instruction
to implement them.
The packed values are represented in right-to-left order, with the lowest value being used for scalar
operations. Consider the following example operation:
In other words, the xmm register that holds the value t will look as follows:
The "scalar" element is 1.0. Due to the nature of the instruction, some intrinsics require their
arguments to be immediates (constant integer literals).
<type><signedness><bits>vec<elements>
{ F | I } { s | u } { 64 | 32 | 16 | 8 } vec { 8 | 4 | 2 | 1 }
where the fields select the element type (F = floating point, I = integer), the signedness (s = signed, u = unsigned; omitted for floating-point classes), the element size in bits, and the element count. For example, F32vec4 is a vector of four 32-bit floats, and Is16vec8 is a vector of eight signed 16-bit integers.
Related Publications
The following documents provide additional information relevant to the Intel® C++ Compiler:
Most Intel documents are also available from the Intel Corporation Web site at http://www.intel.com.
Convention: Definition
[-]: If an option includes "[-]" as part of its definition, the option can be used to enable or disable the feature. For example, -c99[-] can be used as -c99 (enable C99 support) or -c99- (disable C99 support).
[n]: Indicates that the value n in [] can be omitted or can take various values.
{...|...}: Values in {} separated by vertical bars are the option's versions; for example, -i{2|4|8} has the versions -i2, -i4, and -i8.
{n}: Indicates that the option must include one of the fixed values for n.
Words in this style following an option: Indicate the option's required argument(s). If more than one argument is required, the arguments are separated by commas.
New Options
• Options specific to IA-32 architecture
• Options specific to the Itanium® architecture (Itanium-based systems only)
All other options are supported on both IA-32 and Itanium-based systems.
• 0: Disables inlining.
• 1: Enables (default) inlining of functions declared with the __inline keyword. Also enables inlining according to the C++ language.
• 2: Enables inlining of any function. However, the compiler decides which functions to inline. Enables interprocedural optimizations and has the same effect as -ip.
All other options are supported on both IA-32 and Itanium-based systems.
• 0: Disables inlining.
• 1: Enables (default) inlining of functions declared with the __inline keyword. Also enables inlining according to the C++ language.
• 2: Enables inlining of any function. However, the compiler decides which functions to inline. Enables interprocedural optimizations and has the same effect as -ip.
• min
• med
• max
• -par_report0: No diagnostic information is displayed.
• -par_report1: Indicates loops successfully auto-parallelized (default).
• -par_report2: Indicates loops successfully and unsuccessfully auto-parallelized.
• -par_report3: Same as 2, plus additional information about any proven or assumed dependences inhibiting auto-parallelization.
• n = 0: No diagnostic information.
• n = 1: Indicates vectorized loops (default).
• n = 2: Indicates vectorized and non-vectorized loops.
• n = 3: Indicates vectorized and non-vectorized loops, plus prohibiting data dependence information.
• n = 4: Indicates non-vectorized loops.
• n = 5: Indicates non-vectorized loops and prohibiting data dependence information.
• c: C source file.
• c++: C++ source file.
• c-header: C header file.
• cpp-output: C preprocessed file.
• assembler: Assemblable file.
• assembler-with-cpp: Assemblable file that needs to be preprocessed.
• none: Disable recognition and revert to file extension.
Preprocessing Options
Option Description
-E Directs the preprocessor to expand your source module and write the
result to standard output.
-EP Directs the preprocessor to expand your source module and write the
result to standard output. Does not include #line directives in the
output.
-P Directs the preprocessor to expand your source module and store the
result in a .i file in the current directory.
-Uname Suppresses any automatic definition for the specified macro name.
-sox[-] (IA-32 only): Enables [disables] the saving of compiler options and version information in the executable file. Default is -sox-.
-Zp{1|2|4|8|16}: Specifies the strictest alignment constraint for structure and union types as one of the following: 1, 2, 4, 8, or 16 bytes.
-0f_check (IA-32 only): Avoids the incorrect decoding of certain 0f instructions for code targeted at older processors.
Debugging Options
Option(s) Result
Conformance Options
Option Description
-ansi Enables assumption of the program's ANSI conformance.
-ansi_alias[-] -ansi_alias directs the compiler to assume the following:
If your program satisfies the above conditions, setting the -ansi_alias flag
will help the compiler better optimize the program. However, if your program
does not satisfy one of the above conditions, the -ansi_alias flag may lead
the compiler to generate incorrect code.
-mp Favors conformance to the ANSI C and IEEE 754 standards for floating-point
arithmetic. Behavior for NaN comparisons does not conform.
Optimization-level Options
Option Description
-O1 Enables optimizations. Optimizes for speed. -O1 disables inline expansion of library functions. For the Itanium® compiler, -O1 turns off software pipelining to reduce code size.
Processor Optimizations
Processor Optimization for IA-32 only
The -tpp{5|6|7} options optimize your application's performance for a specific Intel processor. The
resulting binary will also run on the other processors listed in the table below. The Intel® C++ Compiler
includes gcc*-compatible versions of the -tpp options. These options are listed in the gcc* Version
column.
Note
The -tpp7 option is ON by default when you invoke icc or icpc.
Example
The invocations listed below all result in a compiled binary optimized for Pentium 4 and Intel® Xeon(TM)
processors. The same binary will also run on Pentium, Pentium Pro, Pentium II, and Pentium III
processors.
prompt>icc prog.c
The -tpp{1|2} options optimize your application's performance for a specific Intel® Itanium®
processor. The resulting binary will also run on the processors listed in the table below. The Intel® C++
Compiler includes gcc*-compatible versions of the -tpp options. These options are listed in the gcc*
Version column.
Note
Example
The invocations listed below all result in a compiled binary optimized for the Intel Itanium 2 processor.
The same binary will also run on Intel Itanium processors.
prompt>ecc prog.c
Interprocedural Optimizations
Option Description
-ip Enables interprocedural optimizations for single file compilation.
-ip_no_inlining Disables inlining that would result from the -ip interprocedural
optimization, but has no effect on other interprocedural optimizations.
-ipo Enables interprocedural optimizations across files.
-ipo_c Generates a multifile object file that can be used in further link steps.
-ipo_obj Forces the compiler to create real object files when used with -ipo.
-ipo_S Generates a multifile assemblable file named ipo_out.asm that can be
used in further link steps.
-inline_debug_info Preserve the source position of inlined code instead of assigning the call-
site source position to inlined code.
-nolib_inline Disables inline expansion of standard library functions.
Profile-guided Optimizations
Option Description
-prof_gen[x] Instructs the compiler to produce instrumented code in your object files in
preparation for instrumented execution. NOTE: The dynamic information files
are produced in phase 2 when you run the instrumented executable.
• 0: no information
• 1: loops, regions, and sections parallelized (default)
• 2: same as 1, plus master construct, single construct, etc.
-unroll[n] Sets the maximum number (n) of times to unroll loops. Omit n to use default heuristics. Use n=0 to disable loop unrolling. For Itanium®-based applications, -unroll[0] is used only for compatibility.
Optimization Reports
Option Description
-opt_report Generates optimizations report and directs to stderr.
-opt_report_filefilename Specifies the filename for the optimizations report.
-opt_report_level Specifies the detail level of the optimizations report.
{min|med|max} Default: -opt_report_levelmin
-opt_report_phasephase Specifies the optimization to generate the report for. Can be
specified multiple times on the command line for multiple
optimizations.
-opt_report_help Prints to the screen all available phases for
-opt_report_phase.
-opt_report_routinesubstring Generates reports only from routines whose names contain substring. If not specified, reports are generated from all routines.
• M = Intel® Pentium® processors with MMX(TM) technology
• i = Intel Pentium Pro and Intel Pentium II processors
• K = Intel Pentium III processors
• W = Intel Pentium 4 processors, Intel® Xeon(TM) processors, and Intel® Pentium® M processors
-O2 / -O2: Default: ON.
-P / -EP: Preprocess to file. Default: OFF.
-pc32 / -Qpc 32: Set internal FPU precision to 24-bit significand. Default: OFF.
-pc64 / -Qpc 64: Set internal FPU precision to 53-bit significand. Default: OFF.
-pc80 / -Qpc 80: Set internal FPU precision to 64-bit significand. Default: ON.
-prec_div / -Qprec_div: Improve precision of floating-point divides (some speed impact). Default: OFF.
-prof_dir directory / -Qprof_dir directory: Specify directory for profiling output files (*.dyn and *.dpi). Default: OFF.
-prof_file filename / -Qprof_file filename: Specify file name for profiling summary file. Default: OFF.
-prof_gen[x] / -Qprof_genx: Instrument program for profiling; with the x qualifier, extra information is gathered. Default: OFF.
-prof_use / -Qprof_use: Enable use of profiling information during optimization. Default: OFF.
-Qinstall dir / NA: Set dir as root of compiler installation. Default: OFF.
-Qlocation,str,dir / -Qlocation,tool,path: Set dir as the location of the tool specified by str. Default: OFF.
-Qoption,str,opts / -Qoption,tool,list: Pass options opts to the tool specified by str. Default: OFF.
• M = Intel® Pentium® processors with MMX(TM) technology
• i = Intel Pentium Pro and Intel Pentium II processors
• K = Intel Pentium III processors
• W = Intel Pentium 4 processors, Intel® Xeon(TM) processors, and Intel® Pentium® M processors
Note
You can also invoke the compiler with icpc and ecpc for C++ source files on IA-32 and Itanium®-
based systems respectively. The icc and ecc compiler examples in this documentation apply to C
and C++ source files.
To run the iccvars.sh script on IA-32, enter the following on the command line:
prompt>source /opt/intel/compiler70/ia32/bin/iccvars.sh
If you want the iccvars.sh to run automatically when you start Linux*, edit your startup file and add
the same line to the end of your file:
The procedure is similar for running the eccvars.sh shell script on Itanium-based systems.
Syntax Description
file1, file2 . . . Indicates one or more files to be processed by the compilation system.
You can specify more than one file. Use a space as a delimiter for
multiple files.
Note
To use the Intel compiler, your makefile must include the setting CC=icc. Use the same setting on the command line to instruct the makefile to use the Intel compiler. If your makefile is written for gcc, the GNU* C compiler, you will need to change those command-line options not recognized by the Intel compiler.
prompt>make -f my_makefile
All other options are supported on both IA-32 and Itanium-based systems.
Option Description
-c99 Enables C99 support for C programs.
-falias Assumes aliasing in the program.
-ffnalias Assumes aliasing within functions.
-fverbose-asm Produces an assemblable file with compiler comments.
-KPIC, -Kpic Generates position-independent code.
-mcpu=pentium4 Optimizes for the Pentium® 4 processor (IA-32 systems only).
If the compiler does not recognize a command-line option, that option is ignored and a warning is
displayed. See Diagnostic Messages for detailed descriptions about system messages.
Compilation Phases
To produce an executable file, the compiler by default performs the compile and link phases. When invoked, the compiler driver determines which compilation phases to perform based on the file name extension and the compilation options specified on the command line.
The compiler passes object files and any unrecognized file name to the linker. The linker then
determines whether the file is an object file (.o) or a library (.a). The compiler driver handles all types
of input files correctly, thus it can be used to invoke any phase of compilation.
The relationship of the compiler to system-specific programming support tools is presented in the
diagram below:
• Environment Variables: the paths where the compiler and other tools can search for specific files.
• Configuration Files: the options to use with each compilation.
• Response Files: the options and files to use for individual projects.
• Include Files: the names and locations of source header files.
Environment Variables
You can customize your environment by specifying paths where the compiler can search for special
files such as libraries and include files.
To run the iccvars.sh script, enter the following on the command line:
prompt>source /opt/intel/compiler70/ia32/bin/iccvars.sh
If you want the iccvars.sh to run automatically when you start Linux, edit your startup script
(.bash_profile for a bash shell) and add the same line to the end of your file:
Configuration Files
You can decrease the time you spend entering command-line options and ensure consistency by using
the configuration file to automate often-used command-line entries. You can insert any valid
command-line option into the configuration file. The compiler processes options in the configuration file
in the order they appear followed by the command-line options that you specify when you invoke the
compiler.
Note
Options in the configuration file will be executed every time you run the compiler. If you have varying
option requirements for different projects, see Response Files.
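A configuration file is just a plain list of command-line options. A hypothetical sketch (the file name and location depend on your installation, and the options shown are illustrative, not defaults):

```
-I/project/include
-sox-
```

Every compilation then behaves as if those options had been typed at the start of the command line, with any explicitly typed options processed afterward.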
Response Files
Use response files to specify options used during particular compilations, and to save this information
in individual files. Response files are invoked as an option in the command line. Options in a response
file are inserted in the command line at the point where the response file is invoked.
Response files are used to decrease the time spent entering command-line options, and to ensure
consistency by automating command-line entries. Use individual response files to maintain options for
specific projects to avoid editing the configuration file when changing projects.
Any number of options or file names can be placed on a line in the response file. Several response files can be referenced in the same command line. Use the pound character (#) to treat the rest of the line as a comment.
Note
An "at" sign (@) must precede the name of the response file on the command line.
Include Files
Include directories are searched in the default system areas and in any directories specified by the -Idirectory option. For multiple search directories, multiple -Idirectory options must be used. The compiler searches directories for include files in the following order:
For example, to direct the compiler to search the path /alt/include instead of the default path, do
the following:
• Preprocessing
• Compiling
• Linking
• Debugging
Use -Qlocation to specify an alternate path for a tool. This option accepts two arguments using the
following syntax:
-Qlocation,tool,path
tool Description
cpp Specifies the compiler front-end preprocessor.
c Specifies the C++ compiler.
asm Specifies the assembler.
ld Specifies the linker.
Use -Qoption to pass an option specified by optlist to a tool, where optlist is a comma-separated list of options. The syntax for this command is the following:
-Qoption,tool,optlist
tool Description
cpp Specifies the compiler front-end preprocessor.
c Specifies the C++ compiler.
asm Specifies the assembler.
ld Specifies the linker.
optlist Indicates one or more valid argument strings for the designated program. If the argument is
a command-line option, you must include the hyphen. If the argument contains a space or tab
character, you must enclose the entire argument in quotation characters (""). You must separate
multiple arguments with commas. The following example directs the linker to create a memory map
when the compiler produces the executable file from the source.
The -Qoption,link option in the preceding example passes the -map option to the linker. This is an explicit way to pass arguments to other tools in the compilation process. You can also use the -Xlinker val option to pass values (val) to the linker.
Overview: Preprocessing
This section describes the options you can use to direct the operations of the preprocessor.
Preprocessing performs such tasks as macro substitution, conditional compilation, and file inclusion.
Preprocessor Options
Option Description
-Dname[=text] Defines the macro name and associates it with the specified text. The default (-Dname) defines the macro with text = 1.
-E Directs the preprocessor to expand your source module and write the result to
stdout. Output includes #line directives.
-EP Directs the preprocessor to expand your source module and write the result to
standard output. The output does not include #line directives.
-P Directs the preprocessor to expand your source module and store the result in
a .i file in the current directory. Output does not include #line directives.
-Uname Suppresses any automatic definition for the specified macro name.
-X Removes standard directories from the include file search path.
-H Outputs the full path names of all included files to stdout in order.
Indentation is used to designate the #include dependencies.
-M Generate makefile dependency information.
-MD Preprocess and compile. Generate output file (.d extension) containing
dependency information.
-MFfile Generate makefile dependency information in file. Must specify -M or -MM.
-MG Similar to -M, but treats missing header files as generated files.
-MM Similar to -M, but does not include system header files.
-MMD Similar to -MD, but does not include system header files.
-MX Generate dependency file (.o.dep extension) containing information used for
the Intel wb tool.
-dM Output macro definitions in effect after preprocessing (use with -E).
-Idirectory Specifies an additional directory to search for include files.
Preprocessing Only
Use the -E, -P or -EP option to preprocess your source files without compiling them. When using
these options, only the preprocessing phase of compilation is activated.
Using -E
When you specify the -E option, the compiler's preprocessor expands your source module and writes
the result to stdout. The preprocessed source contains #line directives, which the compiler uses to
determine the source file and line number. For example, to preprocess two source files and write them
to stdout, enter the following command:
Using -P
When you specify the -P option, the preprocessor expands your source module and directs the output
to a .i file instead of stdout. Unlike the -E option, the output from -P does not include #line directives. By default, the preprocessor creates the name of the output file using the prefix of
the source file name with a .i extension. You can change this by using the -ofile option. For
example, the following command creates two files named prog1.i and prog2.i, which you can use
as input to another compilation:
Caution
When you use the -P option, any existing files with the same name and extension are overwritten.
Using -EP
Using the -EP option directs the preprocessor to not include #line directives in the output. -EP is
equivalent to -E -P.
Use the -C option to preserve comments in your preprocessed source output. Comments following
preprocessing directives, however, are not preserved.
Using -A
Argument Description
name Indicates an identifier for the assertion
value Indicates a value for the assertion. If a value is specified, it should be quoted, along
with the parentheses delimiting it.
For example, to make an assertion for the identifier fruit with the associated values orange and banana, use the following command:
Using -D
Argument Description
name The name of the macro to define.
value Indicates a value to be substituted for name. If you do not enter a value, name is set to
1. The value should be quoted if it contains non-alphanumerics.
For example, to define a macro called SIZE with the value 100, use the following command:
The -D option can also be used to define functions. For example, icc -D"f(x)=x" prog1.c.
Using -U
Argument Description
name The name of the macro to undefine.
Note
If you use -D and -U in the same compilation, the compiler processes the -D option before -U, rather
than processing them in the order they appear on the command line.
Predefined Macros
Intel-specific predefined macros are described in the table below. The Default column indicates
whether the macro is enabled (ON) or disabled (OFF) by default. The Architecture column indicates
which Intel architecture supports the predefined macro. Predefined macros specified by the ISO/ANSI
standard are not listed in the table. For a list of all macro definitions in effect, use the -E -dM options.
For example:
Predefined Macros
Use the -Uname option to suppress any macro definition currently in effect for the specified name. The
-U option performs the same function as an #undef preprocessor directive.
Overview: Compilation
This section describes the Intel® C++ Compiler options that determine the compilation process and
output. By default, the compiler converts source code directly to an executable file. Appropriate options
allow you to control the process by directing the compiler to produce:
You can also name the output file or designate a set of options that are passed to the linker. If you
specify a phase-limiting option, the compiler produces a separate output file representing the output of
the last phase that completes for each primary input file.
Controlling Compilation
If no errors occur during processing, you can use the output files from a particular phase as input to a
subsequent compiler invocation. The table below describes the options to control the output:
Compile only: -c. Input: source files or preprocessed files. Compiles to object only (.o); does not link.
Use the -Zp option to determine the alignment constraint for structure declarations. Generally, smaller
constraints result in smaller data sections while larger constraints support faster execution.
n=1 1 byte.
n=2 2 bytes.
n=4 4 bytes.
n=8 8 bytes.
n=16 16 bytes.
For example, to specify 2 bytes as the alignment constraint for all structures and unions in the file
prog.c, use the following command:
Note
Changing the alignment may cause problems if you are using system libraries compiled with the
default alignment.
The -ftz switch only needs to be used on the source file containing function main(). The effect of the -ftz switch is to turn on FTZ mode for the process started by main(). The initial thread, and any threads subsequently created by that process, will operate in FTZ mode.
Note
The -O3 option turns -ftz ON. Use -ftz- to disable flushing denormal results to zero.
Linking
This topic describes the options that let you control and customize the linking with tools and libraries
and define the output of the ld linker. See the ld man page for more information on the linker.
Option Description
-Ldirectory Instruct the linker to search directory for libraries.
-Qoption,tool,list Passes an argument list to another program in the compilation sequence,
such as the assembler or linker.
-shared Instructs the compiler to build a Dynamic Shared Object (DSO) instead of
an executable.
-shared-libcxa -shared-libcxa has the opposite effect of -static-libcxa. When it is used, the Intel-provided libcxa C++ library is linked in dynamically, allowing the user to override the static linking behavior when the -static option is used.
-i_dynamic Specifies that all Intel-provided libraries should be linked dynamically.
-static Causes the executable to link all libraries statically, as opposed to
dynamically.
-static-libcxa By default, the Intel-provided libcxa C++ library is linked in dynamically.
Use -static-libcxa on the command line to link libcxa statically,
while still allowing the standard libraries to be linked in by the default
behavior.
-Bstatic This option is placed in the linker command line corresponding to its
location on the user command line. This option is used to control the
linking behavior of any library being passed in via the command line.
z /lib/ld-linux.so.2 is linked in
z libm, libcxa, and libc are linked dynamically
z all other libs are linked statically
-Bdynamic This option is placed in the linker command line corresponding to its
location on the user command line. This option is used to control the
linking behavior of any library being passed in via the command line.
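As a brief illustration of the -shared option from the table above, a DSO build might look like the following (the library and source names are invented for this sketch):

```shell
prompt>icc -shared -o libsample.so sample.c
```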
Suppressing Linking
Use the -c option to suppress linking. For example, entering the following command produces the
object files file1.o and file2.o:
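The command box was not preserved in this copy; given the object files named, it would take a form like the following:

```shell
prompt>icc -c file1.c file2.c
```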
Note
The preceding command does not link these files to produce an executable file.
If you specify -g with -O1, -O2, or -O3, then -fp is disabled, which allows the compiler to use the EBP register as a general purpose register in optimizations. However, most debuggers expect EBP to be used as a stack frame pointer, and cannot produce a stack backtrace unless it is. Using the -fp option can result in slightly less efficient code.
Option(s)   Result
-g -O2      Debugging information produced, -O2 optimizations enabled, -fp disabled for IA-32-targeted compilations.
-g -O3      Debugging information produced, -O3 optimizations enabled, -fp disabled for IA-32-targeted compilations.
-g -O3 -fp  Debugging information produced, -O3 optimizations enabled, -fp enabled for IA-32-targeted compilations.
-ip         Symbols and line numbers produced for debugging.
-ipo        Symbols and line numbers produced for debugging.
The compiler does not support the generation of debugging information in assemblable files. If you
specify the -g option, the resulting object file will contain debugging information, but the assemblable
file will not.
- If you specify the -O1, -O2, or -O3 options with the -g option, some of the debugging information returned may be inaccurate as a side-effect of optimization.
- If you specify the -O1, -O2, or -O3 options, the -fp option (IA-32 only) will be disabled.
The compiler is set by default to accept extensions and not be limited to the ANSI/ISO standard.
The compiler provides predefined macros in addition to the predefined macros required by the
standard.
Macro Description
__cplusplus The name __cplusplus is defined when compiling a C++ translation unit.
__DATE__ The date of compilation as a string literal in the form Mmm dd yyyy.
__FILE__ A string literal representing the name of the file being compiled.
__LINE__ The current line number as a decimal constant.
__STDC__ The name __STDC__ is defined when compiling a C translation unit.
C99 Support
The following C99 features are supported in this version of the Intel C++ Compiler when using the -c99[-] option:
- Restricted pointers (restrict keyword, available with -restrict). See Note below.
- Variable-length arrays
- Flexible array members
- Complex number support (_Complex keyword)
- Hexadecimal floating-point constants
- Compound literals
- Designated initializers
- Mixed declarations and code
- Macros with a variable number of arguments
- Inline functions (inline keyword)
- Boolean type (_Bool keyword)
Note
The -restrict option enables the recognition of the restrict keyword as defined by the ANSI
standard. By qualifying a pointer with the restrict keyword, the user asserts that an object
accessed via the pointer is only accessed via that pointer in the given scope. It is the user’s
responsibility to use the restrict keyword only when this assertion is true. In these cases, the use
of restrict will have no effect on program correctness, but may allow better optimization.
Optimization
- Constant propagation
- Copy propagation
- Dead-code elimination
- Global register allocation
- Instruction scheduling
- Strength reduction/induction variable simplification
- Variable renaming
- Exception handling optimizations
- Tail recursions
- Peephole optimizations
Itanium® Compiler
Option Effect
-O1 Optimizes for code size by turning off software pipelining. Enables the same optimizations
as -O except for loop unrolling and software pipelining. -O and -O2 turn on software
pipelining. Generally, -O or -O2 are recommended over -O1.
IA-32 Compiler
Option Effect
-O, -O1, -O2  Optimize for speed. Disable option -fp. The -O2 option is ON by default. Intrinsic recognition is disabled.
-O3 Enables -O2 option with more aggressive optimization. Optimizes for maximum speed, but
does not guarantee higher performance unless loop and memory access transformation
take place. In conjunction with -axK and -xK options (IA-32 only), this option causes the
compiler to perform more aggressive data dependency analysis than for -O2. This may
result in longer compilation times.
Option Effect
-O2  ON by default. -O2 turns ON intrinsics inlining. Enables the following capabilities for performance gain:
     - Constant propagation
     - Copy propagation
     - Dead-code elimination
     - Global register allocation
     - Global instruction scheduling and control speculation
     - Loop unrolling
     - Optimized code selection
     - Partial redundancy elimination
     - Strength reduction/induction variable simplification
     - Variable renaming
     - Exception handling optimizations
     - Tail recursions
     - Peephole optimizations
     - Structure assignment lowering and optimizations
     - Dead store elimination
-O3 Enables -O2 option with more aggressive optimization, for example, prefetching, scalar
replacement, and loop transformations. Optimizes for maximum speed, but does not
guarantee higher performance unless loop and memory access transformation take place.
To time your application, see Timing Your Application.
Restricting Optimizations
The following options restrict or preclude the compiler's ability to optimize your program:
Option                     Description
-O0                        Disables all optimizations.
-mp1                       Improve floating-point precision. Speed impact is less than with -mp.
-fp                        Disable using the EBP register as a general purpose register. IA-32 only.
-prec_div                  Disables the floating-point division-to-multiplication optimization. IA-32 only.
-fp_port                   Round floating-point results at assignments and casts (some speed impact). IA-32 only.
-ftz[-]                    Enable [disable] flushing denormal results to zero. The -ftz option is OFF by default, but turned ON with -O3. Itanium-based systems only.
-IPF_fma[-]                Enable [disable] the combining of floating-point multiplies and add/subtract operations. Itanium-based systems only.
-IPF_fltacc[-]             Enable [disable] optimizations that affect floating-point accuracy. Itanium-based systems only.
-IPF_flt_eval_method0      Floating-point operands evaluated to the precision indicated by the program. Itanium-based systems only.
-IPF_fp_speculation<mode>  Enable floating-point speculation with one of the following <mode> conditions (Itanium-based systems only):
                           - fast - speculate floating-point operations
                           - safe - speculate only when safe
                           - strict - same as off
                           - off - disables speculation of floating-point operations
Note
You can turn off all optimizations for specific functions by using #pragma optimize. In the following
example, all optimization is turned off for function foo():
Valid second arguments for #pragma optimize are "on" or "off." With the "on" argument, foo()
is compiled with the same optimization as the rest of the program. The compiler ignores first argument
values.
The -mp option restricts optimization to maintain declared precision and to ensure that floating-point
arithmetic conforms more closely to the ANSI and IEEE standards. For most programs, specifying this
option adversely affects performance. If you are not sure whether your application needs this option,
try compiling and running your program both with and without it to evaluate the effects on both
performance and precision. Specifying the -mp option has the following effects on program
compilation:
Note: The -nolib_inline and -mp options are active by default when you choose the -Xc (strict
ANSI C conformance) option.
-mp1 Option
Use the -mp1 option to improve floating-point precision. -mp1 disables fewer optimizations and has
less impact on performance than -mp.
Caution
A change of the default precision control or rounding mode (for example, by using the -pc32 flag or by
user intervention) may affect the results returned by some of the mathematical functions.
-long_double Option
Use -long_double to change the size of the long double type to 80 bits. The Intel compiler's default
long double type is 64 bits in size, the same as the double type. This option introduces a number
of incompatibilities with other files compiled without this option and with calls to library routines.
Therefore, Intel recommends that the use of long double variables be local to a single file when you
compile with this option.
-prec_div Option
With some optimizations, such as -xK and -xW, the Intel® C++ Compiler changes floating-point division computations into multiplication by the reciprocal of the denominator. For example, A/B is computed as A x (1/B) to improve the speed of the computation. However, for values of B greater than 2^126, the value of 1/B is "flushed" (changed) to 0. When it is important to maintain the value of 1/B, use -prec_div to disable the floating-point division-to-multiplication optimization. The result of -prec_div is greater accuracy with some loss of performance.
-pcn Option
Use the -pcn option to enable floating-point significand precision control. Some floating-point algorithms are sensitive to the accuracy of the significand, or fractional part, of the floating-point value. For example, iterative operations like division and finding the square root can run faster if you lower the precision with the -pcn option. Set n to one of the following values to round the significand to the indicated number of bits:
- -pc32: 24 bits (single precision)
- -pc64: 53 bits (double precision)
- -pc80: 64 bits (extended precision)
The default value for n is 80, indicating extended precision. This option allows full optimization. Using this option does not have the negative performance impact of using the -Op option because only the fractional part of the floating-point value is affected; the range of the exponent is not affected. The -pcn option causes the compiler to change the floating-point precision control when the main() function is compiled. The program that uses -pcn must use main() as its entry point, and the file containing main() must be compiled with -pcn.
-rcd Option
The Intel compiler uses the -rcd option to improve the performance of code that requires floating-
point-to-integer conversions. The optimization is obtained by controlling the change of the rounding
mode. The system default floating point rounding mode is round-to-nearest. This means that values
are rounded during floating point calculations. However, the C language requires floating point values
to be truncated when a conversion to an integer is involved. To do this, the compiler must change the
rounding mode to truncation before each floating-point-to-integer conversion and change it back
afterwards. The -rcd option disables the change to truncation of the rounding mode for all floating
point calculations, including floating point-to-integer conversions. Turning on this option can improve
performance, but floating point conversions to integer will not conform to C semantics.
-fp_port Option
The -fp_port option rounds floating-point results at assignments and casts. An impact on speed
may result.
- -ftz[-]
- -IPF_fma[-]
- -IPF_fp_speculation<mode>
- -IPF_flt_eval_method0
- -IPF_fltacc[-] (default: -IPF_fltacc-)
FP Speculation
-IPF_fp_speculation<mode> sets the compiler to speculate on floating-point operations in one of the modes fast, safe, strict, or off, as described under Restricting Optimizations.
FP Operations Evaluation
-IPF_flt_eval_method0 directs the compiler to evaluate the expressions involving floating-point
operands in the precision indicated by the variable types declared in the program.
Processor Optimization
Processor Optimization for IA-32 only
The -tpp{5|6|7} options optimize your application's performance for a specific Intel processor. The
resulting binary will also run on the other processors listed in the table below. The Intel® C++ Compiler
includes gcc*-compatible versions of the -tpp options. These options are listed in the gcc* Version
column.
Note
The -tpp7 option is ON by default when you invoke icc or icpc.
Example
The invocations listed below all result in a compiled binary optimized for Pentium 4 and Intel® Xeon(TM)
processors. The same binary will also run on Pentium, Pentium Pro, Pentium II, and Pentium III
processors.
prompt>icc prog.c
The -tpp{1|2} options optimize your application's performance for a specific Intel® Itanium®
processor. The resulting binary will also run on the processors listed in the table below. The Intel® C++
Compiler includes gcc*-compatible versions of the -tpp options. These options are listed in the gcc*
Version column.
Note
The -tpp2 option is ON by default when you invoke ecc or ecpc.
Example
The invocations listed below all result in a compiled binary optimized for the Intel Itanium 2 processor.
The same binary will also run on Intel Itanium processors.
prompt>ecc prog.c
To execute the program on x86 processors not provided by Intel Corporation, do not specify the -x{M|i|K|W} option.
Example
The invocation below compiles prog.c for processors that support the K set of instructions. The
optimized binary will require a Pentium III, Pentium 4, Intel Xeon processor, or Intel Pentium M
processor to execute correctly. The resulting binary may not execute correctly on a Pentium, Pentium
Pro, Pentium II, Pentium with MMX technology processors, or on x86 processors not provided by Intel
Corporation.
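The command box for this example did not survive in this copy; given the description (the K instruction set, requiring a Pentium III or later), it presumably took this form:

```shell
prompt>icc -xK prog.c
```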
Caution
If a program compiled with -x{M|i|K|W} is executed on a processor that lacks the specified set of
instructions, it can fail with an illegal instruction exception, or display other unexpected behavior.
At run time, one of the two versions is chosen to execute, depending on the processor the program is
currently running on. In this way, the program can benefit from performance gains on more advanced
processors, while still working properly on older processors.
- The size of the compiled binary increases because it contains both a processor-specific version and a generic version of the code.
- Performance is affected by the run-time checks to determine which code to run.
Note
Programs that you compile with this option will execute on any IA-32 processor. Such compilations are,
however, subject to any exclusive specialized code restrictions you impose during compilation with the
-x option.
Example
[Table of -tpp and -x option combinations by target processor (rows B and C show, for example, -tpp5 -xM and -tpp6 -xi), followed by a processor legend; the table header and legend are not recoverable in this copy.]
Example
In this example, -xM restricts the application to running on Pentium processors with MMX(TM)
technology or later processors. If you wanted the program to run on earlier generations of IA-32
processors as well, you would use the following command line:
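The command line referenced here was presumably an automatic-dispatch compilation; the exact extension letters in -axiM are an assumption based on the description that follows:

```shell
prompt>icc -axiM prog.c
```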
This compilation generates optimized code for processors that support both the i and M extensions,
but the compiled program will run on any IA-32 processor.
Interprocedural Optimizations
Use -ip and -ipo to enable interprocedural optimizations (IPO), which allow the compiler to analyze
your code to determine where to apply the optimizations listed in tables that follow.
-ipo  Multifile optimization: affects the same aspects as -ip, but across multiple files.
Inline function expansion is one of the main optimizations performed by the interprocedural optimizer.
For function calls that the compiler believes are frequently executed, the compiler might decide to
replace the instructions of the call with code for the function itself (inline the call).
With -ip, the compiler performs inline function expansion for calls to functions defined within the
current source file. However, when you use -ipo to specify multifile IPO, the compiler performs inline
function expansion for calls to functions defined in separate files. For this reason, it is important to
compile the entire application or multiple, related source files together when you specify -ipo.
Building a program is divided into two phases -- compilation and linkage. Multifile IPO performs
different work depending on whether the compilation, linkage, or both are performed.
Compilation Phase
As each source file is compiled, multifile IPO stores an intermediate representation (IR) of the source
code in the object file, which includes summary information used for optimization.
By default, the compiler produces "mock" object files during the compilation phase of multifile IPO.
Generating mock files instead of real object files reduces the time spent in the multifile IPO compilation
phase. Each mock object file contains the IR for its corresponding source file, but no real code or data.
These mock objects must be linked using the -ipo option and icc, or using the xild tool.
Note
Failure to link "mock" objects with icc -ipo or xild will result in linkage errors. There are situations where mock object files cannot be used. See Compilation with Real Object Files for more information.
Linkage Phase
When you specify -ipo, the compiler is invoked a final time before the linker. The compiler performs
multifile IPO across all object files that have an IR.
Note
The compiler does not support multifile IPO for static libraries (.a files). See Compilation with Real
Object Files for more information.
-ipo enables the driver and compiler to attempt to detect a whole program automatically. If a whole program is detected, interprocedural constant propagation, stack frame alignment, data layout, and padding of common blocks are performed more efficiently, and more dead functions are deleted. This option is safe.
- The objects produced by the compilation phase of -ipo will be placed in a static library without the use of xild or xild -lib. The compiler does not support multifile IPO for static libraries, so all static libraries are passed to the linker. Linking with a static library that contains "mock" object files will result in linkage errors because the objects do not contain real code or data. Specifying -ipo_obj causes the compiler to generate object files that can be used in static libraries.
- Alternatively, if you create the static library using xild or xild -lib, then the resulting static library will work as a normal library.
- The objects produced by the compilation phase of -ipo might be linked without the -ipo option and without the use of xild.
- You want to generate an assemblable file for each source file (using -S) while compiling with -ipo. If you use -ipo with -S, but without -ipo_obj, the compiler issues a warning and an empty assemblable file is produced for each compiled source file.
Multifile IPO is applied only to modules that have an IR, otherwise the object file passes to the link
stage. For efficiency, combine steps 1 and 2:
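The steps above can be sketched as follows (the source names a.c and b.c are illustrative, consistent with the -ipo description elsewhere in this guide):

```shell
prompt>icc -ipo -c a.c b.c   # step 1: compile; objects carry the IR
prompt>icc -ipo a.o b.o      # step 2: multifile IPO and linking
prompt>icc -ipo a.c b.c      # steps 1 and 2 combined
```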
See Using Profile-Guided Optimization: An Example for a description of how to use multifile IPO with
profile information for further optimization.
- Invokes the Intel compiler to perform multifile IPO if objects containing IR are found.
- Invokes the GNU linker, ld, to link the application.
where:
- [<options>] (optional) may include any gcc linker options or options supported only by xild.
- <LINK_commandline> is the linker command line containing a set of valid arguments to ld.
To place the multifile IPO executable in ipo_file, use the option -ofilename, for example:
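The example command did not survive in this copy; it would look something like the following (object names are illustrative):

```shell
prompt>xild -oipo_file a.o b.o
```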
xild calls the Intel compiler to perform IPO for objects containing IR and creates a new list of objects to be linked. Then xild calls ld to link the object files that are specified in the new list and produce the ipo_file executable specified by the -ofilename option.
Note
The -ipo option can reorder object files and linker arguments on the command line. Therefore, if your
program relies on a precise order of arguments on the command line, -ipo can affect the behavior of
your program.
Usage Rules
You must use the Intel linker xild to link your application if:
- Your source files were compiled with multifile IPO enabled. Multifile IPO is enabled by specifying the -ipo command-line option.
- You normally would invoke ld to link your application.
Option              Description
-ipo_o[file.s]      Produces assemblable files for the multifile IPO compilation. You may specify an optional name for the listing file, or a directory (with the backslash) in which to place the file. The default listing name is ipo_out.s.
-ipo_o[file.o]      Produces an object file for the multifile IPO compilation. You may specify an optional name for the object file, or a directory (with the backslash) in which to place the file. The default object file name is ipo_out.o.
-ipo_fcode-asm      Add code bytes to assemblable files.
-ipo_fsource-asm    Add high-level source code to assemblable files.
-ipo_fverbose-asm, -ipo_fnoverbose-asm  Enable and disable, respectively, inserting comments containing version and options used in the assemblable file for xild.
If, however, the objects have been created using -ipo -c, then the object files will contain not valid object code but only the intermediate representation (IR) for that object file. For example:
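A command of this form (source names illustrative, consistent with the objects named below):

```shell
prompt>icc -ipo -c a.c b.c
```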
will produce a.o and b.o files that contain only IR to be used in a link-time compilation. The library manager will not allow these to be inserted in a library.
In this case you must use the Intel library driver xild -ar. This program will invoke the compiler on
the IR saved in the object file and generate a valid object that can be inserted in a library.
Use the -ipo_c option to optimize across files and produce an object file. This option performs
optimizations as described for -ipo, but stops prior to the final link stage, leaving an optimized object
file. The default name for this file is ipo_out.o.
Use the -ipo_S option to optimize across files and produce an assemblable file. This option performs
optimizations as described for -ipo, but stops prior to the final link stage, leaving an optimized
assemblable file. The default name for this file is ipo_out.s.
The syntax is -Qoption,tool,opts, where tool is C++ (c) and opts are -Qoption specifiers (see below).
-Qoption Specifiers
If you specify -ip or -ipo without any -Qoption qualification, the compiler expands functions in line, propagates constant arguments, passes arguments in registers, and monitors module-level static variables.
You can refine interprocedural optimizations by using the following -Qoption specifiers. To have an effect, the -Qoption option must be entered with either -ip or -ipo also specified, as in this example:
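A sketch of such a command (the particular specifier and value are invented for illustration):

```shell
prompt>icc -ip -Qoption,c,-ip_ninl_max_stats=2000 prog.c
```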
Specifier             Description
-ip_args_in_regs=0    Disables the passing of arguments in registers. By default, external functions can pass arguments in registers when called locally. Normally, only static functions can pass arguments in registers, provided the address of the function is not taken and the function does not use a variable number of arguments.
-ip_ninl_max_stats=n  Sets the valid maximum number of intermediate language statements for a function that is expanded in line. The number n is a positive integer. The number of intermediate language statements usually exceeds the actual number of source language statements. The default value for n is 230. The compiler uses a larger limit for user inline functions.
-ip_ninl_min_stats=n  Sets the valid minimum number of intermediate language statements for a function that is expanded in line. The number n is a positive integer. The default value for ip_ninl_min_stats is 15 for Itanium®-based applications and 7 for IA-32 applications.
The following command activates procedural and interprocedural optimizations on source.c and sets
the maximum increase in the number of intermediate language statements to 5 for each function:
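The command box did not survive in this copy; a command consistent with the description would be:

```shell
prompt>icc -ip -Qoption,c,-ip_ninl_max_stats=5 source.c
```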
-ip_no_inlining   This option is only useful if -ip is also specified. In this case, -ip_no_inlining disables inlining that would result from the -ip interprocedural optimizations, but has no effect on other interprocedural optimizations.
-ip_no_pinlining  Disables partial inlining; can be used if -ip or -ipo is also specified.
- The default heuristic focuses on the most frequently executed call sites, based on the profile information gathered for the program.
- By default, the compiler will not inline functions with more than 230 intermediate statements. You can change this value by specifying the option -Qoption,c,-ip_ninl_max_stats=new_value. Note: there is a higher limit for functions declared by the user as inline or __inline.
- The default inline heuristic will stop inlining when direct recursion is detected.
- The default heuristic will always inline very small functions that meet the minimum inline criteria:
  - Default for Itanium®-based applications: ip_ninl_min_stats=15.
  - Default for IA-32 applications: ip_ninl_min_stats=7.
  This limit can be modified with the option -Qoption,c,-ip_ninl_min_stats=new_value.
If you do not use profile-guided optimizations with -ip or -ipo, the compiler uses less aggressive inlining heuristics:
- Inline a function if the inline expansion will not increase the size of the final program.
- Inline a function if it is declared with the inline or __inline keywords.
Instrumented Program
Profile-guided optimization creates an instrumented program from your source code and special code
from the compiler. Each time this instrumented code is executed, the instrumented program generates
a dynamic information file. When you compile a second time, the dynamic information files are merged
into a summary file. Using the profile information in this file, the compiler attempts to optimize the
execution of the most heavily travelled paths in the program.
Unlike other optimizations, such as those used strictly for size or speed, the results of IPO and PGO
vary. This is due to each program having a different profile and different opportunities for optimizations.
The guidelines provided here help you determine if you can benefit by using IPO and PGO.
- Register allocation uses the profile information to optimize the location of spill code.
- For direct function calls, branch prediction is improved by identifying the most likely targets. With the Pentium® 4 processor's longer pipeline, improved branch prediction translates to higher performance gains.
- The compiler detects and does not vectorize loops that execute only a small number of iterations, reducing the run-time overhead that vectorization might otherwise add.
PGO Phases
The PGO methodology requires three phases:
1. Instrumentation compilation and linking with -prof_gen
2. Instrumented execution, which produces dynamic-information (.dyn) files
3. Feedback compilation with -prof_use
A key factor in deciding whether you want to use PGO lies in knowing which sections of your code are
the most heavily used. If the data set provided to your program is very consistent and it elicits a similar
behavior on every execution, then PGO can probably help optimize your program execution. However,
different data sets can elicit different algorithms to be called. This can cause the behavior of your
program to vary from one execution to the next.
In cases where your code behavior differs greatly between executions, PGO may not provide
noticeable benefits. You have to ensure that the benefit of the profile information is worth the effort
required to maintain up-to-date profiles.
-prof_gen[x] Instructs the compiler to produce instrumented code in your object files in
preparation for instrumented execution.
-prof_use Instructs the compiler to produce a profile-optimized executable and merges
available dynamic information (.dyn) files into a pgopti.dpi file.
In the basic profile-guided optimization, the preceding options are used in the phases of the PGO.
Note
The dynamic-information files are produced in Phase 2 when you run the instrumented executable.
If you perform multiple executions of the instrumented program, -prof_use merges the dynamic-
information files again and overwrites the previous pgopti.dpi file.
You can use -fnsplit- to disable function splitting for the following reasons:
- Most importantly, to get improved debugging capability. In the debug symbol table, it is difficult to represent a split routine, that is, a routine with some of its code in the hot code section and some of its code in the cold code section.
- The -fnsplit- option disables the splitting within a routine but enables function grouping, an optimization in which entire routines are placed either in the cold code section or the hot code section. Function grouping does not degrade debugging capability.
- Another reason can arise when the profile data does not represent the actual program behavior, that is, when the routine is actually used frequently rather than infrequently.
Note
For Itanium®-based applications, if you intend to use the -prof_use option with optimizations at the -O3 level, the -O3 option must be on. If you intend to use the -prof_use option with optimizations at the -O2 level or lower, you can generate the profile data with the default options.
IA-32 Systems
Itanium®-based Systems
In place of the second command, you could use the linker directly to produce the instrumented
program.
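The command boxes for the two system types did not survive in this copy; for IA-32 the instrumentation compile and link would look roughly like the following (Itanium-based systems would substitute ecc; file names are illustrative):

```shell
prompt>icc -prof_gen -c prog.c   # phase 1: instrumentation compilation
prompt>icc prog.o                # link, producing an instrumented a.out
```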
Instrumented Execution
Run your instrumented program with a representative set of data to create a dynamic information file.
prompt>./a.out
The resulting dynamic information file has a unique name and .dyn suffix every time you run a.out. The instrumented file helps predict how the program runs with a particular set of data. You can run the program more than once with different input data.
Feedback Compilation
Compile and link the source files with -prof_use to use the dynamic information to optimize your
program according to its profile:
IA-32 Systems:
Itanium®-based Systems:
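A sketch of the feedback-compilation command consistent with the phase-3 options described below (icc for IA-32; ecc on Itanium-based systems):

```shell
prompt>icc -prof_use -O2 -ipo prog.c
```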
Besides the optimization, the compiler produces a pgopti.dpi file. You typically specify the default
optimizations (-O2) for phase 1, and specify more advanced optimizations with -ipo for phase 3. This
example used -O2 in phase 1 and -O2 -ipo in phase 3.
Note
The compiler ignores the -ipo options with -prof_gen[x]. With the x qualifier, extra information is
gathered.
Variable Description
PROF_DIR Specifies the directory in which dynamic information files are created. This
variable applies to all three phases of the profiling process.
PROF_NO_CLOBBER Alters the feedback compilation phase slightly. By default, during the
feedback compilation phase, the compiler merges the data from all dynamic
information files and creates a new pgopti.dpi file if .dyn files are newer
than an existing pgopti.dpi file. When this variable is set, the compiler
does not overwrite the existing pgopti.dpi file. Instead, the compiler issues
a warning and you must remove the pgopti.dpi file if you want to use
additional dynamic information files.
- Using the profile summary file (.dpi) if you move your application sources.
- Sharing the profile summary file with another user who is building identical application sources
that are located in a different directory.
Source Relocation
To enable the movement of application sources, as well as the sharing of profile summary files, use
profmerge with the -src_old and -src_new options. For example:
where:
The above command will read the pgopti.dpi file. For each function represented in the
pgopti.dpi file, whose source path begins with the <p2> prefix, profmerge replaces that prefix
with <p3>. The pgopti.dpi file is updated with the new source path information.
Notes
- You can execute profmerge more than once on a given pgopti.dpi file. You may need to
do this if the source files are located in multiple directories. For example:
- In the values specified for -src_old and -src_new, uppercase and lowercase characters are
treated as identical. Likewise, forward slash (/) and backward slash (\) characters are treated
as identical.
- Because the source relocation feature of profmerge modifies the pgopti.dpi file, you may
wish to make a backup copy of the file prior to performing the source relocation.
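A sketch of a relocation command (the directory names /src/app and /home/user/app stand in for the old and new source prefixes and are hypothetical):

prompt>profmerge -src_old /src/app -src_new /home/user/app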
This section includes descriptions of the functions and environment variable that comprise Profile
Information Generation Support. The functions are available by inserting #include <pgouser.h>
at the top of any source file where the functions may be used.
The compiler sets a define for _PGO_INSTRUMENT when you compile with either -prof_gen or
-prof_genx.
Description
This function dumps the profile information collected by the instrumented application. The profile
information is recorded in a .dyn file.
Recommended Usage
Insert a single call to this function in the body of the function which terminates your application.
Normally, _PGOPTI_Prof_Dump should be called just once. It is also possible to use this function in
conjunction with _PGOPTI_Prof_Reset() to generate multiple .dyn files (presumably from multiple
sets of input data).
Example
// Selectively collect profile information for the portion
// of the application involved in processing input data.
input_data = get_input_data();
while(input_data)
{
_PGOPTI_Prof_Reset();
process_data(input_data);
_PGOPTI_Prof_Dump();
input_data = get_input_data();
}
Description
Recommended Usage
Use this function to clear the profile counters prior to collecting profile information on a section of the
instrumented application. See the example under _PGOPTI_Prof_Dump().
Description
This function may be called more than once. Each call will dump the profile information to a new .dyn
file. The dynamic profile counters are then reset, and execution of the instrumented application
continues.
Recommended Usage
Periodic calls to this function allow a non-terminating application to generate one or more profile
information files. These files are merged during the feedback phase of profile-guided optimization.
The direct use of this function allows your application to control precisely when the profile information
is generated.
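As a sketch of this usage in a long-running service loop: the function names serve and handle_request are hypothetical, and the call is guarded by the _PGO_INSTRUMENT define mentioned earlier so the same source also builds without instrumentation.

```c
#ifdef _PGO_INSTRUMENT
#include <pgouser.h>
#endif

/* Hypothetical request handler, for illustration only. */
static int handle_request(int id) { return id * 2; }

/* Service loop (bounded here for illustration): when built with
   -prof_gen, dump and reset the profile counters every 1000 requests
   so a non-terminating server still produces .dyn files. */
int serve(int nrequests)
{
    int checksum = 0;
    for (int i = 0; i < nrequests; i++) {
        checksum += handle_request(i);
#ifdef _PGO_INSTRUMENT
        if ((i + 1) % 1000 == 0)
            _PGOPTI_Prof_Dump_And_Reset(); /* writes a new .dyn file */
#endif
    }
    return checksum;
}
```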
Description
This function activates Interval Profile Dumping and sets the approximate frequency at which dumps
will occur. The interval parameter is measured in milliseconds and specifies the time interval at
which profile dumping will occur. For example, if interval is set to 5000, then a profile dump and
reset will occur approximately every 5 seconds. The interval is approximate because the time check
controlling the dump and reset is only performed upon entry to any instrumented function in your
application.
Note
- Setting interval to zero or a negative number will disable interval profile dumping.
- Setting interval to a very small value may cause the instrumented application to spend
nearly all of its time dumping profile information. Be sure to set interval to a large enough
value so that the application can perform actual work and collect substantial profile information.
Recommended Usage
Call this function at the start of a non-terminating application to initiate Interval Profile Dumping. Note
that an alternative method of initiating Interval Profile Dumping is by setting the environment variable,
PROF_DUMP_INTERVAL, to the desired interval value prior to starting the application. The intention
of Interval Profile Dumping is to allow a non-terminating application to be profiled with minimal changes
to the application source code.
Environment Variable
PROF_DUMP_INTERVAL
This environment variable may be used to initiate Interval Profile Dumping in an instrumented
application. See the Recommended Usage of _PGOPTI_Set_Interval_Prof_Dump for more
information.
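For example, to request a dump approximately every 5 seconds without modifying the source (assuming a bash-style shell):

prompt>export PROF_DUMP_INTERVAL=5000
prompt>./a.out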
HLO Overview
High-level optimizations (HLO) exploit the properties of source code constructs, such as loops and
arrays, in the applications developed in high-level programming languages, such as C++. They include
loop interchange, loop fusion, loop unrolling, loop distribution, unroll-and-jam, blocking, data prefetch,
scalar replacement, data layout optimizations, and others. The option that turns on the high-level
optimizations is -O3.
IA-32 applications
-O3 In addition, in conjunction with the vectorization options, -ax{M|K|W} and -x{M|K|W}, -O3
causes the compiler to perform more aggressive data dependency analysis than for -O2. This
may result in longer compilation times.
Loop Transformations
All these transformations are driven by data-dependence analysis. Supporting techniques include
induction-variable elimination, constant propagation, copy propagation, forward substitution, and
dead-code elimination. The loop transformation techniques include:
- Loop normalization
- Loop reversal
- Loop interchange and permutation
- Loop skewing
- Loop distribution
- Loop fusion
- Scalar replacement
In addition to the loop transformations listed for both IA-32 and Itanium® architectures above, the
Itanium architecture allows collapsing techniques.
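To illustrate one of these transformations, here is a loop interchange hand-applied to a copy loop. The compiler may perform this automatically at -O3; the function name and array size are illustrative only.

```c
#define N 64

/* With the loops in j-outer/i-inner order, the inner loop walks a
   column of a row-major C array (stride N). Interchanging to
   i-outer/j-inner makes the inner loop stride-1, improving locality;
   this is the form a loop-interchange optimization would produce. */
void copy_interchanged(double a[N][N], double b[N][N])
{
    for (int i = 0; i < N; i++)      /* was: for (j) { for (i) ... } */
        for (int j = 0; j < N; j++)
            a[i][j] = b[i][j];
}
```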
Loop Unrolling
You can unroll loops and specify the maximum number of times you want the compiler to do so.
When specifying high values to unroll loops, be aware that your application may exhaust certain
resources, such as registers, that can slow program performance. You should consider timing your
application (see Timing Your Application) if you specify high values to unroll loops.
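To show what unrolling does, here is a simple accumulation loop unrolled four ways by hand (the compiler performs this itself when unrolling is requested; the function name is hypothetical, and n is assumed divisible by 4).

```c
/* Four-way unrolled accumulation: fewer branch tests per element,
   and four independent partial sums that can execute in parallel.
   Each partial sum occupies a register, which is why aggressive
   unrolling can exhaust registers and hurt performance. */
double sum4(const double *x, int n)  /* n assumed divisible by 4 */
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```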
Example
#pragma ivdep
for(i=1; i<n; i++)
{
e[ix[2][i]]=e[ix[2][i]]+1.0;
e[ix[3][i]]=e[ix[3][i]]+2.0;
}
The following example shows that using this option and the IVDEP directive ensures there is no
loop-carried dependency for the store into a[].
Example
#pragma ivdep
Option Description
-openmp Enables the parallelizer to generate multithreaded code based on the
OpenMP directives. Default: OFF.
-openmp_report{0|1|2} Controls the OpenMP parallelizer's diagnostic levels. Default:
-openmp_report1.
-openmp_stubs Enables compilation of OpenMP programs in sequential mode. The
OpenMP directives are ignored and a stub OpenMP library is linked.
Default: OFF.
-parallel Enables the auto-parallelizer to generate multithreaded code for loops
that can be safely executed in parallel. Default: OFF.
-par_threshold{n} Sets a threshold for the auto-parallelization of loops based on the
probability of profitable execution of the loop in parallel, n=0 to 100.
n=0 implies "always." Default: -par_threshold75.
-par_report{0|1|2|3} Controls the auto-parallelizer's diagnostic levels.
Default: -par_report1
Note
When both -openmp and -parallel are specified on the command line, the -parallel option is
honored only in routines that do not contain OpenMP directives. For routines that contain OpenMP
directives, only the -openmp option is honored.
- Relieves the user from having to deal with the low-level details of iteration space partitioning,
data sharing, and thread scheduling and synchronization.
- Provides the benefit of the performance available from shared-memory multiprocessor
systems.
The Intel C++ Compiler performs transformations to generate multithreaded code based on the user's
placement of OpenMP directives in the source program, making it easy to add threading to existing
software. The Intel compiler supports all of the current industry-standard OpenMP directives, except
WORKSHARE, and compiles parallel programs annotated with OpenMP directives. In addition, the Intel
C++ Compiler provides Intel-specific extensions to the OpenMP C++ version 2.0 specification including
run-time library routines and environment variables.
Note
As with many advanced features of compilers, you must properly understand the functionality of the
OpenMP directives in order to use them effectively and avoid unwanted program behavior.
See parallelization options summary for all of the options of the OpenMP feature in the Intel C++
Compiler.
For complete information on the OpenMP standard, visit the OpenMP Web site at
http://www.openmp.org. For OpenMP* C++ version 2.0 API specifications, see
http://www.openmp.org/specs/.
Performance Analysis
For performance analysis of your program, you can use the Intel® VTune™ Performance Analyzer to
show performance information. You can obtain detailed information about which portions of the code
require the largest amount of time to execute and where parallel performance problems are located.
In the OpenMP C++ API, the #pragma omp parallel directive defines the parallel construct. When
the master thread encounters a parallel construct, it creates a team of threads, with the master thread
becoming the master of the team. The program statements enclosed by the parallel construct are
executed in parallel by each thread in the team. These statements include routines called from within
the enclosed statements.
The statements enclosed lexically within a construct define the static extent of the construct. The
dynamic extent includes the static extent as well as the routines called from within the construct.
When the #pragma omp parallel directive reaches completion, the threads in the team
synchronize, the team is dissolved, and only the master thread continues execution. The other threads
in the team enter a wait state. You can specify any number of parallel constructs in a single program.
As a result, thread teams can be created and dissolved many times during program execution.
In routines called from within parallel constructs, you can also use directives. Directives that are not in
the lexical extent of the parallel construct, but are in the dynamic extent, are called orphaned
directives. Orphaned directives allow you to execute major portions of your program in parallel with
only minimal changes to the sequential version of the program. Using this functionality, you can code
parallel constructs at the top levels of your program and use directives to control execution in any of
the called routines. For example:
int main(void)
{
...
#pragma omp parallel
{
phase1();
}
}
void phase1(void)
{
...
#pragma omp for private(i) shared(n)
for (i = 0; i < n; i++)
some_work(i);
}
The #pragma omp for in phase1() is an orphaned directive because its enclosing parallel region is not
lexically present.
A data environment directive controls the data environment during the execution of parallel constructs.
You can control the data environment within parallel and worksharing constructs. Using directives and
data environment clauses on directives, you can:
You can use several directive clauses to control the data scope attributes of variables for the duration
of the construct in which you specify them. If you do not specify a data scope attribute clause on a
directive, the default is shared for those variables affected by the directive.
//
#pragma omp barrier // Wait for all team members to arrive
... // More Replicated Code
//
} // End of Parallel Construct;
// disband team and continue
// serial execution
//
... // Possibly more Parallel constructs
//
} // End serial execution
Before you run the multithreaded code, you can set the number of desired threads in the OpenMP
environment variable, OMP_NUM_THREADS. See OpenMP Environment Variables for further
information.
-openmp Option
The -openmp option enables the parallelizer to generate multithreaded code based on the OpenMP
directives. The code can be executed in parallel on both uniprocessor and multiprocessor systems.
The -openmp option works with both -O0 (no optimization) and any optimization level of -O1, -O2
(default) and -O3. Specifying -O0 with -openmp helps to debug OpenMP applications.
where:
OpenMP Diagnostics
The -openmp_report{0|1|2} option controls the OpenMP parallelizer's diagnostic levels 0, 1, or 2
as follows:
Note
OpenMP Clauses
Clause Description
private Declares variables to be private to each thread in a team.
firstprivate Provides a superset of the functionality provided by the private clause.
lastprivate Provides a superset of the functionality provided by the private clause.
shared Shares variables among all the threads in a team.
default Enables you to affect the data-scope attributes of variables.
reduction Performs a reduction on scalar variables.
ordered The structured block following an ordered directive is executed in the order in
which iterations would be executed in a sequential loop.
if If the if(scalar_logical_expression) clause is present, the enclosed
code block is executed in parallel only if the scalar_logical_expression
evaluates to TRUE. Otherwise the code block is serialized.
schedule Specifies how iterations of the for loop are divided among the threads of the
team.
copyin Provides a mechanism to assign the same name to threadprivate variables
for each thread in the team executing the parallel region.
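A minimal sketch combining several of the clauses above on a parallel for loop (the function name dot is hypothetical):

```c
/* reduction(+:sum) gives each thread a private partial sum that is
   combined at the end; schedule(static) divides iterations evenly;
   if(n > 100) runs the loop serially when the work is too small to
   be worth the threading overhead. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum) schedule(static) if(n > 100)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```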
Execution modes
The Intel compiler with OpenMP enables you to run an application under different execution modes
that can be specified at run time. The libraries support the serial, turnaround, and throughput modes.
These modes are selected by using the KMP_LIBRARY environment variable at run time.
Serial
Turnaround
In a dedicated (batch or single user) parallel environment where all processors are exclusively
allocated to the program for its entire run, it is most important to effectively utilize all of the processors
all of the time. The turnaround mode is designed to keep active all of the processors involved in the
parallel computation in order to minimize the execution time of a single job. In this mode, the worker
threads actively wait for more parallel work, without yielding to other threads.
Note
Avoid over-allocating system resources. This occurs if either too many threads have been specified, or
if too few processors are available at run time. If system resources are over-allocated, this mode will
cause poor performance. The throughput mode should be used instead if this occurs.
Throughput
In a multi-user environment where the load on the parallel machine is not constant or where the job
stream is not predictable, it may be better to design and tune for throughput. This minimizes the total
time to run multiple jobs simultaneously. In this mode, the worker threads will yield to other threads
while waiting for more parallel work.
The throughput mode is designed to make the program aware of its environment (that is, the system
load) and to adjust its resource usage to produce efficient execution in a dynamic environment.
Throughput mode is the default.
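For example, to select the turnaround mode at run time (assuming a bash-style shell):

prompt>export KMP_LIBRARY=turnaround
prompt>./a.out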
The following table specifies the interfaces to these routines. The names for the routines are in user
name space. The omp.h and omp_lib.h header files are provided in the INCLUDE directory of your
compiler installation.
There are definitions for two different lock types, omp_lock_t and omp_nest_lock_t, which are
used by the functions in the table that follows:
Function Description
Lock Routines
omp_init_lock(lock) Initializes the lock associated with lock for use in
subsequent calls.
omp_destroy_lock(lock) Causes the lock associated with lock to become
undefined.
omp_set_lock(lock) Forces the executing thread to wait until the lock
associated with lock is available. The thread is
granted ownership of the lock when it becomes
available.
omp_unset_lock(lock) Releases the executing thread from ownership of the
lock associated with lock. The behavior is undefined
if the executing thread does not own the lock
associated with lock.
omp_test_lock(lock) Attempts to set the lock associated with lock. If
successful, returns TRUE, otherwise returns FALSE.
omp_init_nest_lock(lock) Initializes the nested lock associated with lock for use
in the subsequent calls.
omp_destroy_nest_lock(lock) Causes the nested lock associated with lock to
become undefined.
omp_set_nest_lock(lock) Forces the executing thread to wait until the nested
lock associated with lock is available. The thread is
granted ownership of the nested lock when it becomes
available.
omp_unset_nest_lock(lock) Releases the executing thread from ownership of the
nested lock associated with lock if the nesting count
is zero. Behavior is undefined if the executing thread
does not own the nested lock associated with lock.
omp_test_nest_lock(lock) Attempts to set the nested lock associated with lock.
If successful, returns the nesting count, otherwise
returns zero.
Timing Routines
omp_get_wtime() Returns a double-precision value equal to the elapsed
wallclock time (in seconds) relative to an arbitrary
reference time. The reference time does not change
during program execution.
omp_get_wtick() Returns a double-precision value equal to the number
of seconds between successive clock ticks.
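A minimal sketch of the simple lock routines from the table, guarded by the standard _OPENMP define so the code also compiles without OpenMP support (the function name is hypothetical):

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Increment a shared counter inside a parallel loop, serializing the
   update with an OpenMP lock when OpenMP is enabled; without OpenMP
   the pragma is ignored and the loop runs sequentially. */
int count_serialized(int n)
{
#ifdef _OPENMP
    omp_lock_t lock;
    omp_init_lock(&lock);          /* initialize before first use  */
#endif
    int count = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
#ifdef _OPENMP
        omp_set_lock(&lock);       /* wait until the lock is free  */
#endif
        count++;                   /* protected critical update    */
#ifdef _OPENMP
        omp_unset_lock(&lock);     /* release ownership            */
#endif
    }
#ifdef _OPENMP
    omp_destroy_lock(&lock);       /* lock becomes undefined       */
#endif
    return count;
}
```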
Intel Extensions
The Intel® C++ Compiler implements the following groups of functions as extensions to the OpenMP*
run-time library:
The Intel extensions described in this section can be used for low-level debugging to verify that the
library code and application are functioning as intended. It is recommended to use these functions with
caution because using them requires the use of the -openmp_stubs command-line option to
execute the program sequentially. These functions are also generally not recognized by other
vendors' OpenMP-compliant compilers, which may cause the link stage to fail for those compilers.
Note
Stack Size
In most cases, directives can be used in place of extensions. For example, the stack size of the
parallel threads may be set using the KMP_STACKSIZE environment variable rather than the
kmp_set_stacksize_s() function.
Note
A run-time call to an Intel extension takes precedence over the corresponding environment variable
setting. See the definitions of stack size functions in the Stack Size table below.
Memory Allocation
The Intel® C++ Compiler implements a group of memory allocation functions as extensions to the
OpenMP run-time library to enable threads to allocate memory from a heap local to each thread.
These functions are kmp_malloc(), kmp_calloc(), and kmp_realloc(). The memory allocated
by these functions must also be freed with the kmp_free() function. While it is legal for memory to be
allocated by one thread and freed by a different thread, this mode of operation has a slight
performance penalty. See the definitions of these functions in the Memory Allocation table below.
Stack Size
Function Description
kmp_get_stacksize_s() Returns the number of bytes that will be allocated for each
parallel thread to use as its private stack. This value can be
changed with kmp_set_stacksize_s() prior to the first
parallel region or with the KMP_STACKSIZE environment
variable.
kmp_get_stacksize() This function is provided for backwards compatibility only. Use
kmp_get_stacksize_s() for compatibility across different
families of Intel processors.
kmp_set_stacksize_s(size) Sets to size the number of bytes that will be allocated for each
parallel thread to use as its private stack. This value can also be
set via the KMP_STACKSIZE environment variable. In order for
kmp_set_stacksize_s() to have an effect, it must be called
before the beginning of the first (dynamically executed) parallel
region in the program.
kmp_set_stacksize(size) This function is provided for backward compatibility only; use
kmp_set_stacksize_s() for compatibility across different
families of Intel processors.
Memory Allocation
Function Description
kmp_malloc(size) Allocate memory block of size bytes from thread-local heap.
kmp_calloc(nelem, elsize) Allocate array of nelem elements of size elsize from thread-
local heap.
kmp_realloc(ptr, size) Reallocate memory block at address ptr and size bytes from
thread-local heap.
kmp_free(ptr) Free memory block at address ptr from thread-local heap.
Memory must have been previously allocated with kmp_malloc(), kmp_calloc(), or kmp_realloc().
Workqueuing Constructs
taskq Pragma
The taskq pragma specifies the environment within which the enclosed units of work (tasks) are to be
executed. From among all the threads that encounter a taskq pragma, one is chosen to execute it
initially. Conceptually, the taskq pragma causes an empty queue to be created by the chosen thread,
and then the code inside the taskq block is executed single-threaded. All the other threads wait for
work to be enqueued on the conceptual queue. The task pragma specifies a unit of work, potentially
executed by a different thread. When a task pragma is encountered lexically within a taskq block,
the code inside the task block is conceptually enqueued on the queue associated with the taskq.
The conceptual queue is disbanded when all work enqueued on it finishes, and when the end of the
taskq block is reached.
Control Structures
Many control structures exhibit the pattern of separated work iteration and work creation, and are
naturally parallelized with the workqueuing model. Some common cases are:
- while loops
- C++ iterators
- Recursive functions
while Loops
If the computation in each iteration of a while loop is independent, the entire loop becomes the
environment for the taskq pragma, and the statements in the body of the while loop become the
units of work to be specified with the task pragma. The conditional in the while loop and any
modifications to the control variables are placed outside of the task blocks and executed sequentially
to enforce the data dependencies on the control variables.
C++ Iterators
C++ Standard Template Library (STL) iterators are very much like the while loops just described,
whereby the operations on the data stored in the STL are very distinct from the act of iterating over all
the data. If the operations are data-independent, they can be done in parallel as long as the iteration
over the work is sequential. This type of while loop parallelism is a generalization of the standard
OpenMP* worksharing for loops. In the worksharing for loops, the loop increment operation is the
iterator and the body of the loop is the unit of work. However, because the for loop iteration variable
frequently has a closed form solution, it can be computed in parallel and the sequential step avoided.
Recursive Functions
Recursive functions also can be used to specify parallel iteration spaces. The mechanism is similar to
specifying parallelism using the sections pragma, but is much more flexible because it allows
arbitrary code to sit between the taskq and the task pragmas, and because it allows recursive
nesting of the function to build a conceptual tree of taskq queues. The recursive nesting of the
taskq pragmas is a conceptual extension of OpenMP worksharing constructs to behave more like
nested OpenMP parallel regions. Just like nested parallel regions, each nested workqueuing construct
is a new instance and is encountered by exactly one thread. However, the major difference is that
nested workqueuing constructs do not cause new threads or teams to be formed, but rather re-use the
threads from the team. This permits very easy multi-algorithmic parallelism in dynamic environments,
such that the number of threads need not be committed at each level of parallelism, but instead only at
the top level. From that point on, if a large amount of work suddenly appears at an inner level, the idle
threads from the outer level can assist in getting that work finished. For example, it is very common in
server environments to dedicate a thread to handle each incoming request, with a large number of
threads awaiting incoming requests. For a particular request, its size may not be obvious at the time
the thread begins handling it. If the thread uses nested workqueuing constructs, and the scope of the
request becomes large after the inner construct is started, the threads from the outer construct can
easily migrate to the inner construct to help finish the request.
The syntax, semantics, and allowed clauses are designed to resemble OpenMP* worksharing
constructs. Most of the clauses allowed on OpenMP worksharing constructs have a reasonable
meaning when applied to the workqueuing pragmas.
taskq Construct
- private (variable-list)
- firstprivate (variable-list)
- lastprivate (variable-list)
- reduction (operator : variable-list)
- ordered
- nowait
private
The private clause creates a private, default-constructed version for each object in variable-
list for the taskq. It also implies captureprivate on each enclosed task. The original object
referenced by each variable has an indeterminate value upon entry to the construct, must not be
modified within the dynamic extent of the construct, and has an indeterminate value upon exit from the
construct.
firstprivate
The firstprivate clause creates a private, copy-constructed version for each object in variable-
list for the taskq. It also implies captureprivate on each enclosed task. The original object
referenced by each variable must not be modified within the dynamic extent of the construct and has
an indeterminate value upon exit from the construct.
lastprivate
The lastprivate clause creates a private, default-constructed version for each object in
variable-list for the taskq. It also implies captureprivate on each enclosed task. The
original object referenced by each variable has an indeterminate value upon entry to the construct,
must not be modified within the dynamic extent of the construct, and is copy-assigned the value of the
object from the last enclosed task after that task completes execution.
reduction
The reduction clause performs a reduction operation with the given operator in enclosed task
constructs for each object in variable-list. operator and variable-list are defined the
same as in the OpenMP Specifications.
ordered
The ordered clause performs ordered constructs in enclosed task constructs in the original sequential
execution order. The taskq directive to which the ordered construct is bound must have an ordered
clause present.
nowait
The nowait clause removes the implied barrier at the end of the taskq. Threads may exit the taskq
construct before completing all the task constructs queued within it.
task Construct
- private( variable-list )
- captureprivate( variable-list )
private
The private clause creates a private, default-constructed version for each object in variable-
list for the task. The original object referenced by the variable has an indeterminate value upon
entry to the construct, must not be modified within the dynamic extent of the construct, and has an
indeterminate value upon exit from the construct.
captureprivate
The captureprivate clause creates a private, copy-constructed version for each object in
variable-list for the task at the time the task is enqueued. The original object referenced by
each variable retains its value but must not be modified within the dynamic extent of the task
construct.
- if(scalar-expression)
- num_threads(integer-expression)
- copyin(variable-list)
- default(shared | none)
- shared(variable-list)
- private(variable-list)
- firstprivate(variable-list)
- lastprivate(variable-list)
- reduction(operator : variable-list)
- ordered
Clause descriptions are the same as for the OpenMP parallel construct or the taskq construct
above as appropriate.
Example Function
The test1 function below is a natural candidate to be parallelized using the workqueuing model. You
can express the parallelism by annotating the loop with a parallel taskq pragma and the work in the
loop body with a task pragma. The parallel taskq pragma specifies an environment for the while
loop in which to enqueue the units of work specified by the enclosed task pragma. Thus, the loop’s
control structure and the enqueuing are executed single-threaded, while the other threads in the team
participate in dequeuing the work from the taskq queue and executing it. The captureprivate
clause ensures that a private copy of the link pointer p is captured at the time each task is being
enqueued, hence preserving the sequential semantics.
void test1(LIST p)
{
#pragma intel omp parallel taskq shared(p)
{
while (p != NULL)
{
#pragma intel omp task captureprivate(p)
{
do_work1(p);
}
p = p->next;
}
}
}
Overview: Auto-parallelization
The auto-parallelization feature of the Intel® C++ Compiler automatically translates serial portions of
the input program into equivalent multithreaded code. The auto-parallelizer analyzes the dataflow of
the program’s loops and generates multithreaded code for those loops which can be safely and
efficiently executed in parallel. This enables the potential exploitation of the parallel architecture found
in symmetric multiprocessor (SMP) systems.
- Having to deal with the details of finding loops that are good worksharing candidates
- Performing the dataflow analysis to verify correct parallel execution
- Partitioning the data for threaded code generation as is needed in programming with OpenMP
directives.
The parallel run-time support provides the same run-time features found in OpenMP*, such as
handling the details of loop iteration modification, thread scheduling, and synchronization.
While OpenMP directives enable serial applications to transform into parallel applications quickly, the
programmer must explicitly identify specific portions of the application code that contain parallelism
and add the appropriate compiler directives. Auto-parallelization triggered by the -parallel option
automatically identifies those loop structures which contain parallelism. During compilation, the
compiler automatically attempts to decompose the code sequences into separate threads for parallel
processing. No other effort by the programmer is needed.
The following example illustrates how a loop’s iteration space can be divided so that it can be executed
concurrently on two threads:
Thread 1
for (i=1; i<50; i++)
{
a[i] = a[i] + b[i] * c[i];
}
Thread 2
for (i=50; i<100; i++)
{
a[i] = a[i] + b[i] * c[i];
}
- The loop is countable at compile time. This means that an expression representing how many
times the loop will execute (also called the "loop trip count") can be generated just before
entering the loop.
- There are no FLOW (READ after WRITE), OUTPUT (WRITE after WRITE), or ANTI (WRITE after
READ) loop-carried data dependences. A loop-carried data dependence occurs when the same
memory location is referenced in different iterations of the loop. At the compiler's discretion, a
loop may be parallelized if any assumed inhibiting loop-carried dependencies can be resolved
by run-time dependency testing.
The compiler may generate a run-time test for the profitability of executing in parallel a loop whose
loop parameters are not compile-time constants.
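For example, the following loop carries a FLOW (read-after-write) dependence, since each iteration reads the value the previous iteration wrote, so it cannot be auto-parallelized as written (the function name is hypothetical):

```c
/* a[i] reads a[i-1], which was written in the previous iteration:
   a loop-carried FLOW dependence that inhibits parallelization. */
void prefix_sum(double *a, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}
```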
Coding Guidelines
Enhance the power and effectiveness of the auto-parallelizer by following these coding guidelines:
- Expose the trip count of loops whenever possible. Specifically, use constants where the trip
count is known and save loop parameters in local variables.
- Avoid placing structures inside loop bodies that the compiler may assume to carry dependent
data, for example, function calls, ambiguous indirect references, or global references.
The auto-parallelizer performs the following analysis and transformation phases:
- Data flow analysis: compute the flow of data through the program.
- Loop classification: determine loop candidates for parallelization based on correctness and efficiency, as shown by threshold analysis.
- Dependence analysis: compute the dependence analysis for references in each loop nest.
- High-level parallelization:
  - analyze the dependence graph to determine loops that can execute in parallel
  - compute run-time dependence
- Data partitioning: examine data references and partition them based on the following types of access: shared, private, and firstprivate.
- Multi-threaded code generation:
  - modify loop parameters
  - generate entry/exit per threaded task
  - generate calls to parallel run-time routines for thread creation and synchronization
Auto-parallelization Options
The -parallel option enables the auto-parallelizer if the -O2 (or -O3) optimization option is also on
(the default is -O2). The -parallel option detects parallel loops capable of being executed safely in
parallel and automatically generates multithreaded code for these loops.
Option Description
-parallel Enables the auto-parallelizer
-par_threshold{n} Controls the work threshold needed for auto-parallelization; see the subsection below.
-par_report{1|2|3} Controls the diagnostic messages from the auto-parallelizer; see the subsection below.
- Default: -par_threshold is not specified on the command line, which is the same as specifying -par_threshold0. Loops get auto-parallelized regardless of the computational work volume; that is, always parallelize.
- -par_threshold100: loops get auto-parallelized only if profitable parallel execution is almost certain.
- The intermediate values 1 to 99 represent the percentage probability of a profitable speed-up. For example, n=50 means: parallelize only if there is a 50% probability of the code speeding up if executed in parallel.
- The default value of n is n=75 (-par_threshold75). When -par_threshold is used on the command line without a number, the value 75 is passed.
The compiler applies a heuristic that tries to balance the overhead of creating multiple threads versus
the amount of work available to be shared amongst the threads.
Diagnostics
The -par_report{0|1|2|3} option controls the auto-parallelizer's diagnostic levels 0, 1, 2, or 3 as
follows:
Sample Output
program prog
procedure: prog
serial loop: line 5: not a parallel candidate due to
statement at line 6
serial loop: line 9
flow data dependence from line 10 to line 10, due to "a"
12 Lines Compiled
Sample prog.c
/* Assumed side effects */
Troubleshooting Tips
- Use -par_threshold0 to see if the compiler assumed there was not enough computational work.
- Use -par_report3 to view diagnostics.
- Use -ipo to eliminate side effects the compiler assumes for function calls.
Overview: Vectorization
The vectorizer is a component of the Intel® C++ Compiler that automatically uses SIMD instructions in
the MMX(TM), SSE, and SSE2 instruction sets. The vectorizer detects operations in the program that
can be done in parallel, and then converts the sequential program to process 2, 4, 8, or 16 elements in
one operation, depending on the data type.
This section provides guidelines, option descriptions, and examples for the Intel C++ Compiler
vectorization on IA-32 systems only. The following list summarizes this section's contents.
Vectorizer Options
Option Description
-ax{M|K|W} Enables the vectorizer and generates specialized and generic IA-32 code. The
generic code is usually slower than the specialized code.
-x{M|K|W} Turns on the vectorizer and generates processor-specific specialized code.
-vec_reportn Controls the vectorizer's level of diagnostic messages:
Usage
If you use -c and -ipo with the -vec_report{n} option, or -c with -x{M|K|W} or -ax{M|K|W} and -vec_report{n}, the compiler issues a warning and no report is generated.
To produce a report when using the aforementioned options, you need to add the -ipo_obj option. The combination of -c and -ipo_obj produces a single-file compilation that does generate object code, so a report is eventually generated.
Note that in some cases successful loop parallelization (either automatically or by means of OpenMP* directives) may affect the messages reported by the compiler for loop vectorization; for example, under the -vec_report2 option, loops may be reported as not successfully vectorized.
The following constructs commonly inhibit vectorization:
- function calls
- unvectorizable operations
- mixing vectorizable types in the same loop
- data-dependent loop exit conditions
To make your code vectorizable, you will often need to make some changes to your loops. However, you should make only the changes needed to enable vectorization and no others.
Restrictions
Hardware. The compiler is limited by restrictions imposed by the underlying hardware. In the case of
Streaming SIMD Extensions, the vector memory operations are limited to stride-1 accesses with a preference for 16-byte-aligned memory references. This means that if the compiler abstractly
recognizes a loop as vectorizable, it still might not vectorize it for a distinct target architecture.
Style. The style in which you write source code can inhibit optimization. For example, a common
problem with global pointers is that they often prevent the compiler from being able to prove that two memory references refer to distinct locations. Consequently, this prevents certain reordering
transformations.
Many stylistic issues that prevent automatic vectorization by compilers are found in loop structures.
The ambiguity arises from the complexity of the keywords, operators, data references, and memory
operations within the loop bodies.
However, by understanding these limitations and by knowing how to interpret diagnostic messages,
you can modify your program to overcome the known limitations and enable effective vectorizations.
The following topics summarize the capabilities and restrictions of the vectorizer with respect to loop
structures.
Data Dependence
Data dependence relations represent the required ordering constraints on the operations in serial
loops. Because vectorization rearranges the order in which operations are executed, any auto-
vectorizer must have at its disposal some form of data dependence analysis. The "Data-dependent
Loop" example shows some code that exhibits data dependence. The value of each element of an
array is dependent on itself and its two neighbors.
Data-dependent Loop
float data[N];
int i;
for (i=1; i<N-1; i++)
{
data[i]=data[i-1]*0.25+data[i]*0.5+data[i+1]*0.25;
}
The loop in the example above is not vectorizable because the write to the current element data[i] depends on the preceding element data[i-1], which was already written and changed in the previous iteration. To see this, look at the access pattern of the array for the first two iterations:
Iteration 1: read data[0], data[1], data[2]; write data[1]
Iteration 2: read data[1], data[2], data[3]; write data[2]
In the normal sequential version of the loop, the value of data[1] read during the second iteration was written during the first iteration. For vectorization, the iterations must be done in parallel, without changing the semantics of the original loop.
Data dependence analysis involves finding the conditions under which two memory accesses may
overlap. Given two references in a program, the conditions are defined by:
- whether the referenced variables may be aliases for the same (or overlapping) regions in memory, and
- for array references, the relationship between the subscripts.
For array references, the Intel® C++ Compiler's data dependence analyzer is organized as a series of
tests that progressively increase in power as well as time and space costs. First, a number of simple
tests are performed in a dimension-by-dimension manner, since independence in any dimension will
exclude any dependence relationship. Multi-dimensional array references that may cross their
declared dimension boundaries can be converted to their linearized form before the tests are applied.
Some of the simple tests used are the fast GCD test, proving independence if the greatest common
divisor of the coefficients of loop indices cannot evenly divide the constant term, and the extended
bounds test, which tests potential overlap for the extreme values of subscript expressions.
If all simple tests fail to prove independence, the compiler will eventually resort to a powerful
hierarchical dependence solver that uses Fourier-Motzkin elimination to solve the data dependence
problem in all dimensions.
Loop Constructs
Loops can be formed with the usual for, while, and do-while constructs, or with a goto and a label. However, a loop must have a single entry and a single exit to be vectorized.
Correct Usage
while(i<n)
{
// If branch is inside body of loop
a[i]=b[i]*c[i];
if(a[i]<0.0)
{
a[i]=0.0;
}
i++;
}
Incorrect Usage
while(i<n)
{
if (condition) break;
// 2nd exit.
++i;
}
To be countable, a loop's trip count must be expressible as one of the following:
- a constant
- a loop-invariant term
- a linear function of outermost loop indices
Loops whose exit depends on computation are not countable. The examples below show countable and non-countable loop constructs.
...
while(count!=lb)
{
// lb is not affected within the loop
a[i]=b[i]*x;
b[i]=c[i]+sqrt(d[i]);
--count;
}
For loops that operate on 32-bit single-precision and 64-bit double-precision floating-point numbers,
the Streaming SIMD Extensions provide SIMD instructions for the arithmetic operators +, -, *, and /.
Also, the Streaming SIMD Extensions provide SIMD instructions for the binary MIN, MAX, and unary
SQRT operators. SIMD versions of several other mathematical operators (like the trigonometric
functions SIN, COS, TAN) are supported in software in a vector mathematical run-time library that is
provided with the Intel® C++ Compiler.
Before Vectorization
i=0;
while(i<n)
{
// Original loop code
a[i]=b[i]+c[i];
++i;
}
After Vectorization
// The vectorizer generates the following two loops
i=0;
while(i<(n-n%4))
{
// Vector strip-mined loop
// Subscript [i:i+3] denotes SIMD execution
a[i:i+3]=b[i:i+3]+c[i:i+3];
i=i+4;
}
while(i<n)
{
// Scalar clean-up loop
a[i]=b[i]+c[i];
++i;
}
The statements within the loop body may contain float operations (typically on arrays). Supported
arithmetic operations include addition, subtraction, multiplication, division, negation, square root, max,
and min. Operations on double-precision types are not permitted unless optimizing for a Pentium® 4 processor system using the -xW or -axW compiler option.
The statements within the loop body may contain char, unsigned char, short, unsigned
short, int, and unsigned int. Calls to functions such as sqrt and fabs are also supported.
Arithmetic operations are limited to addition, subtraction, bitwise AND, OR, and XOR operators, division
(16-bit only), multiplication (16-bit only), min, and max. You can mix data types only if the conversion
can be done without a loss of precision. Some example operators where you can mix data types are
multiplication, shift, or unary operators.
Other Operations
No statements other than the preceding floating-point and integer operations are allowed. In particular, note that the special __m64 and __m128 data types are not vectorizable. The loop body cannot contain any function calls; use of Streaming SIMD Extensions intrinsics (for example, _mm_add_ps) is not allowed.
Language Support
Language Feature Description
__declspec(align(n)) Directs the compiler to align the variable to an n-byte
boundary. Address of the variable is address mod n=0.
__declspec(align(n,off)) Directs the compiler to align the variable to an n-byte boundary
with offset off within each n-byte boundary. Address of the
variable is address mod n = off.
restrict Permits the disambiguator flexibility in alias assumptions,
which enables more vectorization.
__assume_aligned(a,n) Instructs the compiler to assume that array a is aligned on an
n-byte boundary; used in cases where the compiler has failed
to obtain alignment information.
#pragma ivdep Instructs the compiler to ignore assumed vector dependencies.
#pragma vector {aligned|unaligned|always} Specifies how to vectorize the loop and indicates that efficiency heuristics should be ignored.
#pragma novector Specifies that the loop should never be vectorized.
Multi-version Code
Multi-version code is generated by the compiler in cases where data dependence analysis fails to
prove independence for a loop due to the occurrence of pointers with unknown values. This
functionality is referred to as dynamic dependence testing.
Pragma Scope
See Vectorization Support.
Sample Code
float *p, *q;
for(i=L; i<=U; i++)
{
p[i]=q[i];
}
...
pL=p+4*L;
pH=p+4*U;
qL=q+4*L;
qH=q+4*U;
if(pH<qL || pL>qH)
{
// No overlap: execute the vectorized version of the loop
}
else
{
// Possible overlap: execute the serial version of the loop
}
Vectorization Examples
This section contains a few simple examples of some common issues in vector programming.
The restrict keyword in the example below indicates that the pointers refer to distinct objects.
Therefore, the compiler allows vectorization without generation of multi-version code.
Data Alignment
A data structure or array of 16 bytes or larger should be aligned so that the base address of each structure or array element is a multiple of 16.
The "Misaligned Data Crossing 16-Byte Boundary" figure shows the effect of a data cache unit (DCU)
split due to misaligned data. The code loads the misaligned data across a 16-byte boundary, which
results in an additional memory access, causing a six- to twelve-cycle stall. You can avoid the stalls if you know that the data is aligned and you tell the compiler to assume alignment.
For example, if you know that elements a[0] and b[0] are aligned on a 16-byte boundary, then the
following loop can be vectorized with the alignment option on (#pragma vector aligned):
Both the vector iterations a[0:3] = b[0:3]; and a[4:7] = b[4:7]; can be implemented with
aligned moves if both the elements a[0] and b[0] (or, likewise, a[4] and b[4] ) are 16-byte
aligned.
Caution
If you specify incorrect alignment options to the vectorizer, the compiler will generate unexpected behavior. Specifically, using aligned moves on unaligned data will result in an illegal instruction exception.
If you know that lb is a multiple of 4, you can align the loop with #pragma vector aligned as
shown in the example that follows:
The use of b[k][j] is not a stride-1 reference and therefore will not normally be vectorizable. If the loops are interchanged, however, all the references become stride-1, as shown in the "Matrix Multiplication With Stride-1" example.
Caution
Interchanging is not always possible because of dependencies, which can lead to different results.
Compiler Directives
This section discusses the language-extended directives used in:
- Software Pipelining
- Loop Count and Loop Distribution
- Loop Unrolling
- Prefetching
- Vectorization
#pragma swp
#pragma noswp
The software pipelining optimization triggered by the swp directive applies instruction scheduling to certain innermost loops, allowing instructions within a loop to be split into different stages and increasing instruction-level parallelism. This can reduce the impact of long-latency operations, resulting in faster loop execution. Loops chosen for software pipelining are always innermost loops that do not
contain procedure calls that are not inlined. Because the optimizer no longer considers fully unrolled
loops as innermost loops, fully unrolling loops can allow an additional loop to become the innermost
loop. You can request and view the optimization report to see whether software pipelining was applied
(see Optimizer Report Generation).
The loop count(n) directive indicates that the loop count is likely to be n. The syntax for this directive is:
#pragma loop count(n)
where n is an integer constant. The value of loop count affects heuristics used in software pipelining, vectorization, and loop transformations.
The distribute point directive indicates to the compiler a preference for performing loop distribution. The syntax for this directive is:
#pragma distribute point
Loop distribution may cause large loops to be distributed into smaller ones. This may enable software pipelining for more loops. If the directive is placed inside a loop, the distribution is performed after the directive, and any loop-carried dependence is ignored. If the directive is placed before a loop, the compiler will determine where to distribute, and data dependence is observed. Only one distribute point directive is supported when placed inside a loop.
...
d[i]=c[i]+1;
}
for(i=1; i<m; i++)
{
b[i]=a[i]+1;
...
#pragma distribute point
sub(a,n);
c[i]=a[i]+b[i];
...
d[i]=c[i]+1;
}
The unroll directive (unroll(n)|nounroll) tells the compiler how many times to unroll a counted
loop. The syntax for this directive is:
#pragma unroll
#pragma unroll(n)
#pragma nounroll
where n is an integer constant from 0 through 255. The unroll directive must precede the for
statement for each for loop it affects. If n is specified, the optimizer unrolls the loop n times. If n is
omitted, or if it is outside the allowed range, the optimizer assigns the number of times to unroll the
loop. The unroll directive overrides any setting of loop unrolling from the command line. The
directive can be applied only to the innermost loop of a nest; if applied to outer loops, it is ignored.
The compiler generates correct code by comparing n and the loop count.
Prefetching Support
prefetch Directive
The prefetch and noprefetch directives direct the compiler to generate (or not generate) data prefetches for certain memory references. This affects the heuristics used in the compiler. The syntax for these directives is:
#pragma noprefetch
#pragma prefetch
#pragma prefetch a,b
If the expression a[j] is used within a loop, placing #pragma prefetch a in front of the loop causes the compiler to insert prefetches for a[j+d] within the loop, where d is determined by the compiler. This directive is supported when option -O3 is on.
The vector always directive instructs the compiler to override any efficiency heuristic in the decision to vectorize, and to vectorize even loops with non-unit strides or badly misaligned memory accesses.
ivdep Directive
The ivdep directive instructs the compiler to ignore assumed vector dependences. To ensure correct code, the compiler treats an assumed dependence as a proven dependence, which prevents vectorization; this directive overrides that decision. Use ivdep only when you know that the assumed loop dependences are safe to ignore. The loop in the example below will not vectorize without the ivdep directive, since the value of k is not known (vectorization would be illegal if k<0).
The vector directive means the loop should be vectorized, if it is legal to do so, ignoring normal heuristic decisions about profitability. When the aligned or unaligned qualifier is used, the loop should be vectorized using aligned or unaligned operations. Specify either aligned or unaligned, but not both.
Caution
If you specify aligned as an argument, you must be absolutely sure that the loop is vectorizable using this instruction; otherwise, the compiler will generate incorrect code. The loop in the example below uses the aligned qualifier to request that the loop be vectorized with aligned instructions, because the arrays are declared in such a way that the compiler could not normally prove this would be safe.
The compiler includes several alignment strategies in case the alignment of data structures is not
known at compile time. A simple example is shown below, but several other strategies are supported
as well. If, in the loop shown below, the alignment of a is unknown, the compiler will generate a prelude
loop that iterates until the most frequently occurring array reference hits an aligned address. This makes the
alignment properties of a known, and the vector loop is optimized accordingly.
//Alignment unknown
for(i=0; i<100; i++)
{
a[i]=a[i]+1.0f;
}
novector Directive
The novector directive specifies that the loop should never be vectorized, even if it is legal to do so.
In this example, suppose you know the trip count (ub - lb) is too low to make vectorization
worthwhile. You can use novector to tell the compiler not to vectorize, even if the loop is considered
vectorizable.
Consider the following guidelines when timing your application:
- Run program timings when other users are not active. Your timing results can be affected by one or more CPU-intensive processes also running during your timings.
- Try to run the program under the same conditions each time to provide the most accurate results, especially when comparing execution times of a previous version of the same program. Use the same system (processor model, amount of memory, version of the operating system, and so on) if possible.
- If you do need to change systems, measure the time using the same version of the program on both systems, so you know each system's effect on your timings.
- For programs that run for less than a few seconds, run several timings to ensure that the results are not misleading. Certain overhead functions, like loading external programs, might influence short timings considerably.
- If your program displays a lot of text, consider redirecting the output from the program. Redirecting output will change the reported times because of reduced screen I/O.
Sample Timing
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main(void)
{
clock_t start, finish;
long loop;
double duration, loop_calc;
start = clock();
for(loop=0; loop <= 2000; loop++)
{
loop_calc = 123.456 * 789;
}
finish = clock();
duration = (double)(finish - start) / CLOCKS_PER_SEC;
printf("Elapsed time: %2.3f seconds\n", duration);
return 0;
}
When one of the above logical names for optimizers is specified, all reports from that optimizer are
generated.
Each of the optimizers can have specific optimizations within it. Each of these optimizations is prefixed with one of the optimizer logical names. For example:
All optimization reports that have a matching prefix with the specified optimizer are generated. For
example, if -opt_report_phase ilo_co is specified, a report from both the constant propagation
and the copy propagation are generated.
The -opt_report_help option lists the logical names of optimizers available for report generation.
Overview: Libraries
The Intel® C++ Compiler uses the GNU* C Library and Dinkumware* C++ Library. These libraries are
documented at the following Internet locations:
GNU C Library
http://www.gnu.org/manual/glibc-2.2.3/html_chapter/libc_toc.html
Dinkumware C++ Library
http://www.dinkumware.com/htm_cpl/lib_cpp.html
Default Libraries
The following libraries are supplied with the Intel® C++ Compiler:
Library Description
libguide.a for OpenMP* implementation
libguide.so
libsvml.a short vector math library
libirc.a Intel support library for PGO and CPU dispatch
libimf.a Intel math library
libimf.so Intel math library
libcprts.a Dinkumware C++ Library
libcprts.so
libunwind.a Unwinder library
libunwind.so
libcxa.a Intel run time support for C++ features.
libcxa.so
If you want to link your program with alternate or additional libraries, specify them at the end of the command line. For example, to compile and link prog.c with mylib.a, use the following command:
prompt>icc prog.c mylib.a
The mylib.a library appears before the libimf.a library in the command line for the ld linker.
Caution
The Linux* system libraries and the compiler libraries are not built with the -align option. Therefore,
if you compile with the -align option and make a call to a compiler distributed or system library, and
have long long, double, or long double types in your interface, you will get the wrong answer
due to the difference in alignment. Any code built with -align cannot make calls to libraries that use
these types in their interfaces unless they are built with -align (in which case they will not work
without -align).
Math Libraries
The Intel math library, libimf.a, is included with the Intel C++ Compiler. This math library contains
optimized versions of the math functions in the standard C run-time library. The functions in libimf.a
are optimized for program execution speed on the Pentium® III and Pentium 4 processors. The
Itanium® compiler also includes a libimf.a designed to optimize execution on Itanium-based
systems.
Note
If you use the -lm switch for linking, precede it with -limf so that libimf.a is linked in before the system libm.a.
By default, the Intel-provided libraries are linked as follows:
- C++, math, and libcprts.a libraries are linked at link time, that is, statically.
- libcxa.so is linked dynamically.
- GNU* and Linux* system libraries are linked dynamically.
This default model has the following advantages:
- It maintains the same model for both IA-32 and Itanium® compilers.
- It provides a model consistent with the Linux model, where system libraries are dynamic and application libraries are static.
- Users have the option of using dynamic versions of the Intel libraries to reduce the size of their binaries, if desired.
- Users are licensed to distribute Intel-provided libraries.
The -i_dynamic option can be used to specify that all Intel-provided libraries should be linked
dynamically. The comparison of the following commands illustrates the effects of this option.
1. prompt>icc prog.c
This command links as follows:
- C++, math, libirc.a, and libcprts.a libraries are linked statically (at link time).
- The dynamic version of libcxa.so is linked at run time.
The statically linked libraries increase the size of the application binary, but do not need to be installed on the systems where the application runs.
2. prompt>icc -i_dynamic prog.c
This command links all of the above libraries dynamically. This has the advantage of reducing the size
of the application binary, but it requires all the dynamic versions installed on the systems where the
application runs.
The -shared option instructs the compiler to build a Dynamic Shared Object (DSO) instead of an
executable. For more details, refer to the ld man page documentation.
Managing Libraries
The LD_LIBRARY_PATH environment variable contains a colon-separated list of directories in which
the linker will search for library (.a) files. If you want the linker to search additional libraries, you can
add their names to LD_LIBRARY_PATH, to the command line, to a response file, or to the
configuration file. In each case, the names of these libraries are passed to the linker before the names
of the Intel libraries that the driver always specifies.
Modifying LD_LIBRARY_PATH
If you want to add a directory, /libs for example, to LD_LIBRARY_PATH, you can do either of the following:
prompt>export LD_LIBRARY_PATH=/libs:$LD_LIBRARY_PATH (bash)
prompt>setenv LD_LIBRARY_PATH /libs:$LD_LIBRARY_PATH (csh)
To compile file.c and link it with the library mylib.a, enter the following command:
prompt>icc file.c mylib.a
The compiler passes file names to the linker in the following order:
Library Description
libimf.a Default static math library.
libimf.so Default shared math library.
float fp32bits;
double fp64bits;
long double fp80bits;
long double pi_by_four = 3.141592653589793238/4.0;
// pi/4 radians is about 45 degrees.
return 0;
}
Since the example program above includes the long double data type, be sure to include the -long_double compiler option:
c64out = cexp(c64in);
c32out = cexpf(c32in);
return 0;
}
Note
Other Considerations
Some math functions are inlined automatically by the compiler. The functions actually inlined may vary
and may depend on any vectorization or processor-specific compilation options used. For more
information, see Criteria for Inline Expansion of Functions.
A change of the default precision control or rounding mode may affect the results returned by some of
the mathematical functions. See Floating-point Arithmetic Precision.
Depending on the data types used, some important compiler options include:
- -long_double: Use this option when compiling programs that require support for the long double data type (80-bit floating-point). Without this option, compilation will be successful, but long double data types will be mapped to double data types.
- -c99: Use this option when compiling programs that require support for _Complex data types.
Trigonometric Functions
The Intel Math library supports the following trigonometric functions:
ACOS
Description: The acos function returns the principal value of the inverse cosine of x in the
range [0, pi] radians for x in the interval [-1,1].
Calling interface:
ACOSD
Description: The acosd function returns the principal value of the inverse cosine of x in the
interval [0,180] degrees for x in the interval [-1,1].
Calling interface:
ASIN
Description: The asin function returns the principal value of the inverse sine of x in the range
[-pi/2, +pi/2] radians for x in the interval [-1,1].
Calling interface:
ASIND
Description: The asind function returns the principal value of the inverse sine of x in the
interval [-90,90] degrees for x in the interval [-1,1].
Calling interface:
ATAN
Description: The atan function returns the principal value of the inverse tangent of x in the
range [-pi/2, +pi/2] radians.
Calling interface:
ATAN2
Description: The atan2 function returns the principal value of the inverse tangent of y/x in the range [-pi, +pi] radians.
Calling interface:
ATAND
Description: The atand function returns the principal value of the inverse tangent of x in the
interval [-90,90] degrees.
Calling interface:
ATAND2
Description: The atand2 function returns the principal value of the inverse tangent of y/x in
the range [-180, +180] degrees.
Calling interface:
COS
Calling interface:
COSD
Calling interface:
COT
Calling interface:
COTD
Calling interface:
SIN
Calling interface:
SINCOS
Description: The sincos function returns both the sine and cosine of x measured in radians.
Calling interface:
SINCOSD
Description: The sincosd function returns both the sine and cosine of x measured in
degrees.
Calling interface:
SIND
Calling interface:
TAN
Calling interface:
TAND
Calling interface:
Hyperbolic Functions
The Intel Math library supports the following hyperbolic functions:
ACOSH
Calling interface:
ASINH
Calling interface:
ATANH
Calling interface:
COSH
Calling interface:
SINH
Calling interface:
SINHCOSH
Description: The sinhcosh function returns both the hyperbolic sine and hyperbolic cosine of
x.
Calling interface:
TANH
Calling interface:
Exponential Functions
The Intel Math library supports the following exponential functions:
CBRT
Calling interface:
EXP
Calling interface:
EXP10
Calling interface:
EXP2
Calling interface:
EXPM1
Description: The expm1 function returns e raised to the x power, minus 1: e^x - 1.
Calling interface:
FREXP
Description: The frexp function converts a floating-point number x into signed normalized
fraction in [1/2, 1) multiplied by an integral power of two. The signed normalized fraction is
returned, and the integer exponent stored at location exp.
Calling interface:
HYPOT
Description: The hypot function returns the square root of the sum of the squares of its two arguments, sqrt(x*x + y*y).
Calling interface:
ILOGB
Description: The ilogb function returns the exponent of x base two as a signed int value.
Calling interface:
LDEXP
Description: The ldexp function returns the value of x times 2 raised to the power exp: x*2^exp.
Calling interface:
LOG
Calling interface:
LOG10
Calling interface:
LOG1P
Description: The log1p function returns the natural log of (x+1), ln(x + 1).
Calling interface:
LOG2
Calling interface:
LOGB
Calling interface:
POW
Calling interface:
SCALB
Calling interface:
SCALBN
Calling interface:
SCALBLN
Calling interface:
SQRT
Description: The sqrt function returns the correctly rounded square root.
Calling interface:
Special Functions
The Intel Math library supports the following special functions:
ANNUITY
Description: The annuity function computes the present value factor for an annuity, (1-(1+x)^-y)/x, where x is a rate and y is a period.
Calling interface:
COMPOUND
Description: The compound function computes the compound interest factor, (1+x)^y, where x is a rate and y is a period.
Calling interface:
ERF
Calling interface:
ERFC
Description: The erfc function returns the complementary error function value.
Calling interface:
GAMMA
Description: The gamma function returns the value of the logarithm of the absolute value of
gamma.
Calling interface:
GAMMA_R
Description: The gamma_r function returns the natural logarithm of the absolute value of the
gamma function of x. The sign of the gamma function is returned in the external integer signgam.
Calling interface:
J0
Description: Computes the Bessel function (of the first kind) of x with order 0.
Calling interface:
J1
Description: Computes the Bessel function (of the first kind) of x with order 1.
Calling interface:
JN
Description: Computes the Bessel function (of the first kind) of x with order n.
Calling interface:
LGAMMA
Description: The lgamma function returns the natural logarithm of the absolute value of the
gamma function of x.
Calling interface:
LGAMMA_R
Description: The lgamma_r function returns the natural logarithm of the absolute value of the
gamma function of x. The sign of the gamma function is returned in the external integer signgam.
Calling interface:
TGAMMA
Description: The tgamma function returns the gamma function of x.
Calling interface:
Y0
Description: Computes the Bessel function (of the second kind) of x with order 0.
Calling interface:
Y1
Description: Computes the Bessel function (of the second kind) of x with order 1.
Calling interface:
YN
Description: Computes the Bessel function (of the second kind) of x with order n.
Calling interface:
CEIL
Description: The ceil function returns the smallest integral value not less than x as a floating-
point number.
Calling interface:
FLOOR
Description: The floor function returns the largest integral value not greater than x as a
floating-point value.
Calling interface:
LRINT
Description: The lrint function returns the rounded integer value as a long int.
Calling interface:
LLRINT
Description: The llrint function returns the rounded integer value as a long long int.
Calling interface:
LROUND
Description: The lround function returns the rounded integer value as a long int.
Calling interface:
LLROUND
Description: The llround function returns the rounded integer value as a long long int.
Calling interface:
MODF
Description: The modf function returns the value of the signed fractional part of x and stores
the integral part in floating-point format in *iptr.
Calling interface:
NEARBYINT
Description: The nearbyint function returns the rounded integral value as a floating-point
number.
Calling interface:
RINT
Description: The rint function returns the rounded integral value as a floating-point number.
Calling interface:
ROUND
Description: The round function returns the nearest integral value as a floating-point number.
Calling interface:
TRUNC
Description: The trunc function returns the truncated integral value as a floating-point
number.
Calling interface:
Remainder Functions
The Intel Math library supports the following remainder functions:
FMOD
Description: The fmod function returns the value x-n*y for integer n such that if y is nonzero,
the result has the same sign as x and magnitude less than the magnitude of y.
Calling interface:
REMAINDER
Description: The remainder function returns the value r = x - n*y, where n is the integer
nearest the exact value of x/y.
Calling interface:
REMQUO
Description: The remquo function returns the same remainder as the remainder function and,
in addition, stores the sign and at least the low-order three bits of the integral quotient
x/y in the location pointed to by quo.
Calling interface:
Miscellaneous Functions
The Intel Math library supports the following miscellaneous functions:
COPYSIGN
Description: The copysign function returns the value with the magnitude of x and the sign of
y.
Calling interface:
FABS
Description: The fabs function returns the absolute value of x.
Calling interface:
FDIM
Description: The fdim function returns the positive difference value, x-y (for x > y) or +0 (for
x <= y).
Calling interface:
FINITE
Description: The finite function returns a nonzero value if x is neither infinite nor a NaN;
otherwise it returns zero.
Calling interface:
FMA
Description: The fma function returns (x*y) + z, computed as a single ternary operation with
only one rounding.
Calling interface:
FMAX
Description: The fmax function returns the maximum numeric value of its arguments.
Calling interface:
FMIN
Description: The fmin function returns the minimum numeric value of its arguments.
Calling interface:
ISNAN
Description: The isnan function returns a nonzero value if and only if x has a NaN value.
Calling interface:
NEXTAFTER
Description: The nextafter function returns the next representable value in the specified
format after x in the direction of y.
Calling interface:
NEXTTOWARD
Description: The nexttoward function returns the next representable value in the specified
format after x in the direction of y. If x equals y, then the function returns y converted to the
type of the function.
Calling interface:
Complex Functions
The Intel Math library supports the following complex functions:
CABS
Description: The cabs function returns the complex absolute value of z.
Calling interface:
CACOS
Description: The cacos function returns the complex inverse cosine of z.
Calling interface:
CACOSH
Description: The cacosh function returns the complex inverse hyperbolic cosine of z.
Calling interface:
CARG
Description: The carg function returns the value of the argument in the interval [-pi, +pi].
Calling interface:
CASIN
Description: The casin function returns the complex inverse sine of z.
Calling interface:
CASINH
Description: The casinh function returns the complex inverse hyperbolic sine of z.
Calling interface:
CATAN
Description: The catan function returns the complex inverse tangent of z.
Calling interface:
CATANH
Description: The catanh function returns the complex inverse hyperbolic tangent of z.
Calling interface:
CCOS
Description: The ccos function returns the complex cosine of z.
Calling interface:
CCOSH
Description: The ccosh function returns the complex hyperbolic cosine of z.
Calling interface:
CEXP
Description: The cexp function returns e raised to the power z, e^z.
Calling interface:
CIMAG
Description: The cimag function returns the imaginary part of z.
Calling interface:
CIS
Description: The cis function returns the cosine and sine (as a complex value) of z measured
in radians.
Calling interface:
CLOG
Description: The clog function returns the complex natural logarithm of z.
Calling interface:
CONJ
Description: The conj function returns the complex conjugate of z, by reversing the sign of its
imaginary part.
Calling interface:
CPOW
Description: The cpow function returns the complex power function, x^y.
Calling interface:
CPROJ
Description: The cproj function returns a projection of z onto the Riemann sphere.
Calling interface:
CREAL
Description: The creal function returns the real part of z.
Calling interface:
CSIN
Description: The csin function returns the complex sine of z.
Calling interface:
CSINH
Description: The csinh function returns the complex hyperbolic sine of z.
Calling interface:
CSQRT
Description: The csqrt function returns the complex square root of z.
Calling interface:
CTAN
Description: The ctan function returns the complex tangent of z.
Calling interface:
CTANH
Description: The ctanh function returns the complex hyperbolic tangent of z.
Calling interface:
This section also describes how to control the severity of diagnostic messages.
Diagnostic Messages
Option Description
-w0,-w Displays error messages only. Both -w0 and -w display exactly the same messages.
-w1,-w2 Displays warnings and error messages. Both -w1 and -w2 display exactly the same
messages. The compiler uses this level as the default.
Language Diagnostics
These messages describe diagnostics that are reported during the processing of the source file. These
diagnostics have the following format:
filename Indicates the name of the source file currently being processed.
linenum Indicates the source line where the compiler detects the condition.
type Indicates the severity of the diagnostic message: warning, remark, error, or
catastrophic error.
[#nn] The number assigned to the error (or warning) message. Hard errors or catastrophes
are not assigned a number.
message Describes the diagnostic.
The compiler can also display internal error messages on the standard error. If your compilation
produces any internal errors, contact your Intel representative. Internal error messages are in the
following form:
1. /*ARGSUSED*/
2. /*NOTREACHED*/
3. /*VARARGS*/
Like the lint program, the compiler suppresses warnings about certain conditions when you place
these comments at specific points in the source.
Option Description
-w0,-w Displays error messages only. Both -w0 and -w display exactly the same messages.
-w1,-w2 Displays warnings and error messages. Both -w1 and -w2 display exactly the same
messages. The compiler uses this level as the default.
For some compilations, you might not want warnings for known and benign characteristics, such as the
K&R C constructs in your code. For example, the following command compiles newprog.c and
displays compiler errors, but not warnings:
Option Description
-wn Limits the number of error diagnostics that will be displayed before aborting
compilation to n. Remarks and warnings do not count toward this limit.
For example, the following command line specifies that if more than 50 error messages are displayed
during the compilation of a.c, compilation aborts.
Remark Messages
These messages report common, but sometimes unconventional, use of C or C++. The compiler does
not print or display remarks unless you specify level 4 for the -W option, as described in Suppressing
Warning Messages or Enabling Remarks. Remarks do not stop translation or linking. Remarks do not
interfere with any output files. The following are some representative remark messages:
gcc Compatibility
C language object files created with the Intel® C++ Compiler are binary compatible with the GNU* gcc
compiler and glibc, the GNU C language library. C language object files can be linked with either the
Intel compiler or the gcc compiler. However, to correctly pass the Intel libraries to the linker, use the
Intel compiler. See Linking and Default Libraries for more information.
GNU C includes several non-standard features not found in ISO standard C. Many of these
extensions to the C language are supported in this version of the Intel C++ Compiler. See the GNU
Web site at http://www.gnu.org for more information.
Feature Supported gcc Documentation
Nested Functions No http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Nested-Functions.html#Nested%20Functions
Constructing Function Calls No http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Constructing-Calls.html#Constructin
Slightly Looser Rules for Escaped Newlines No http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Escaped-Newlines.html#Escaped%20Newlines
Arithmetic on Function Pointers No http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Pointer-Arith.html#Pointer%20Arith
Declaring Attributes of Functions No http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Function-Attributes.html#Function%20Attributes
Attribute Syntax No http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Attribute-Syntax.html#Attribute%20
Prototypes and Old-Style Function Definitions No http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Function-Prototypes.html#Function%20Prototypes
Specifying Attributes of Variables No http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Variable-Attributes.html#Variable%
Specifying Attributes of Types No http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Type-Attributes.html#Type%20Attri
Alternate Keywords No http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Alternate-Keywords.html#Alternate%20Keywords
Incomplete enum Types Yes http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Incomplete-Enums.html#Incomplete%20Enums
Function Names as Strings Yes http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Function-Names.html#Function%20Names
Built-in Functions Specific to Particular Target Machines No http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Target-Builtins.html#Target%20Bui
Pragmas Accepted by GCC No http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Pragmas.html#Pragmas
Unnamed struct/union fields within structs/unions No http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Unnamed-Fields.html#Unnamed%2
Compiler Limits
The table below shows the size or number of each item that the compiler can process. All capacities
shown in the table are tested values; the actual number can be greater than the number shown.
Types of Intrinsics
The Intel® Pentium® 4 processor and other Intel processors have instructions that enable the
development of optimized multimedia applications. These instructions extend previously
implemented instruction sets and use the single instruction, multiple data (SIMD) technique.
By processing data elements in parallel, applications with media-rich bit streams can
significantly improve performance using SIMD instructions. The Intel® Itanium® processor also
supports these instructions.
The most direct way to use these instructions is to inline the assembly language instructions into your
source code. However, this can be time-consuming and tedious, and assembly language inline
programming is not supported on all compilers. Instead, Intel provides easy implementation through
the use of API extension sets referred to as intrinsics.
Intrinsics are special coding extensions that allow using the syntax of C function calls and C variables
instead of hardware registers. Using these intrinsics frees programmers from having to program in
assembly language and manage registers. In addition, the compiler optimizes the instruction
scheduling so that executables run faster.
In addition, the native intrinsics for the Itanium processor give programmers access to Itanium
instructions that cannot be generated using the standard constructs of the C and C++ languages.
The Intel® C++ Compiler also supports general-purpose intrinsics that work across all IA-32 and
Itanium-based platforms.
Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference Manual, Intel
Corporation, doc. number 243191.
Itanium Processor X X N/A X
Pentium 4 Processor X X X N/A
The MMX technology and Streaming SIMD Extension instructions use the following new features:
- New Registers--Enable packed data of up to 128 bits in length for optimal SIMD processing.
- New Data Types--Enable packing of up to 16 elements of data in one register.
The Streaming SIMD Extensions 2 intrinsics are defined only for IA-32, not for Itanium®-based
systems. Streaming SIMD Extensions 2 operate on 128-bit quantities: two 64-bit double-precision
floating-point values. The Itanium architecture does not support parallel double-precision
computation, so Streaming SIMD Extensions 2 are not implemented on Itanium-based systems.
New Registers
A key feature provided by the architecture of these processors is a set of new registers. The MMX
instructions use eight 64-bit registers (mm0 to mm7), which are aliased on the floating-point
stack registers.
The Streaming SIMD Extensions use eight 128-bit registers (xmm0 to xmm7).
These new data registers enable the processing of data elements in parallel. Because each register
can hold more than one data element, the processor can process more than one data element
simultaneously. This processing capability is also known as single-instruction multiple data processing
(SIMD).
For each computational and data manipulation instruction in the new extension sets, there is a
corresponding C intrinsic that implements that instruction directly. This frees you from managing
registers and assembly programming. Further, the compiler optimizes the instruction scheduling so
that your executable runs faster.
Note
The MM and XMM registers are the SIMD registers used by the IA-32 platforms to implement MMX
technology and Streaming SIMD Extensions/Streaming SIMD Extensions 2 intrinsics. On the Itanium-
based platforms, the MMX and Streaming SIMD Extension intrinsics use the 64-bit general registers
and the 64-bit significand of the 80-bit floating-point register.
Intrinsic functions use four new C data types as operands, representing the new registers that are used
as the operands to these intrinsic functions. The table below shows the new data type availability
marked with "X".
The __m64 data type is used to represent the contents of an MMX register, which is the register
that is used by the MMX technology intrinsics. The __m64 data type can hold eight 8-bit values,
four 16-bit values, two 32-bit values, or one 64-bit value.
The __m128 data type is used to represent the contents of a Streaming SIMD Extension
register used by the Streaming SIMD Extension intrinsics. The __m128 data type can hold four
32-bit floating-point values.
The __m128d data type can hold two 64-bit floating-point values.
The __m128i data type can hold sixteen 8-bit, eight 16-bit, four 32-bit, or two 64-bit integer
values.
The compiler aligns __m128 local and global data to 16-byte boundaries on the stack. To align
integer, float, or double arrays, you can use the __declspec(align) declaration.
Since these new data types are not basic ANSI C data types, you must observe the following usage
restrictions:
- Use the new data types only on either side of an assignment, as a return value, or as a
parameter. You cannot use them with other arithmetic expressions ("+", "-", and so on).
- Use the new data types as objects in aggregates, such as unions, to access the byte
elements, and in structures.
- Use the new data types only with the respective intrinsics described in this documentation.
The new data types are supported on both sides of an assignment statement: as parameters to
a function call, and as a return value from a function call.
Intrinsic names have the form:
_mm_<intrin_op>_<suffix>
A number appended to a variable name indicates the element of a packed object. For example, r0 is
the lowest word of r. Some intrinsics are "composites" because they require more than one instruction
to implement them.
The packed values are represented in right-to-left order, with the lowest value being used for scalar
operations. Consider the following example operation:
__m128d t = _mm_load_pd(a);
In other words, the xmm register that holds the value t will look as follows:
The "scalar" element is 1.0. Due to the nature of the instruction, some
intrinsics require their arguments to be immediates (constant integer literals).
Intrinsic Syntax
To use an intrinsic in your code, insert a line with the following syntax:
Where,
Passing a constant shift value in the rotate intrinsics results in higher performance.
Intrinsic Description
int abs(int) Returns the absolute value of an integer.
long labs(long) Returns the absolute value of a long integer.
unsigned long _lrotl(unsigned long value, int shift) Rotates bits left for an unsigned long integer.
unsigned long _lrotr(unsigned long value, int shift) Rotates bits right for an unsigned long integer.
unsigned int __rotl(unsigned int value, int shift) Rotates bits left for an unsigned integer.
unsigned int __rotr(unsigned int value, int shift) Rotates bits right for an unsigned integer.
Floating-point Related
Intrinsic Description
double fabs(double) Returns the absolute value of a floating-point value.
double log(double) Returns the natural logarithm ln(x), x>0, with double precision.
float logf(float) Returns the natural logarithm ln(x), x>0, with single precision.
double log10(double) Returns the base 10 logarithm log10(x), x>0, with double precision.
float log10f(float) Returns the base 10 logarithm log10(x), x>0, with single precision.
double exp(double) Returns the exponential function with double precision.
float expf(float) Returns the exponential function with single precision.
double pow(double, double) Returns the value of x to the power y with double precision.
float powf(float, float) Returns the value of x to the power y with single precision.
double sin(double) Returns the sine of x with double precision.
float sinf(float) Returns the sine of x with single precision.
double cos(double) Returns the cosine of x with double precision.
float cosf(float) Returns the cosine of x with single precision.
double tan(double) Returns the tangent of x with double precision.
float tanf(float) Returns the tangent of x with single precision.
double acos(double) Returns the arccosine of x with double precision.
float acosf(float) Returns the arccosine of x with single precision.
double acosh(double) Computes the inverse hyperbolic cosine of the argument with double precision.
float acoshf(float) Computes the inverse hyperbolic cosine of the argument with single precision.
double asin(double) Computes the arc sine of the argument with double precision.
float asinf(float) Computes the arc sine of the argument with single precision.
double asinh(double) Computes the inverse hyperbolic sine of the argument with double precision.
float asinhf(float) Computes the inverse hyperbolic sine of the argument with single precision.
double atan(double) Computes the arc tangent of the argument with double precision.
float atanf(float) Computes the arc tangent of the argument with single precision.
double atanh(double) Computes the inverse hyperbolic tangent of the argument with double precision.
float atanhf(float) Computes the inverse hyperbolic tangent of the argument with single precision.
float cabs(double)** Computes the absolute value of a complex number.
double ceil(double) Computes the smallest integral value not less than the double precision argument.
float ceilf(float) Computes the smallest integral value not less than the single precision argument.
double cosh(double) Computes the hyperbolic cosine of the double precision argument.
float coshf(float) Computes the hyperbolic cosine of the single precision argument.
float fabsf(float) Computes the absolute value of the single precision argument.
double floor(double) Computes the largest integral value not greater than the double precision argument.
float floorf(float) Computes the largest integral value not greater than the single precision argument.
double fmod(double, double) Computes the floating-point remainder of the division of the first argument by the second argument with double precision.
float fmodf(float, float) Computes the floating-point remainder of the division of the first argument by the second argument with single precision.
double hypot(double, double) Computes the length of the hypotenuse of a right-angled triangle with double precision.
float hypotf(float, float) Computes the length of the hypotenuse of a right-angled triangle with single precision.
double rint(double) Computes the integral value represented as double using the IEEE rounding mode.
float rintf(float) Computes the integral value represented with single precision using the IEEE rounding mode.
double sinh(double) Computes the hyperbolic sine of the double precision argument.
float sinhf(float) Computes the hyperbolic sine of the single precision argument.
float sqrtf(float) Computes the square root of the single precision argument.
double tanh(double) Computes the hyperbolic tangent of the double precision argument.
** double in this case is a complex number made up of two single precision (32-bit floating-point)
elements (real and imaginary parts).
Intrinsic Description
char *_strset(char *, _int32) Sets all characters in a string to a fixed value.
int memcmp(const void *cs, const void *ct, size_t n) Compares two regions of memory. Returns <0 if cs<ct, 0 if cs=ct, or >0 if cs>ct.
void *memcpy(void *s, const void *ct, size_t n) Copies from memory. Returns s.
void *memset(void *s, int c, size_t n) Sets memory to a fixed value. Returns s.
char *strcat(char *s, const char *ct) Appends to a string. Returns s.
int strcmp(const char *, const char *) Compares two strings. Returns <0 if cs<ct, 0 if cs=ct, or >0 if cs>ct.
char *strcpy(char *s, const char *ct) Copies a string. Returns s.
size_t strlen(const char *cs) Returns the length of string cs.
int strncmp(char *, char *, int) Compares two strings, but only the specified number of characters.
char *strncpy(char *, char *, int) Copies a string, but only the specified number of characters.
Miscellaneous Intrinsics
Note
Except for _enable() and _disable(), these functions have not been implemented for Itanium®
instructions.
Intrinsic Description
void *_alloca(int) Allocates a buffer on the stack.
int _setjmp(jmp_buf)* A fast version of setjmp(), which bypasses the termination handling. Saves the callee-save registers, stack pointer, and return address.
_exception_code(void) Returns the exception code.
_exception_info(void) Returns the exception information.
_abnormal_termination(void) Can be invoked only by termination handlers. Returns TRUE if the termination handler is invoked as a result of a premature exit of the corresponding try-finally region.
void _enable() Enables interrupts.
void _disable() Disables interrupts.
int _bswap(int) Maps to the IA-32 instruction BSWAP (swap bytes). Converts a little/big endian 32-bit argument to big/little endian form.
int _in_byte(int) Maps to the IA-32 instruction IN. Transfers a data byte from the port specified by the argument.
int _in_dword(int) Maps to the IA-32 instruction IN. Transfers a double word from the port specified by the argument.
int _in_word(int) Maps to the IA-32 instruction IN. Transfers a word from the port specified by the argument.
int _inp(int) Same as _in_byte.
int _inpd(int) Same as _in_dword.
int _inpw(int) Same as _in_word.
int _out_byte(int, int) Maps to the IA-32 instruction OUT. Transfers the data byte in the second argument to the port specified by the first argument.
int _out_dword(int, int) Maps to the IA-32 instruction OUT. Transfers the double word in the second argument to the port specified by the first argument.
int _out_word(int, int) Maps to the IA-32 instruction OUT. Transfers the word in the second argument to the port specified by the first argument.
int _outp(int, int) Same as _out_byte.
The prototypes for MMX technology intrinsics are in the mmintrin.h header file.
Caution
Failure to empty the multimedia state after using an MMX instruction and before using a floating-point
instruction can result in unexpected execution or poor performance.
- Do not use on Itanium®-based systems. There are no special registers (or overlay) for the MMX
(TM) instructions or Streaming SIMD Extensions on Itanium-based systems, even though the
intrinsics are supported.
- Use _mm_empty() after an MMX instruction if the next instruction is a floating-point (FP)
instruction, for example, before calculations on float, double, or long double. You must be
aware of all situations in which your code generates an MMX instruction with the Intel® C++
Compiler, that is:
  - when using an MMX technology intrinsic
  - when using Streaming SIMD Extension integer intrinsics that use the __m64 data type
  - when referencing an __m64 data type variable
  - when using an MMX instruction through inline assembly
- Do not use _mm_empty() before an MMX instruction, since using _mm_empty() before an
MMX instruction incurs an operation with no benefit (no-op).
- Use different functions for operations that use FP instructions and those that use MMX
instructions. This eliminates the need to empty the multimedia state within the body of a critical
loop.
- Use _mm_empty() during runtime initialization of __m64 and FP data types. This ensures
resetting the register between data type transitions.
- See the "Correct Usage" coding example below.
void _m_empty(void)
__m64 _m_from_int(int i)
Convert the integer object i to a 64-bit __m64 object. The integer value is zero-extended to 64
bits.
int _m_to_int(__m64 m)
Pack the four 16-bit values from m1 into the lower four 8-bit values of the result with signed
saturation, and pack the four 16-bit values from m2 into the upper four 8-bit values of the result
with signed saturation.
Pack the two 32-bit values from m1 into the lower two 16-bit values of the result with signed
saturation, and pack the two 32-bit values from m2 into the upper two 16-bit values of the result
with signed saturation.
Pack the four 16-bit values from m1 into the lower four 8-bit values of the result with unsigned
saturation, and pack the four 16-bit values from m2 into the upper four 8-bit values of the result
with unsigned saturation.
Interleave the four 8-bit values from the high half of m1 with the four values from the high half of
m2. The interleaving begins with the data from m1.
Interleave the two 16-bit values from the high half of m1 with the two values from the high half of
m2. The interleaving begins with the data from m1.
Interleave the 32-bit value from the high half of m1 with the 32-bit value from the high half of m2.
The interleaving begins with the data from m1.
Interleave the four 8-bit values from the low half of m1 with the four values from the low half of
m2. The interleaving begins with the data from m1.
Interleave the two 16-bit values from the low half of m1 with the two values from the low half of
m2. The interleaving begins with the data from m1.
Interleave the 32-bit value from the low half of m1 with the 32-bit value from the low half of m2.
The interleaving begins with the data from m1.
Add the eight 8-bit values in m1 to the eight 8-bit values in m2.
Add the four 16-bit values in m1 to the four 16-bit values in m2.
Add the two 32-bit values in m1 to the two 32-bit values in m2.
Add the eight signed 8-bit values in m1 to the eight signed 8-bit values in m2 using saturating
arithmetic.
Add the four signed 16-bit values in m1 to the four signed 16-bit values in m2 using saturating
arithmetic.
Add the eight unsigned 8-bit values in m1 to the eight unsigned 8-bit values in m2 and using
saturating arithmetic.
Add the four unsigned 16-bit values in m1 to the four unsigned 16-bit values in m2 using
saturating arithmetic.
Subtract the eight 8-bit values in m2 from the eight 8-bit values in m1.
Subtract the four 16-bit values in m2 from the four 16-bit values in m1.
Subtract the two 32-bit values in m2 from the two 32-bit values in m1.
Subtract the eight signed 8-bit values in m2 from the eight signed 8-bit values in m1 using
saturating arithmetic.
Subtract the four signed 16-bit values in m2 from the four signed 16-bit values in m1 using
saturating arithmetic.
Subtract the eight unsigned 8-bit values in m2 from the eight unsigned 8-bit values in m1 using
saturating arithmetic.
Subtract the four unsigned 16-bit values in m2 from the four unsigned 16-bit values in m1 using
saturating arithmetic.
Multiply four 16-bit values in m1 by four 16-bit values in m2 producing four 32-bit intermediate
results, which are then summed by pairs to produce two 32-bit results.
Multiply four signed 16-bit values in m1 by four signed 16-bit values in m2 and produce the high
16 bits of the four results.
Multiply four 16-bit values in m1 by four 16-bit values in m2 and produce the low 16 bits of the
four results.
Shift four 16-bit values in m left the amount specified by count while shifting in zeros.
Shift four 16-bit values in m left the amount specified by count while shifting in zeros. For the
best performance, count should be a constant.
Shift two 32-bit values in m left the amount specified by count while shifting in zeros.
Shift two 32-bit values in m left the amount specified by count while shifting in zeros. For the
best performance, count should be a constant.
Shift the 64-bit value in m left the amount specified by count while shifting in zeros.
Shift the 64-bit value in m left the amount specified by count while shifting in zeros. For the best
performance, count should be a constant.
Shift four 16-bit values in m right the amount specified by count while shifting in the sign bit.
Shift four 16-bit values in m right the amount specified by count while shifting in the sign bit. For
the best performance, count should be a constant.
Shift two 32-bit values in m right the amount specified by count while shifting in the sign bit.
Shift two 32-bit values in m right the amount specified by count while shifting in the sign bit. For
the best performance, count should be a constant.
Shift four 16-bit values in m right the amount specified by count while shifting in zeros.
Shift four 16-bit values in m right the amount specified by count while shifting in zeros. For the
best performance, count should be a constant.
Shift two 32-bit values in m right the amount specified by count while shifting in zeros.
Shift two 32-bit values in m right the amount specified by count while shifting in zeros. For the
best performance, count should be a constant.
Shift the 64-bit value in m right the amount specified by count while shifting in zeros.
Shift the 64-bit value in m right the amount specified by count while shifting in zeros. For the
best performance, count should be a constant.
Perform a bitwise AND of the 64-bit value in m1 with the 64-bit value in m2.
Perform a logical NOT on the 64-bit value in m1 and use the result in a bitwise AND with the 64-
bit value in m2.
Perform a bitwise OR of the 64-bit value in m1 with the 64-bit value in m2.
Perform a bitwise XOR of the 64-bit value in m1 with the 64-bit value in m2.
If the respective 8-bit values in m1 are equal to the respective 8-bit values in m2 set the
respective 8-bit resulting values to all ones, otherwise set them to all zeros.
If the respective 16-bit values in m1 are equal to the respective 16-bit values in m2 set the
respective 16-bit resulting values to all ones, otherwise set them to all zeros.
If the respective 32-bit values in m1 are equal to the respective 32-bit values in m2 set the
respective 32-bit resulting values to all ones, otherwise set them to all zeros.
If the respective 8-bit values in m1 are greater than the respective 8-bit values in m2 set the
respective 8-bit resulting values to all ones, otherwise set them to all zeros.
If the respective 16-bit values in m1 are greater than the respective 16-bit values in m2 set the
respective 16-bit resulting values to all ones, otherwise set them to all zeros.
If the respective 32-bit values in m1 are greater than the respective 32-bit values in m2 set the
respective 32-bit resulting values to all ones, otherwise set them to all zeros.
Note
In the following descriptions regarding the bits of the MMX(TM) register, bit 0 is the least significant
and bit 63 is the most significant.
__m64 _mm_setzero_si64()
PXOR
Sets the 64-bit value to zero.
r := 0x0
__m64 _mm_set_pi8(char b7, char b6, char b5, char b4, char b3, char b2,
char b1, char b0)
__m64 _mm_set1_pi32(int i)
__m64 _mm_set1_pi16(short s)
__m64 _mm_set1_pi8(char b)
__m64 _mm_setr_pi8(char b0, char b1, char b2, char b3, char b4, char b5,
char b6, char b7)
Some intrinsics have more than one name. When one intrinsic has two names, both names generate
the same instructions, but the first is preferred as it conforms to a newer naming standard.
The prototypes for MMX technology intrinsics are in the mmintrin.h header file.
Data Types
The C data type __m64 is used when using MMX technology intrinsics. It can hold eight 8-bit values,
four 16-bit values, two 32-bit values, or one 64-bit value.
The __m64 data type is not a basic ANSI C data type. Therefore, observe the following usage
restrictions:
• Use the new data type only on the left-hand side of an assignment, as a return value, or as a
parameter. You cannot use it with other arithmetic expressions ("+", "-", and so on).
• Use the new data type as objects in aggregates, such as unions, to access the byte elements
and structures; the address of an __m64 object may be taken.
• Use new data types only with the respective intrinsics described in this documentation.
For complete details of the hardware instructions, see the Intel Architecture MMX Technology
Programmer's Reference Manual. For descriptions of data types, see the Intel Architecture Software
Developer's Manual, Volume 2.
The prototypes for Streaming SIMD Extensions intrinsics are in the xmmintrin.h header file.
• Certain intrinsics, such as _mm_loadr_ps and _mm_cmpgt_ss, are not directly supported by
the instruction set. While these intrinsics are convenient programming aids, be mindful that they
may consist of more than one machine-language instruction.
• Floating-point data loaded or stored as __m128 objects must generally be 16-byte-aligned.
• Some intrinsics require that their arguments be immediates, that is, constant integers (literals),
due to the nature of the instruction.
• The result of arithmetic operations acting on two NaN (Not a Number) arguments is undefined.
Therefore, FP operations using NaN arguments may not match the expected behavior of the
corresponding assembly instructions.
r0 := a0 + b0
r1 := a1 ; r2 := a2 ; r3 := a3
r0 := a0 + b0
r1 := a1 + b1
r2 := a2 + b2
r3 := a3 + b3
Subtracts the lower SP FP values of a and b. The upper 3 SP FP values are passed through
from a.
r0 := a0 - b0
r1 := a1 ; r2 := a2 ; r3 := a3
r0 := a0 - b0
r1 := a1 - b1
r2 := a2 - b2
r3 := a3 - b3
Multiplies the lower SP FP values of a and b ; the upper 3 SP FP values are passed through
from a.
r0 := a0 * b0
r1 := a1 ; r2 := a2 ; r3 := a3
r0 := a0 * b0
r1 := a1 * b1
r2 := a2 * b2
r3 := a3 * b3
Divides the lower SP FP values of a and b ; the upper 3 SP FP values are passed through from
a.
r0 := a0 / b0
r1 := a1 ; r2 := a2 ; r3 := a3
r0 := a0 / b0
r1 := a1 / b1
r2 := a2 / b2
r3 := a3 / b3
__m128 _mm_sqrt_ss(__m128 a)
Computes the square root of the lower SP FP value of a ; the upper 3 SP FP values are passed
through.
r0 := sqrt(a0)
r1 := a1 ; r2 := a2 ; r3 := a3
__m128 _mm_sqrt_ps(__m128 a)
r0 := sqrt(a0)
r1 := sqrt(a1)
r2 := sqrt(a2)
r3 := sqrt(a3)
__m128 _mm_rcp_ss(__m128 a)
Computes the approximation of the reciprocal of the lower SP FP value of a; the upper 3 SP FP
values are passed through.
r0 := recip(a0)
r1 := a1 ; r2 := a2 ; r3 := a3
__m128 _mm_rcp_ps(__m128 a)
r0 := recip(a0)
r1 := recip(a1)
r2 := recip(a2)
r3 := recip(a3)
__m128 _mm_rsqrt_ss(__m128 a)
Computes the approximation of the reciprocal of the square root of the lower SP FP value of a;
the upper 3 SP FP values are passed through.
r0 := recip(sqrt(a0))
r1 := a1 ; r2 := a2 ; r3 := a3
__m128 _mm_rsqrt_ps(__m128 a)
Computes the approximations of the reciprocals of the square roots of the four SP FP values of
a.
r0 := recip(sqrt(a0))
r1 := recip(sqrt(a1))
r2 := recip(sqrt(a2))
r3 := recip(sqrt(a3))
Computes the minimum of the lower SP FP values of a and b; the upper 3 SP FP values are
passed through from a.
r0 := min(a0, b0)
r1 := a1 ; r2 := a2 ; r3 := a3
r0 := min(a0, b0)
r1 := min(a1, b1)
r2 := min(a2, b2)
r3 := min(a3, b3)
Computes the maximum of the lower SP FP values of a and b ; the upper 3 SP FP values are
passed through from a.
r0 := max(a0, b0)
r1 := a1 ; r2 := a2 ; r3 := a3
r0 := max(a0, b0)
r1 := max(a1, b1)
r2 := max(a2, b2)
r3 := max(a3, b3)
r0 := a0 & b0
r1 := a1 & b1
r2 := a2 & b2
r3 := a3 & b3
r0 := ~a0 & b0
r1 := ~a1 & b1
r2 := ~a2 & b2
r3 := ~a3 & b3
r0 := a0 | b0
r1 := a1 | b1
r2 := a2 | b2
r3 := a3 | b3
r0 := a0 ^ b0
r1 := a1 ^ b1
r2 := a2 ^ b2
r3 := a3 ^ b3
The prototypes for Streaming SIMD Extensions intrinsics are in the xmmintrin.h header file.
r1 := a1 ; r2 := a2 ; r3 := a3
Compares the lower SP FP value of a and b for a equal to b. If a and b are equal, 1 is returned.
Otherwise 0 is returned.
Compares the lower SP FP value of a and b for a less than b. If a is less than b, 1 is returned.
Otherwise 0 is returned.
Compares the lower SP FP value of a and b for a less than or equal to b. If a is less than or
equal to b, 1 is returned. Otherwise 0 is returned.
Compares the lower SP FP value of a and b for a greater than b. If a is greater than b, 1 is
returned. Otherwise 0 is returned.
Compares the lower SP FP value of a and b for a greater than or equal to b. If a is greater than
or equal to b, 1 is returned. Otherwise 0 is returned.
Compares the lower SP FP value of a and b for a not equal to b. If a and b are not equal, 1 is
returned. Otherwise 0 is returned.
Compares the lower SP FP value of a and b for a equal to b. If a and b are equal, 1 is returned.
Otherwise 0 is returned.
Compares the lower SP FP value of a and b for a less than b. If a is less than b, 1 is returned.
Otherwise 0 is returned.
Compares the lower SP FP value of a and b for a less than or equal to b. If a is less than or
equal to b, 1 is returned. Otherwise 0 is returned.
Compares the lower SP FP value of a and b for a greater than b. If a is greater than b, 1 is
returned. Otherwise 0 is returned.
Compares the lower SP FP value of a and b for a greater than or equal to b. If a is greater than
or equal to b, 1 is returned. Otherwise 0 is returned.
Compares the lower SP FP value of a and b for a not equal to b. If a and b are not equal, 1 is
returned. Otherwise 0 is returned.
The prototypes for Streaming SIMD Extensions intrinsics are in the xmmintrin.h header file.
int _mm_cvt_ss2si(__m128 a)
Convert the lower SP FP value of a to a 32-bit integer according to the current rounding mode.
r := (int)a0
__m64 _mm_cvt_ps2pi(__m128 a)
Convert the two lower SP FP values of a to two 32-bit integers according to the current rounding
mode, returning the integers in packed form.
r0 := (int)a0
r1 := (int)a1
int _mm_cvtt_ss2si(__m128 a)
r := (int)a0
__m64 _mm_cvtt_ps2pi(__m128 a)
Convert the two lower SP FP values of a to two 32-bit integers with truncation, returning the
integers in packed form.
r0 := (int)a0
r1 := (int)a1
Convert the 32-bit integer value b to an SP FP value; the upper three SP FP values are passed
through from a.
r0 := (float)b
r1 := a1 ; r2 := a2 ; r3 := a3
Convert the two 32-bit integer values in packed form in b to two SP FP values; the upper two
SP FP values are passed through from a.
r0 := (float)b0
r1 := (float)b1
r2 := a2
r3 := a3
Convert the four 16-bit signed integer values in a to four single precision FP values.
r0 := (float)a0
r1 := (float)a1
r2 := (float)a2
r3 := (float)a3
Convert the four 16-bit unsigned integer values in a to four single precision FP values.
r0 := (float)a0
r1 := (float)a1
r2 := (float)a2
r3 := (float)a3
Convert the lower four 8-bit signed integer values in a to four single precision FP values.
r0 := (float)a0
r1 := (float)a1
r2 := (float)a2
r3 := (float)a3
Convert the lower four 8-bit unsigned integer values in a to four single precision FP values.
r0 := (float)a0
r1 := (float)a1
r2 := (float)a2
r3 := (float)a3
Convert the two 32-bit signed integer values in a and the two 32-bit signed integer values in b to
four single precision FP values.
r0 := (float)a0
r1 := (float)a1
r2 := (float)b0
r3 := (float)b1
Convert the four single precision FP values in a to four signed 16-bit integer values.
r0 := (short)a0
r1 := (short)a1
r2 := (short)a2
r3 := (short)a3
Convert the four single precision FP values in a to the lower four signed 8-bit integer values of
the result.
r0 := (char)a0
r1 := (char)a1
r2 := (char)a2
r3 := (char)a3
The prototypes for Streaming SIMD Extensions intrinsics are in the xmmintrin.h header file.
__m128 _mm_load_ss(float * p )
Loads an SP FP value into the low word and clears the upper three words.
r0 := *p
__m128 _mm_load_ps1(float * p )
r0 := *p
r1 := *p
r2 := *p
r3 := *p
__m128 _mm_load_ps(float * p )
r0 := p[0]
r1 := p[1]
r2 := p[2]
r3 := p[3]
__m128 _mm_loadu_ps(float * p)
r0 := p[0]
r1 := p[1]
r2 := p[2]
r3 := p[3]
__m128 _mm_loadr_ps(float * p)
r0 := p[3]
r1 := p[2]
r2 := p[1]
r3 := p[0]
The prototypes for Streaming SIMD Extensions intrinsics are in the xmmintrin.h header file.
__m128 _mm_set_ss(float w )
Sets the low word of an SP FP value to w and clears the upper three words.
r0 := w
r1 := r2 := r3 := 0.0
__m128 _mm_set_ps1(float w )
r0 := r1 := r2 := r3 := w
r0 := w
r1 := x
r2 := y
r3 := z
r0 := z
r1 := y
r2 := x
r3 := w
__m128 _mm_setzero_ps(void)
r0 := r1 := r2 := r3 := 0.0
The prototypes for Streaming SIMD Extensions intrinsics are in the xmmintrin.h header file.
*p := a0
p[0] := a0
p[1] := a0
p[2] := a0
p[3] := a0
p[0] := a0
p[1] := a1
p[2] := a2
p[3] := a3
p[0] := a0
p[1] := a1
p[2] := a2
p[3] := a3
p[0] := a3
p[1] := a2
p[2] := a1
p[3] := a0
Sets the low word to the SP FP value of b. The upper 3 SP FP values are passed through from
a.
r0 := b0
r1 := a1
r2 := a2
r3 := a3
void _mm_pause(void)
The execution of the next instruction is delayed for an implementation-specific amount of time.
The instruction does not modify the architectural state. This intrinsic can provide a significant
performance gain in spin-wait loops, as described in more detail below.
PAUSE Intrinsic
The PAUSE intrinsic is used in spin-wait loops with the processors implementing dynamic execution
(especially out-of-order execution). In the spin-wait loop, PAUSE improves the speed at which the code
detects the release of the lock. For dynamic scheduling, the PAUSE instruction reduces the penalty of
exiting from the spin-loop.
spin_loop:
    pause
    cmp eax, A
    jne spin_loop
In the above example, the program spins until memory location A matches the value in register eax.
The code sequence that follows shows a test-and-test-and-set. In this example, the spin occurs only
after the attempt to get a lock has failed.
get_lock:
    mov eax, 1
    xchg eax, A          ; attempt to obtain the lock
    cmp eax, 0           ; test whether the lock was obtained
    jne spin_loop
critical_section:
    <critical_section code>
    mov A, 0             ; release the lock
    jmp continue
spin_loop:
    pause
    cmp A, 0             ; check whether the lock is free
    jne spin_loop
    jmp get_lock
continue:
Note that the first branch is predicted to fall through to the critical section in anticipation of
successfully gaining access to the lock. It is highly recommended that all spin-wait loops include the
PAUSE instruction. Since PAUSE is backward compatible with all existing IA-32 processor
generations, a test for processor type (a CPUID test) is not needed. All legacy processors execute
PAUSE as a NOP, but on processors that use PAUSE as a hint there can be a significant
performance benefit.
The prototypes for Streaming SIMD Extensions intrinsics are in the xmmintrin.h header file.
For these intrinsics you need to empty the multimedia state for the MMX register. See The EMMS
Instruction: Why You Need It and When to Use It topic for more details.
r0 := (n==0) ? d : a0;
r1 := (n==1) ? d : a1;
r2 := (n==2) ? d : a2;
r3 := (n==3) ? d : a3;
r0 := min(a0, b0)
r1 := min(a1, b1)
r2 := min(a2, b2)
r3 := min(a3, b3)
r0 := min(a0, b0)
r1 := min(a1, b1)
...
r7 := min(a7, b7)
r0 := min(a0, b0)
r1 := min(a1, b1)
r2 := min(a2, b2)
r3 := min(a3, b3)
r0 := min(a0, b0)
r1 := min(a1, b1)
...
r7 := min(a7, b7)
int _m_pmovmskb(__m64 a)
Creates an 8-bit mask from the most significant bits of the bytes in a.
Multiplies the unsigned words in a and b, returning the upper 16 bits of the 32-bit intermediate
results.
r0 := hiword(a0 * b0)
r1 := hiword(a1 * b1)
r2 := hiword(a2 * b2)
r3 := hiword(a3 * b3)
r0 := word (n&0x3) of a
r1 := word ((n>>2)&0x3) of a
r2 := word ((n>>4)&0x3) of a
r3 := word ((n>>6)&0x3) of a
Conditionally store byte elements of d to address p. The high bit of each byte in the selector n
determines whether the corresponding byte in d will be stored.
if (sign(n0)) p[0] := d0
if (sign(n1)) p[1] := d1
...
if (sign(n7)) p[7] := d7
Computes the sum of the absolute differences of the unsigned bytes in a and b, returning the
value in the lower word. The upper three words are cleared.
The intrinsics are listed in the following table. Syntax and a brief description are contained in the
following topics.
The prototypes for Streaming SIMD Extensions intrinsics are in the xmmintrin.h header file.
_mm_stream_pi
_mm_stream_ps
_mm_sfence
Loads an SP FP value into the low word and clears the upper three words.
r0 := *a
r1 := 0.0 ; r2 := 0.0 ; r3 := 0.0
r0 := *a
r1 := *a
r2 := *a
r3 := *a
r0 := a[0]
r1 := a[1]
r2 := a[2]
r3 := a[3]
r0 := a[0]
r1 := a[1]
r2 := a[2]
r3 := a[3]
r0 := a[3]
r1 := a[2]
r2 := a[1]
r3 := a[0]
__m128 _mm_set_ss(float a)
Sets the low word of an SP FP value to a and clears the upper three words.
r0 := a
r1 := r2 := r3 := 0.0
__m128 _mm_set_ps1(float a)
r0 := r1 := r2 := r3 := a
r0 := a
r1 := b
r2 := c
r3 := d
r0 := d
r1 := c
r2 := b
r3 := a
__m128 _mm_setzero_ps(void)
r0 := r1 := r2 := r3 := 0.0
*v := a0
v[0] := a0
v[1] := a0
v[2] := a0
v[3] := a0
v[0] := a0
v[1] := a1
v[2] := a2
v[3] := a3
v[0] := a0
v[1] := a1
v[2] := a2
v[3] := a3
v[0] := a3
v[1] := a2
v[2] := a1
v[3] := a0
Sets the low word to the SP FP value of b. The upper 3 SP FP values are passed through from
a.
r0 := b0
r1 := a1
r2 := a2
r3 := a3
(uses PREFETCH) Loads one cache line of data from address a to a location "closer" to the
processor. The value sel specifies the type of prefetch operation: the constants
_MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, and _MM_HINT_NTA should be used,
corresponding to the type of prefetch instruction.
(uses MOVNTQ) Stores the data in a to the address p without polluting the caches. This intrinsic
requires you to empty the multimedia state for the MMX register. See The EMMS Instruction: Why
You Need It and When to Use It for more details.
(see MOVNTPS) Stores the data in a to the address p without polluting the caches. The address
must be 16-byte-aligned.
void _mm_sfence(void)
(uses SFENCE) Guarantees that every preceding store is globally visible before any subsequent
store.
Selects four specific SP FP values from a and b, based on the mask imm8. The mask must be
an immediate. See Macro Function for Shuffle Using Streaming SIMD Extensions for a
description of the shuffle semantics.
r0 := a2
r1 := b2
r2 := a3
r3 := b3
r0 := a0
r1 := b0
r2 := a1
r3 := b1
Sets the upper two SP FP values with 64 bits of data loaded from the address p.
r0 := a0
r1 := a1
r2 := *p0
r3 := *p1
*p0 := a2
*p1 := a3
Moves the upper 2 SP FP values of b to the lower 2 SP FP values of the result. The upper 2 SP
FP values of a are passed through to the result.
r3 := a3
r2 := a2
r1 := b3
r0 := b2
Moves the lower 2 SP FP values of b to the upper 2 SP FP values of the result. The lower 2 SP
FP values of a are passed through to the result.
r3 := b1
r2 := b0
r1 := a1
r0 := a0
Sets the lower two SP FP values with 64 bits of data loaded from the address p; the upper two
values are passed through from a.
r0 := *p0
r1 := *p1
r2 := a2
r3 := a3
*p0 := a0
*p1 := a1
int _mm_movemask_ps(__m128 a)
Creates a 4-bit mask from the most significant bits of the four SP FP values.
To write programs with the intrinsics, you should be familiar with the hardware features provided by
the Streaming SIMD Extensions. Keep the following important issues in mind:
• Certain intrinsics are provided only for compatibility with previously-defined IA-32 intrinsics.
Using them on Itanium-based systems can degrade performance. See the section below.
• Floating-point (FP) data loaded or stored as __m128 objects must be 16-byte-aligned.
• Some intrinsics require that their arguments be immediates, that is, constant integers (literals),
due to the nature of the instruction.
Data Types
The new data type __m128 is used with the Streaming SIMD Extensions intrinsics. It represents a 128-
bit quantity composed of four single-precision FP values. This corresponds to the 128-bit IA-32
Streaming SIMD Extensions register.
The compiler aligns __m128 local data to 16-byte boundaries on the stack. Global data of these
types is also 16-byte-aligned. To align integer, float, or double arrays, you can use the declspec
alignment specifier.
Because Itanium instructions treat the Streaming SIMD Extensions registers in the same way whether
you are using packed or scalar data, there is no __m32 data type to represent scalar data. For scalar
operations, use the __m128 objects and the "scalar" forms of the intrinsics; the compiler and the
processor implement these operations with 32-bit memory references. However, for better
performance, the packed form should be substituted for the scalar form whenever possible.
For more information, see Intel Architecture Software Developer's Manual, Volume 2: Instruction Set
Reference Manual, Intel Corporation, doc. number 243191.
Streaming SIMD Extensions intrinsics are defined for the __m128 data type, a 128-bit quantity
consisting of four single-precision FP values. SIMD instructions for Itanium-based systems operate on
64-bit FP register quantities containing two single-precision floating-point values. Thus, each __m128
operand is actually a pair of FP registers and therefore each intrinsic corresponds to at least one pair
of Itanium instructions operating on the pair of FP register operands.
Many of the Streaming SIMD Extensions intrinsics for Itanium-based systems were created for
compatibility with existing IA-32 intrinsics and not for performance. In some situations, intrinsic usage
that improved performance on IA-32 will not do so on Itanium-based systems. One reason for this is
that some intrinsics map nicely into the IA-32 instruction set but not into the Itanium instruction set.
Thus, it is important to differentiate between intrinsics which were implemented for a performance
advantage on Itanium-based systems, and those implemented simply to provide compatibility with
existing IA-32 code.
The following intrinsics are likely to reduce performance and should only be used to initially port legacy
code or in non-critical code sections:
• Any Streaming SIMD Extensions scalar intrinsic (_ss variety): use the packed (_ps) version if
possible.
• comi and ucomi Streaming SIMD Extensions comparisons: these correspond to the IA-32
COMISS and UCOMISS instructions only. A sequence of Itanium instructions is required to
implement them.
• Conversions in general are multi-instruction operations. These are particularly expensive:
_mm_cvtpi16_ps, _mm_cvtpu16_ps, _mm_cvtpi8_ps, _mm_cvtpu8_ps,
_mm_cvtpi32x2_ps, _mm_cvtps_pi16, _mm_cvtps_pi8
• The Streaming SIMD Extensions utility intrinsic _mm_movemask_ps
If the inaccuracy is acceptable, the SIMD reciprocal and reciprocal square root approximation intrinsics
(rcp and rsqrt) are much faster than the true div and sqrt intrinsics.
You can view the four integers as selectors for choosing which two words from the first input operand
and which two words from the second are to be put into the result word.
_MM_MASK_INEXACT
The following example masks the overflow and underflow exceptions and unmasks all other
exceptions.
Write to and read from bits thirteen and fourteen of the control
register.
_MM_ROUND_TOWARD_ZERO
The following example tests the rounding mode for round toward zero.
if (_MM_GET_ROUNDING_MODE() == _MM_ROUND_TOWARD_ZERO) {
    /* rounding mode is round toward zero */
}
Macro Definition
The arguments row0, row1, row2, and row3 are __m128 values whose elements form the
corresponding rows of a 4 by 4 matrix. The matrix transposition is returned in arguments row0, row1,
row2, and row3 where row0 now holds column 0 of the original matrix, row1 now holds column 1 of
the original matrix, and so on.
The transposition function of this macro is illustrated in the "Matrix Transposition Using the
_MM_TRANSPOSE4_PS" figure.
• Floating-Point Intrinsics -- describes the arithmetic, logical, compare, conversion, memory, and
initialization intrinsics for the double-precision floating-point data type (__m128d).
• Integer Intrinsics -- describes the arithmetic, logical, compare, conversion, memory, and
initialization intrinsics for the extended-precision integer data type (__m128i).
Note
The Pentium 4 processor Streaming SIMD Extensions 2 intrinsics are defined only for IA-32 platforms,
not Itanium®-based platforms. Pentium 4 processor Streaming SIMD Extensions 2 operate on
128-bit quantities: two 64-bit double-precision floating-point values. The Itanium processor does not
support parallel double-precision computation, so Pentium 4 processor Streaming SIMD Extensions
2 are not implemented on Itanium-based systems.
For more details, refer to the Pentium® 4 processor Streaming SIMD Extensions 2 External
Architecture Specification (EAS) and other Pentium 4 processor manuals available for download from
the developer.intel.com web site. You should be familiar with the hardware features provided by the
Streaming SIMD Extensions 2 when writing programs with the intrinsics. The following are three
important issues to keep in mind:
• Certain intrinsics, such as _mm_loadr_pd and _mm_cmpgt_sd, are not directly supported by
the instruction set. While these intrinsics are convenient programming aids, be mindful of their
implementation cost.
• Data loaded or stored as __m128d objects must generally be 16-byte-aligned.
• Some intrinsics require that their arguments be immediates, that is, constant integers (literals),
due to the nature of the instruction.
The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
r0 := a0 + b0
r1 := a1
r0 := a0 + b0
r1 := a1 + b1
Subtracts the lower DP FP value of b from a. The upper DP FP value is passed through from a.
r0 := a0 - b0
r1 := a1
r0 := a0 - b0
r1 := a1 - b1
Multiplies the lower DP FP values of a and b. The upper DP FP is passed through from a.
r0 := a0 * b0
r1 := a1
r0 := a0 * b0
r1 := a1 * b1
Divides the lower DP FP values of a and b. The upper DP FP value is passed through from a.
r0 := a0 / b0
r1 := a1
r0 := a0 / b0
r1 := a1 / b1
Computes the square root of the lower DP FP value of b. The upper DP FP value is passed
through from a.
r0 := sqrt(b0)
r1 := a1
__m128d _mm_sqrt_pd(__m128d a)
r0 := sqrt(a0)
r1 := sqrt(a1)
Computes the minimum of the lower DP FP values of a and b. The upper DP FP value is
passed through from a.
r0 := min(a0, b0)
r1 := min(a1, b1)
Computes the maximum of the lower DP FP values of a and b. The upper DP FP value is
passed through from a.
r0 := max(a0, b0)
r1 := max(a1, b1)
(uses ANDPD) Computes the bitwise AND of the two DP FP values of a and b.
r0 := a0 & b0
r1 := a1 & b1
(uses ANDNPD) Computes the bitwise AND of the 128-bit value in b and the bitwise NOT of the
128-bit value in a.
r0 := (~a0) & b0
r1 := (~a1) & b1
r0 := a0 | b0
r1 := a1 | b1
(uses XORPD) Computes the bitwise XOR of the two DP FP values of a and b.
r0 := a0 ^ b0
r1 := a1 ^ b1
The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
Compares the two DP FP values of a and b for a not less than or equal to b.
Compares the two DP FP values of a and b for a not greater than or equal to b.
Compares the lower DP FP value of a and b for equality. The upper DP FP value is passed
through from a.
Compares the lower DP FP value of a and b for a less than b. The upper DP FP value is
passed through from a.
Compares the lower DP FP value of a and b for a less than or equal to b. The upper DP FP
value is passed through from a.
Compares the lower DP FP value of a and b for a greater than b. The upper DP FP value is
passed through from a.
Compares the lower DP FP value of a and b for a greater than or equal to b. The upper DP FP
value is passed through from a.
Compares the lower DP FP value of a and b for ordered. The upper DP FP value is passed
through from a.
Compares the lower DP FP value of a and b for unordered. The upper DP FP value is passed
through from a.
Compares the lower DP FP value of a and b for inequality. The upper DP FP value is passed
through from a.
Compares the lower DP FP value of a and b for a not less than b. The upper DP FP value is
passed through from a.
Compares the lower DP FP value of a and b for a not less than or equal to b. The upper DP FP
value is passed through from a.
Compares the lower DP FP value of a and b for a not greater than b. The upper DP FP value is
passed through from a.
Compares the lower DP FP value of a and b for a not greater than or equal to b. The upper DP
FP value is passed through from a.
Compares the lower DP FP value of a and b for a equal to b. If a and b are equal, 1 is returned.
Otherwise 0 is returned.
Compares the lower DP FP value of a and b for a less than b. If a is less than b, 1 is returned.
Otherwise 0 is returned.
Compares the lower DP FP value of a and b for a less than or equal to b. If a is less than or
equal to b, 1 is returned. Otherwise 0 is returned.
Compares the lower DP FP value of a and b for a greater than b. If a is greater than b, 1 is
returned. Otherwise 0 is returned.
Compares the lower DP FP value of a and b for a greater than or equal to b. If a is greater than
or equal to b, 1 is returned. Otherwise 0 is returned.
Compares the lower DP FP value of a and b for a not equal to b. If a and b are not equal, 1 is
returned. Otherwise 0 is returned.
Compares the lower DP FP value of a and b for a equal to b. If a and b are equal, 1 is returned.
Otherwise 0 is returned.
Compares the lower DP FP value of a and b for a less than b. If a is less than b, 1 is returned.
Otherwise 0 is returned.
Compares the lower DP FP value of a and b for a less than or equal to b. If a is less than or
equal to b, 1 is returned. Otherwise 0 is returned.
Compares the lower DP FP value of a and b for a greater than b. If a is greater than b, 1 is
returned. Otherwise 0 is returned.
Compares the lower DP FP value of a and b for a greater than or equal to b. If a is greater than
or equal to b, 1 is returned. Otherwise 0 is returned.
Compares the lower DP FP value of a and b for a not equal to b. If a and b are not equal, 1 is
returned. Otherwise 0 is returned.
The conversion-operation intrinsics for Streaming SIMD Extensions 2 are listed in the following table
followed by detailed descriptions.
The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
__m128 _mm_cvtpd_ps(__m128d a)
r0 := (float) a0
r1 := (float) a1
r2 := 0.0 ; r3 := 0.0
__m128d _mm_cvtps_pd(__m128 a)
r0 := (double) a0
r1 := (double) a1
__m128d _mm_cvtepi32_pd(__m128i a)
r0 := (double) a0
r1 := (double) a1
__m128i _mm_cvtpd_epi32(__m128d a)
r0 := (int) a0
r1 := (int) a1
r2 := 0x0 ; r3 := 0x0
int _mm_cvtsd_si32(__m128d a)
r := (int) a0
r0 := (float) b0
r1 := a1; r2 := a2 ; r3 := a3
Converts the signed integer value in b to a DP FP value. The upper DP FP value in a is passed
through.
r0 := (double) b
r1 := a1
r0 := (double) b0
r1 := a1
__m128i _mm_cvttpd_epi32(__m128d a)
r0 := (int) a0
r1 := (int) a1
r2 := 0x0 ; r3 := 0x0
int _mm_cvttsd_si32(__m128d a)
r := (int) a0
__m64 _mm_cvtpd_pi32(__m128d a)
r0 := (int) a0
r1 := (int) a1
__m64 _mm_cvttpd_pi32(__m128d a)
Converts the two DP FP values of a to 32-bit signed integer values using truncate.
r0 := (int) a0
r1 := (int) a1
__m128d _mm_cvtpi32_pd(__m64 a)
r0 := (double) a0
r1 := (double) a1
Note
There is no intrinsic for move operations. To move data from one register to another, a simple
assignment, A = B, suffices, where A is the target register and B is the source register for the
move operation.
The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
(uses MOVAPD) Loads two DP FP values. The address p must be 16-byte aligned.
r0 := p[0]
r1 := p[1]
(uses MOVSD + shuffling) Loads a single DP FP value, copying to both elements. The address p
need not be 16-byte aligned.
r0 := *p
r1 := *p
(uses MOVAPD + shuffling) Loads two DP FP values in reverse order. The address p must be
16-byte aligned.
r0 := p[1]
r1 := p[0]
(uses MOVUPD) Loads two DP FP values. The address p need not be 16-byte aligned.
r0 := p[0]
r1 := p[1]
(uses MOVSD) Loads a DP FP value. The upper DP FP is set to zero. The address p need not
be 16-byte aligned.
r0 := *p
r1 := 0.0
(uses MOVHPD) Loads a DP FP value as the upper DP FP value of the result. The lower DP FP
value is passed through from a. The address p need not be 16-byte aligned.
r0 := a0
r1 := *p
(uses MOVLPD) Loads a DP FP value as the lower DP FP value of the result. The upper DP FP
value is passed through from a. The address p need not be 16-byte aligned.
r0 := *p
r1 := a1
The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
__m128d _mm_set_sd(double w)
(composite) Sets the lower DP FP value to w and sets the upper DP FP value to zero.
r0 := w
r1 := 0.0
__m128d _mm_set1_pd(double w)
r0 := w
r1 := w
(composite) Sets the lower DP FP value to x and sets the upper DP FP value to w.
r0 := x
r1 := w
(composite) Sets the lower DP FP value to w and sets the upper DP FP value to x.
r0 := w
r1 := x
__m128d _mm_setzero_pd(void)
r0 := 0.0
r1 := 0.0
(uses MOVSD) Sets the lower DP FP value to the lower DP FP value of b. The upper DP FP
value is passed through from a.
r0 := b0
r1 := a1
The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
(uses MOVSD) Stores the lower DP FP value of a. The address dp need not be 16-byte aligned.
*dp := a0
(uses MOVAPD + shuffling) Stores the lower DP FP value of a twice. The address dp must be
16-byte aligned.
dp[0] := a0
dp[1] := a0
(uses MOVAPD) Stores two DP FP values. The address dp must be 16-byte aligned.
dp[0] := a0
dp[1] := a1
(uses MOVUPD) Stores two DP FP values. The address dp need not be 16-byte aligned.
dp[0] := a0
dp[1] := a1
(uses MOVAPD + shuffling) Stores two DP FP values in reverse order. The address dp must be
16-byte aligned.
dp[0] := a1
dp[1] := a0
*dp := a1
*dp := a0
r0 := a1
r1 := b1
r0 := a0
r1 := b0
int _mm_movemask_pd(__m128d a)
(uses MOVMSKPD) Creates a two-bit mask from the sign bits of the two DP FP values of a.
(uses SHUFPD) Selects two specific DP FP values from a and b, based on the mask i. The
mask must be an immediate. See Macro Function for Shuffle for a description of the shuffle
semantics.
The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
Adds the 16 signed or unsigned 8-bit integers in a to the 16 signed or unsigned 8-bit integers in
b.
r0 := a0 + b0
r1 := a1 + b1
...
r15 := a15 + b15
Adds the 8 signed or unsigned 16-bit integers in a to the 8 signed or unsigned 16-bit integers in
b.
r0 := a0 + b0
r1 := a1 + b1
...
r7 := a7 + b7
Adds the 4 signed or unsigned 32-bit integers in a to the 4 signed or unsigned 32-bit integers in
b.
r0 := a0 + b0
r1 := a1 + b1
r2 := a2 + b2
r3 := a3 + b3
Adds the signed or unsigned 64-bit integer a to the signed or unsigned 64-bit integer b.
r := a + b
Adds the 2 signed or unsigned 64-bit integers in a to the 2 signed or unsigned 64-bit integers in
b.
r0 := a0 + b0
r1 := a1 + b1
Adds the 16 signed 8-bit integers in a to the 16 signed 8-bit integers in b using saturating
arithmetic.
r0 := SignedSaturate(a0 + b0)
r1 := SignedSaturate(a1 + b1)
...
r15 := SignedSaturate(a15 + b15)
Adds the 8 signed 16-bit integers in a to the 8 signed 16-bit integers in b using saturating
arithmetic.
r0 := SignedSaturate(a0 + b0)
r1 := SignedSaturate(a1 + b1)
...
r7 := SignedSaturate(a7 + b7)
Adds the 16 unsigned 8-bit integers in a to the 16 unsigned 8-bit integers in b using saturating
arithmetic.
r0 := UnsignedSaturate(a0 + b0)
r1 := UnsignedSaturate(a1 + b1)
...
r15 := UnsignedSaturate(a15 + b15)
Adds the 8 unsigned 16-bit integers in a to the 8 unsigned 16-bit integers in b using saturating
arithmetic.
r0 := UnsignedSaturate(a0 + b0)
r1 := UnsignedSaturate(a1 + b1)
...
r7 := UnsignedSaturate(a7 + b7)
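The difference between the wrapping and saturating forms can be seen with a small helper. The sketch below assumes an SSE2-capable target; the helper name add_wrap_vs_saturate is illustrative, not part of the header.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Contrasts wrapping byte addition (_mm_add_epi8) with unsigned
   saturating byte addition (_mm_adds_epu8). Helper name is
   illustrative; assumes an SSE2-capable target. */
static void add_wrap_vs_saturate(uint8_t x, uint8_t y,
                                 uint8_t *wrapped, uint8_t *saturated)
{
    __m128i a = _mm_set1_epi8((char)x);
    __m128i b = _mm_set1_epi8((char)y);
    /* _mm_add_epi8 wraps modulo 256; _mm_adds_epu8 clamps at 255 */
    *wrapped   = (uint8_t)_mm_cvtsi128_si32(_mm_add_epi8(a, b));
    *saturated = (uint8_t)_mm_cvtsi128_si32(_mm_adds_epu8(a, b));
}
```

For inputs 200 and 100, the wrapping form yields 44 (300 mod 256) while the saturating form yields 255.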
Computes the average of the 16 unsigned 8-bit integers in a and the 16 unsigned 8-bit integers
in b and rounds.
r0 := (a0 + b0) / 2
r1 := (a1 + b1) / 2
...
r15 := (a15 + b15) / 2
Computes the average of the 8 unsigned 16-bit integers in a and the 8 unsigned 16-bit integers
in b and rounds.
r0 := (a0 + b0) / 2
r1 := (a1 + b1) / 2
...
r7 := (a7 + b7) / 2
Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b. Adds the
signed 32-bit integer results pairwise and packs the 4 signed 32-bit integer results.
Computes the pairwise maxima of the 8 signed 16-bit integers from a and the 8 signed 16-bit
integers from b.
r0 := max(a0, b0)
r1 := max(a1, b1)
...
r7 := max(a7, b7)
Computes the pairwise maxima of the 16 unsigned 8-bit integers from a and the 16 unsigned 8-
bit integers from b.
r0 := max(a0, b0)
r1 := max(a1, b1)
...
r15 := max(a15, b15)
Computes the pairwise minima of the 8 signed 16-bit integers from a and the 8 signed 16-bit
integers from b.
r0 := min(a0, b0)
r1 := min(a1, b1)
...
r7 := min(a7, b7)
Computes the pairwise minima of the 16 unsigned 8-bit integers from a and the 16 unsigned 8-
bit integers from b.
r0 := min(a0, b0)
r1 := min(a1, b1)
...
r15 := min(a15, b15)
Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b. Packs the
upper 16-bits of the 8 signed 32-bit results.
r0 := (a0 * b0)[31:16]
r1 := (a1 * b1)[31:16]
...
r7 := (a7 * b7)[31:16]
Multiplies the 8 unsigned 16-bit integers from a by the 8 unsigned 16-bit integers from b. Packs
the upper 16-bits of the 8 unsigned 32-bit results.
r0 := (a0 * b0)[31:16]
r1 := (a1 * b1)[31:16]
...
r7 := (a7 * b7)[31:16]
__m128i _mm_mullo_epi16(__m128i a, __m128i b)
Multiplies the 8 signed or unsigned 16-bit integers from a by the 8 signed or unsigned 16-bit
integers from b. Packs the lower 16-bits of the 8 signed or unsigned 32-bit results.
r0 := (a0 * b0)[15:0]
r1 := (a1 * b1)[15:0]
...
r7 := (a7 * b7)[15:0]
Multiplies the lower 32-bit integer from a by the lower 32-bit integer from b, and returns the 64-
bit integer result.
r := a0 * b0
Multiplies 2 unsigned 32-bit integers from a by 2 unsigned 32-bit integers from b. Packs the 2
unsigned 64-bit integer results.
r0 := a0 * b0
r1 := a2 * b2
Computes the absolute difference of the 16 unsigned 8-bit integers from a and the 16 unsigned
8-bit integers from b. Sums the upper 8 differences and lower 8 differences, and packs the
resulting 2 unsigned 16-bit integers into the upper and lower 64-bit elements.
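A common use of the SAD operation is summing absolute byte differences for block matching. The sketch below reads the sum over the lower 8 bytes; the helper name sad_low8 is illustrative, assuming an SSE2-capable target.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Sums |x[i] - y[i]| over bytes 0-7 using _mm_sad_epu8.
   Helper name is illustrative; assumes an SSE2-capable target. */
static int sad_low8(const uint8_t x[16], const uint8_t y[16])
{
    __m128i a = _mm_loadu_si128((const __m128i *)x);
    __m128i b = _mm_loadu_si128((const __m128i *)y);
    /* the two 16-bit sums land in the low bits of the two 64-bit lanes */
    __m128i s = _mm_sad_epu8(a, b);
    return _mm_cvtsi128_si32(s); /* sum over the lower 8 byte differences */
}
```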
Subtracts the 16 signed or unsigned 8-bit integers of b from the 16 signed or unsigned 8-bit
integers of a.
r0 := a0 - b0
r1 := a1 - b1
...
r15 := a15 - b15
__m128i _mm_sub_epi16(__m128i a, __m128i b)
Subtracts the 8 signed or unsigned 16-bit integers of b from the 8 signed or unsigned 16-bit
integers of a.
r0 := a0 - b0
r1 := a1 - b1
...
r7 := a7 - b7
Subtracts the 4 signed or unsigned 32-bit integers of b from the 4 signed or unsigned 32-bit
integers of a.
r0 := a0 - b0
r1 := a1 - b1
r2 := a2 - b2
r3 := a3 - b3
Subtracts the signed or unsigned 64-bit integer b from the signed or unsigned 64-bit integer a.
r := a - b
Subtracts the 2 signed or unsigned 64-bit integers in b from the 2 signed or unsigned 64-bit
integers in a.
r0 := a0 - b0
r1 := a1 - b1
Subtracts the 16 signed 8-bit integers of b from the 16 signed 8-bit integers of a using
saturating arithmetic.
r0 := SignedSaturate(a0 - b0)
r1 := SignedSaturate(a1 - b1)
...
r15 := SignedSaturate(a15 - b15)
Subtracts the 8 signed 16-bit integers of b from the 8 signed 16-bit integers of a using
saturating arithmetic.
r0 := SignedSaturate(a0 - b0)
r1 := SignedSaturate(a1 - b1)
...
r7 := SignedSaturate(a7 - b7)
Subtracts the 16 unsigned 8-bit integers of b from the 16 unsigned 8-bit integers of a using
saturating arithmetic.
r0 := UnsignedSaturate(a0 - b0)
r1 := UnsignedSaturate(a1 - b1)
...
r15 := UnsignedSaturate(a15 - b15)
Subtracts the 8 unsigned 16-bit integers of b from the 8 unsigned 16-bit integers of a using
saturating arithmetic.
r0 := UnsignedSaturate(a0 - b0)
r1 := UnsignedSaturate(a1 - b1)
...
r7 := UnsignedSaturate(a7 - b7)
The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
(uses PAND) Computes the bitwise AND of the 128-bit value in a and the 128-bit value in b.
r := a & b
(uses PANDN) Computes the bitwise AND of the 128-bit value in b and the bitwise NOT of the
128-bit value in a.
r := (~a) & b
(uses POR) Computes the bitwise OR of the 128-bit value in a and the 128-bit value in b.
r := a | b
(uses PXOR) Computes the bitwise XOR of the 128-bit value in a and the 128-bit value in b.
r := a ^ b
The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
Shifts the 128-bit value in a left by imm bytes while shifting in zeros. imm must be an immediate.
r := a << (imm * 8)
Shifts the 8 signed or unsigned 16-bit integers in a left by count bits while shifting in zeros.
r0 := a0 << count
r1 := a1 << count
...
r7 := a7 << count
Shifts the 8 signed or unsigned 16-bit integers in a left by count bits while shifting in zeros.
r0 := a0 << count
r1 := a1 << count
...
r7 := a7 << count
Shifts the 4 signed or unsigned 32-bit integers in a left by count bits while shifting in zeros.
r0 := a0 << count
r1 := a1 << count
r2 := a2 << count
r3 := a3 << count
Shifts the 4 signed or unsigned 32-bit integers in a left by count bits while shifting in zeros.
r0 := a0 << count
r1 := a1 << count
r2 := a2 << count
r3 := a3 << count
Shifts the 2 signed or unsigned 64-bit integers in a left by count bits while shifting in zeros.
r0 := a0 << count
r1 := a1 << count
Shifts the 2 signed or unsigned 64-bit integers in a left by count bits while shifting in zeros.
r0 := a0 << count
r1 := a1 << count
Shifts the 8 signed 16-bit integers in a right by count bits while shifting in the sign bit.
r0 := a0 >> count
r1 := a1 >> count
...
r7 := a7 >> count
Shifts the 8 signed 16-bit integers in a right by count bits while shifting in the sign bit.
r0 := a0 >> count
r1 := a1 >> count
...
r7 := a7 >> count
Shifts the 4 signed 32-bit integers in a right by count bits while shifting in the sign bit.
r0 := a0 >> count
r1 := a1 >> count
r2 := a2 >> count
r3 := a3 >> count
Shifts the 4 signed 32-bit integers in a right by count bits while shifting in the sign bit.
r0 := a0 >> count
r1 := a1 >> count
r2 := a2 >> count
r3 := a3 >> count
Shifts the 128-bit value in a right by imm bytes while shifting in zeros. imm must be an
immediate.
r := srl(a, imm*8)
Shifts the 8 signed or unsigned 16-bit integers in a right by count bits while shifting in zeros.
r0 := srl(a0, count)
r1 := srl(a1, count)
...
r7 := srl(a7, count)
Shifts the 8 signed or unsigned 16-bit integers in a right by count bits while shifting in zeros.
r0 := srl(a0, count)
r1 := srl(a1, count)
...
r7 := srl(a7, count)
Shifts the 4 signed or unsigned 32-bit integers in a right by count bits while shifting in zeros.
r0 := srl(a0, count)
r1 := srl(a1, count)
r2 := srl(a2, count)
r3 := srl(a3, count)
Shifts the 4 signed or unsigned 32-bit integers in a right by count bits while shifting in zeros.
r0 := srl(a0, count)
r1 := srl(a1, count)
r2 := srl(a2, count)
r3 := srl(a3, count)
Shifts the 2 signed or unsigned 64-bit integers in a right by count bits while shifting in zeros.
r0 := srl(a0, count)
r1 := srl(a1, count)
Shifts the 2 signed or unsigned 64-bit integers in a right by count bits while shifting in zeros.
r0 := srl(a0, count)
r1 := srl(a1, count)
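The difference between the sign-filling (arithmetic) and zero-filling (logical) right shifts above is easy to observe on a negative value. The helper name below is illustrative, assuming an SSE2-capable target.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Contrasts the arithmetic right shift (shifts in the sign bit) with
   the logical right shift (shifts in zeros) on one 32-bit lane.
   Helper name is illustrative; assumes an SSE2-capable target. */
static void shift_right_by1(int32_t v, int32_t *arith, int32_t *logical)
{
    __m128i a = _mm_set1_epi32(v);
    *arith   = _mm_cvtsi128_si32(_mm_srai_epi32(a, 1)); /* sign fill */
    *logical = _mm_cvtsi128_si32(_mm_srli_epi32(a, 1)); /* zero fill */
}
```

For v = -8, the arithmetic shift yields -4, while the logical shift treats the bits as unsigned and yields 0x7FFFFFFC.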
The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
Compares the 16 signed or unsigned 8-bit integers in a and the 16 signed or unsigned 8-bit
integers in b for equality.
Compares the 8 signed or unsigned 16-bit integers in a and the 8 signed or unsigned 16-bit
integers in b for equality.
Compares the 4 signed or unsigned 32-bit integers in a and the 4 signed or unsigned 32-bit
integers in b for equality.
Compares the 16 signed 8-bit integers in a and the 16 signed 8-bit integers in b for greater than.
Compares the 8 signed 16-bit integers in a and the 8 signed 16-bit integers in b for greater than.
Compares the 4 signed 32-bit integers in a and the 4 signed 32-bit integers in b for greater than.
Compares the 16 signed 8-bit integers in a and the 16 signed 8-bit integers in b for less than.
Compares the 8 signed 16-bit integers in a and the 8 signed 16-bit integers in b for less than.
Compares the 4 signed 32-bit integers in a and the 4 signed 32-bit integers in b for less than.
The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
__m128i _mm_cvtsi32_si128(int a)
(uses MOVD) Moves 32-bit integer a to the least significant 32 bits of an __m128i object,
zeroing the upper 96 bits of the __m128i object.
r0 := a
r1 := 0x0 ; r2 := 0x0 ; r3 := 0x0
int _mm_cvtsi128_si32(__m128i a)
(uses MOVD) Moves the least significant 32 bits of a to a 32-bit integer.
r := a0
__m128 _mm_cvtepi32_ps(__m128i a)
Converts the 4 signed 32-bit integer values of a to SP FP values.
r0 := (float) a0
r1 := (float) a1
r2 := (float) a2
r3 := (float) a3
__m128i _mm_cvtps_epi32(__m128 a)
Converts the 4 SP FP values of a to signed 32-bit integer values.
r0 := (int) a0
r1 := (int) a1
r2 := (int) a2
r3 := (int) a3
__m128i _mm_cvttps_epi32(__m128 a)
Converts the 4 SP FP values of a to signed 32-bit integer values using truncation.
r0 := (int) a0
r1 := (int) a1
r2 := (int) a2
r3 := (int) a3
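The rounding and truncating conversions differ for any value with a fractional part. The sketch below assumes the default MXCSR rounding mode (round to nearest); the helper name is illustrative.

```c
#include <emmintrin.h>

/* Contrasts _mm_cvtps_epi32 (rounds per MXCSR, round-to-nearest by
   default) with _mm_cvttps_epi32 (always truncates toward zero).
   Helper name is illustrative; assumes an SSE2-capable target. */
static void float_to_int_modes(float f, int *rounded, int *truncated)
{
    __m128 a = _mm_set1_ps(f);
    *rounded   = _mm_cvtsi128_si32(_mm_cvtps_epi32(a));
    *truncated = _mm_cvtsi128_si32(_mm_cvttps_epi32(a));
}
```

For f = 1.75, the rounding form returns 2 and the truncating form returns 1.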
You can view the two integers as selectors for choosing which two words from the first input operand
and which two words from the second are to be put into the result word.
(uses MOVNTPD) Stores the data in a to the address p without polluting caches. The address p
must be 16-byte aligned. If the cache line containing address p is already in the cache, the
cache will be updated.
p[0] := a0
p[1] := a1
Stores the data in a to the address p without polluting the caches. If the cache line containing
address p is already in the cache, the cache will be updated. Address p must be 16-byte
aligned.
*p := a
Stores the data in a to the address p without polluting the caches. If the cache line containing
address p is already in the cache, the cache will be updated.
*p := a
Cache line containing p is flushed and invalidated from all caches in the coherency domain.
void _mm_lfence(void)
Guarantees that every load instruction that precedes, in program order, the load fence
instruction is globally visible before any load instruction which follows the fence in program
order.
void _mm_mfence(void)
Guarantees that every memory access that precedes, in program order, the memory fence
instruction is globally visible before any memory instruction which follows the fence in program
order.
void _mm_pause(void)
The execution of the next instruction is delayed an implementation-specific amount of time. The
instruction does not modify the architectural state. This intrinsic can provide a significant
performance gain, as described in more detail below.
PAUSE Intrinsic
The PAUSE intrinsic is used in spin-wait loops with the processors implementing dynamic execution
(especially out-of-order execution). In the spin-wait loop, PAUSE improves the speed at which the code
detects the release of the lock. For dynamic scheduling, the PAUSE instruction reduces the penalty of
exiting from the spin-loop.
spin_loop: pause
cmp eax, A
jne spin_loop
In the above example, the program spins until memory location A matches the value in register eax.
The code sequence that follows shows a test-and-test-and-set. In this example, the spin occurs only
after the attempt to get a lock has failed.
get_lock:
    mov eax, 1
    xchg eax, A         ; attempt to obtain the lock
    cmp eax, 0
    jne spin_loop       ; lock was busy; spin until it is released
    <critical_section code>
    mov A, 0            ; release the lock
    jmp continue
spin_loop:
    pause               ; spin-loop hint
    cmp A, 0            ; is the lock now free?
    jne spin_loop
    jmp get_lock        ; retry the lock
continue:
Note that the first branch is predicted to fall through to the critical section in anticipation of successfully
gaining access to the lock. It is highly recommended that all spin-wait loops include the PAUSE
instruction. Because PAUSE is backward compatible with all existing IA-32 processor generations, a test for
processor type (a CPUID test) is not needed. All legacy processors execute PAUSE as a NOP, but on
processors that use PAUSE as a hint there can be a significant performance benefit.
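The spin-wait pattern above can be written in C with the _mm_pause intrinsic. The sketch below uses C11 atomics for the flag; the function name spin_until_set is illustrative, not part of any header.

```c
#include <emmintrin.h>
#include <stdatomic.h>

/* Minimal C sketch of the spin-wait loop above. The _mm_pause call
   corresponds to the PAUSE hint in the assembly sequence. Helper
   name is illustrative; assumes an x86 target with SSE2. */
static void spin_until_set(atomic_int *flag)
{
    /* equivalent of the "spin_loop: pause / cmp / jne" sequence */
    while (atomic_load_explicit(flag, memory_order_acquire) == 0)
        _mm_pause();  /* hint: cheaper spinning, faster loop exit */
}
```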
The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
Packs the 16 signed 16-bit integers from a and b into 8-bit integers and saturates.
r0 := SignedSaturate(a0)
r1 := SignedSaturate(a1)
...
r7 := SignedSaturate(a7)
r8 := SignedSaturate(b0)
r9 := SignedSaturate(b1)
...
r15 := SignedSaturate(b7)
Packs the 8 signed 32-bit integers from a and b into signed 16-bit integers and saturates.
r0 := SignedSaturate(a0)
r1 := SignedSaturate(a1)
r2 := SignedSaturate(a2)
r3 := SignedSaturate(a3)
r4 := SignedSaturate(b0)
r5 := SignedSaturate(b1)
r6 := SignedSaturate(b2)
r7 := SignedSaturate(b3)
Packs the 16 signed 16-bit integers from a and b into 8-bit unsigned integers and saturates.
r0 := UnsignedSaturate(a0)
r1 := UnsignedSaturate(a1)
...
r7 := UnsignedSaturate(a7)
r8 := UnsignedSaturate(b0)
r9 := UnsignedSaturate(b1)
...
r15 := UnsignedSaturate(b7)
Extracts the selected signed or unsigned 16-bit integer from a and zero extends. The selector
imm must be an immediate.
r := (imm == 0) ? a0 :
( (imm == 1) ? a1 :
...
(imm == 7) ? a7 )
Inserts the least significant 16 bits of b into the selected 16-bit integer of a. The selector imm
must be an immediate.
r0 := (imm == 0) ? b : a0;
r1 := (imm == 1) ? b : a1;
...
r7 := (imm == 7) ? b : a7;
int _mm_movemask_epi8(__m128i a)
Creates a 16-bit mask from the most significant bits of the 16 signed or unsigned 8-bit integers
in a and zero extends the upper bits.
r := a15[7] << 15 |
a14[7] << 14 |
...
a1[7] << 1 |
a0[7]
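A typical use of this mask is locating a matching byte, for example the terminator in a 16-byte block. The helper name below is illustrative, assuming an SSE2-capable target.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Finds the index of the first zero byte in a 16-byte block using
   _mm_cmpeq_epi8 + _mm_movemask_epi8, or returns -1 if there is none.
   Helper name is illustrative; assumes an SSE2-capable target. */
static int first_zero_byte(const uint8_t p[16])
{
    __m128i v    = _mm_loadu_si128((const __m128i *)p);
    __m128i eq0  = _mm_cmpeq_epi8(v, _mm_setzero_si128()); /* 0xFF where byte == 0 */
    int     mask = _mm_movemask_epi8(eq0);                 /* one bit per byte */
    if (mask == 0)
        return -1;
    int i = 0;                      /* index of the lowest set bit */
    while (!(mask & 1)) { mask >>= 1; ++i; }
    return i;
}
```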
Shuffles the 4 signed or unsigned 32-bit integers in a as specified by imm. The shuffle value,
imm, must be an immediate. See Macro Function for Shuffle for a description of shuffle
semantics.
Shuffles the upper 4 signed or unsigned 16-bit integers in a as specified by imm. The shuffle
value, imm, must be an immediate. See Macro Function for Shuffle for a description of shuffle
semantics.
Shuffles the lower 4 signed or unsigned 16-bit integers in a as specified by imm. The shuffle
value, imm, must be an immediate. See Macro Function for Shuffle for a description of shuffle
semantics.
Interleaves the upper 8 signed or unsigned 8-bit integers in a with the upper 8 signed or
unsigned 8-bit integers in b.
r0 := a8 ; r1 := b8
r2 := a9 ; r3 := b9
...
r14 := a15 ; r15 := b15
Interleaves the upper 4 signed or unsigned 16-bit integers in a with the upper 4 signed or
unsigned 16-bit integers in b.
r0 := a4 ; r1 := b4
r2 := a5 ; r3 := b5
r4 := a6 ; r5 := b6
r6 := a7 ; r7 := b7
Interleaves the upper 2 signed or unsigned 32-bit integers in a with the upper 2 signed or
unsigned 32-bit integers in b.
r0 := a2 ; r1 := b2
r2 := a3 ; r3 := b3
Interleaves the upper signed or unsigned 64-bit integer in a with the upper signed or unsigned
64-bit integer in b.
r0 := a1 ; r1 := b1
Interleaves the lower 8 signed or unsigned 8-bit integers in a with the lower 8 signed or
unsigned 8-bit integers in b.
r0 := a0 ; r1 := b0
r2 := a1 ; r3 := b1
...
r14 := a7 ; r15 := b7
Interleaves the lower 4 signed or unsigned 16-bit integers in a with the lower 4 signed or
unsigned 16-bit integers in b.
r0 := a0 ; r1 := b0
r2 := a1 ; r3 := b1
r4 := a2 ; r5 := b2
r6 := a3 ; r7 := b3
Interleaves the lower 2 signed or unsigned 32-bit integers in a with the lower 2 signed or
unsigned 32-bit integers in b.
r0 := a0 ; r1 := b0
r2 := a1 ; r3 := b1
Interleaves the lower signed or unsigned 64-bit integer in a with the lower signed or unsigned
64-bit integer in b.
r0 := a0 ; r1 := b0
__m64 _mm_movepi64_pi64(__m128i a)
Returns the lower 64 bits of a as an __m64 value.
r0 := a0 ;
__m128i _mm_movpi64_epi64(__m64 a)
Moves the 64 bits of a to the lower 64 bits of the result, zeroing the upper bits.
r0 := a0 ; r1 := 0x0 ;
__m128i _mm_move_epi64(__m128i a)
Moves the lower 64 bits of a to the lower 64 bits of the result, zeroing the upper bits.
r0 := a0 ; r1 := 0x0 ;
The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
- Load Operations
- Set Operations
- Store Operations
The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
(uses MOVDQA) Loads 128-bit value. Address p must be 16-byte aligned.
r := *p
(uses MOVDQU) Loads 128-bit value. Address p need not be 16-byte aligned.
r := *p
(uses MOVQ) Load the lower 64 bits of the value pointed to by p into the lower 64 bits of the
result, zeroing the upper 64 bits of the result.
r0:= *p[63:0]
r1:=0x0
The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
r0 := q0
r1 := q1
r0 := i0
r1 := i1
r2 := i2
r3 := i3
__m128i _mm_set_epi16(short w7, short w6, short w5, short w4, short w3,
short w2, short w1, short w0)
r0 := w0
r1 := w1
...
r7 := w7
__m128i _mm_set_epi8(char b15, char b14, char b13, char b12, char b11, char
b10, char b9, char b8, char b7, char b6, char b5, char b4, char b3, char
b2, char b1, char b0)
r0 := b0
r1 := b1
...
r15 := b15
__m128i _mm_set1_epi64(__m64 q)
r0 := q
r1 := q
__m128i _mm_set1_epi32(int i)
r0 := i
r1 := i
r2 := i
r3 := i
__m128i _mm_set1_epi16(short w)
r0 := w
r1 := w
...
r7 := w
__m128i _mm_set1_epi8(char b)
r0 := b
r1 := b
...
r15 := b
r0 := q0
r1 := q1
r0 := i0
r1 := i1
r2 := i2
r3 := i3
__m128i _mm_setr_epi16(short w0, short w1, short w2, short w3, short w4,
short w5, short w6, short w7)
r0 := w0
r1 := w1
...
r7 := w7
__m128i _mm_setr_epi8(char b15, char b14, char b13, char b12, char b11,
char b10, char b9, char b8, char b7, char b6, char b5, char b4, char b3,
r0 := b0
r1 := b1
...
r15 := b15
__m128i _mm_setzero_si128()
r := 0x0
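The _mm_set and _mm_setr families differ only in argument order, which is a frequent source of confusion. The helper name below is illustrative, assuming an SSE2-capable target.

```c
#include <emmintrin.h>

/* Shows the argument-order difference between _mm_set_epi32 and
   _mm_setr_epi32. Helper name is illustrative; assumes an
   SSE2-capable target. */
static void set_order_demo(int *set_lane0, int *setr_lane0)
{
    __m128i a = _mm_set_epi32(3, 2, 1, 0);  /* last argument becomes r0 */
    __m128i b = _mm_setr_epi32(3, 2, 1, 0); /* first argument becomes r0 */
    *set_lane0  = _mm_cvtsi128_si32(a);
    *setr_lane0 = _mm_cvtsi128_si32(b);
}
```

With the same argument list, lane 0 holds 0 for _mm_set_epi32 but 3 for _mm_setr_epi32.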
The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
(uses MOVDQA) Stores 128-bit value. Address p must be 16-byte aligned.
*p := a
(uses MOVDQU) Stores 128-bit value. Address p need not be 16-byte aligned.
*p := a
(uses MASKMOVDQU) Conditionally store byte elements of d to address p. The high bit of each
byte in the selector n determines whether the corresponding byte in d will be stored. Address p
need not be 16-byte aligned.
if (n0[7]) p[0] := d0
if (n1[7]) p[1] := d1
...
if (n15[7]) p[15] := d15
*p[63:0] := a0
The prototypes for these intrinsics are in the ia64intrin.h header file.
Note
The Intel® C++ Compiler for Itanium®-based applications provides intrinsic functions that deliver
the functionality of inline assembly without inhibiting compiler optimizations or affecting
instruction scheduling.
Integer Operations
Intrinsic / Corresponding Instruction

__int64 _m64_dep_mr(__int64 r, __int64 s, const int pos, const int len)
    dep (Deposit)
__int64 _m64_dep_mi(const int v, __int64 s, const int p, const int len)
    dep (Deposit)
__int64 _m64_dep_zr(__int64 s, const int pos, const int len)
    dep.z (Deposit)
__int64 _m64_dep_zi(const int v, const int pos, const int len)
    dep.z (Deposit)
__int64 _m64_extr(__int64 r, const int pos, const int len)
    extr (Extract)
__int64 _m64_extru(__int64 r, const int pos, const int len)
    extr.u (Extract)
__int64 _m64_xmal(__int64 a, __int64 b, __int64 c)
    xma.l (Fixed-point multiply add using the low 64 bits of the 128-bit result. The result is signed.)
__int64 _m64_xmalu(__int64 a, __int64 b, __int64 c)
    xma.lu (Fixed-point multiply add using the low 64 bits of the 128-bit result. The result is unsigned.)
__int64 _m64_xmah(__int64 a, __int64 b, __int64 c)
    xma.h (Fixed-point multiply add using the high 64 bits of the 128-bit result. The result is signed.)
__int64 _m64_xmahu(__int64 a, __int64 b, __int64 c)
    xma.hu (Fixed-point multiply add using the high 64 bits of the 128-bit result. The result is unsigned.)
__int64 _m64_popcnt(__int64 a)
    popcnt (Population count)
__int64 _m64_shladd(__int64 a, const int count, __int64 b)
    shladd (Shift left and add)
__int64 _m64_shrp(__int64 a, __int64 b, const int count)
    shrp (Shift right pair)
FSR Operations
Intrinsic / Description

void _fsetc(int amask, int omask)
    Sets the control bits of FPSR.sf0. Maps to the fsetc.sf0 r, r instruction. There is no
    corresponding instruction to read the control bits; use _mm_getfpsr().
void _fclrf(void)
    Clears the floating point status flags (the 6-bit flags of FPSR.sf0). Maps to the
    fclrf.sf0 instruction.
The right-justified 64-bit value r is deposited into the value in s at an arbitrary bit position and
the result is returned. The deposited bit field begins at bit position pos and extends to the left
(toward the most significant bit) the number of bits specified by len.
The sign-extended value v (either all 1s or all 0s) is deposited into the value in s at an arbitrary
bit position and the result is returned. The deposited bit field begins at bit position p and extends
to the left (toward the most significant bit) the number of bits specified by len.
The right-justified 64-bit value s is deposited into a 64-bit field of all zeros at an arbitrary bit
position and the result is returned. The deposited bit field begins at bit position pos and extends
to the left (toward the most significant bit) the number of bits specified by len.
The sign-extended value v (either all 1s or all 0s) is deposited into a 64-bit field of all zeros at an
arbitrary bit position and the result is returned. The deposited bit field begins at bit position pos
and extends to the left (toward the most significant bit) the number of bits specified by len.
A field is extracted from the 64-bit value r and is returned right-justified and sign extended. The
extracted field begins at position pos and extends len bits to the left. The sign is taken from
the most significant bit of the extracted field.
A field is extracted from the 64-bit value r and is returned right-justified and zero extended. The
extracted field begins at position pos and extends len bits to the left.
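The deposit-into-zeros and zero-extending extract semantics above can be modeled in portable C. These functions are illustrative only; the _m64_dep_zr and _m64_extru intrinsics emit the actual Itanium instructions, and the sketch assumes pos >= 0, len > 0, and pos + len <= 64.

```c
#include <stdint.h>

/* Portable C model of dep.z: the low len bits of s are deposited into
   a 64-bit field of all zeros, beginning at bit position pos. */
static uint64_t dep_z(uint64_t s, int pos, int len)
{
    uint64_t field = (len == 64) ? s : (s & ((1ULL << len) - 1));
    return field << pos;
}

/* Portable C model of extr.u: a len-bit field beginning at bit
   position pos is returned right-justified and zero extended. */
static uint64_t extr_u(uint64_t r, int pos, int len)
{
    uint64_t shifted = r >> pos;
    return (len == 64) ? shifted : (shifted & ((1ULL << len) - 1));
}
```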
The 64-bit values a and b are treated as signed integers and multiplied to produce a full 128-bit
signed result. The 64-bit value c is zero-extended and added to the product. The least
significant 64 bits of the sum are then returned.
The 64-bit values a and b are treated as unsigned integers and multiplied to produce a full 128-bit
unsigned result. The 64-bit value c is zero-extended and added to the product. The least
significant 64 bits of the sum are then returned.
The 64-bit values a and b are treated as signed integers and multiplied to produce a full 128-bit
signed result. The 64-bit value c is zero-extended and added to the product. The most
significant 64 bits of the sum are then returned.
The 64-bit values a and b are treated as unsigned integers and multiplied to produce a full 128-
bit unsigned result. The 64-bit value c is zero-extended and added to the product. The most
significant 64 bits of the sum are then returned.
__int64 _m64_popcnt(__int64 a)
The number of bits in the 64-bit integer a that have the value 1 are counted, and the resulting
sum is returned.
a is shifted to the left by count bits and then added to b. The result is returned.
a and b are concatenated to form a 128-bit value and shifted to the right count bits. The least
significant 64 bits of the result are returned.
Intrinsic / Description

unsigned __int64 _InterlockedExchange8(volatile unsigned char *Target, unsigned __int64 value)
    Maps to the xchg1 instruction. Atomically writes the least significant byte of its 2nd
    argument to the address specified by its 1st argument.
unsigned __int64 _InterlockedCompareExchange8_rel(volatile unsigned char *Destination, unsigned __int64 Exchange, unsigned __int64 Comparand)
    Atomically compares and exchanges the least significant byte at the address specified by
    its 1st argument. Maps to the cmpxchg1.rel instruction with appropriate setup.
unsigned __int64 _InterlockedCompareExchange8_acq(volatile unsigned char *Destination, unsigned __int64 Exchange, unsigned __int64 Comparand)
    Same as above, but using acquire semantics.
unsigned __int64 _InterlockedExchange16(volatile unsigned short *Target, unsigned __int64 value)
    Maps to the xchg2 instruction. Atomically writes the least significant word of its 2nd
    argument to the address specified by its 1st argument.
unsigned __int64 _InterlockedCompareExchange16_rel(volatile unsigned short *Destination, unsigned __int64 Exchange, unsigned __int64 Comparand)
    Atomically compares and exchanges the least significant word at the address specified by
    its 1st argument. Maps to the cmpxchg2.rel instruction with appropriate setup.
unsigned __int64 _InterlockedCompareExchange16_acq(volatile unsigned short *Destination, unsigned __int64 Exchange, unsigned __int64 Comparand)
    Same as above, but using acquire semantics.
int _InterlockedIncrement(volatile int *addend)
    Atomically increments by one the value specified by its argument. Maps to the fetchadd4
    instruction.
int _InterlockedDecrement(volatile int *addend)
    Atomically decrements by one the value specified by its argument. Maps to the fetchadd4
    instruction.
int _InterlockedExchange(volatile int *Target, int value)
    Performs an exchange operation atomically. Maps to the xchg4 instruction.
int _InterlockedCompareExchange(volatile int *Destination, int Exchange, int Comparand)
    Maps to the cmpxchg4 instruction with appropriate setup. Atomically compares and
    exchanges the value specified by the first argument (a 32-bit pointer).
Note
Uses cmpxchg to perform an atomic subtraction of the incr value from the target. Maps to a
loop with the cmpxchg instruction to guarantee atomicity.
Intrinsic / Description

unsigned __int64 __getReg(const int whichReg)
    Gets the value from a hardware register based on the index passed in. Produces a
    corresponding mov = r instruction. See Register Names for getReg() and setReg() for the
    registers this provides access to.
void __setReg(const int whichReg, unsigned __int64 value)
    Sets the value for a hardware register based on the index passed in. Produces a
    corresponding mov r = instruction. See Register Names for getReg() and setReg().
unsigned __int64 __getIndReg(const int whichIndReg, __int64 index)
    Returns the value of an indexed register. The register file is the first argument; the
    index is the 2nd argument.
void __setIndReg(const int whichIndReg, __int64 index, unsigned __int64 value)
    Copies a value into an indexed register. The register file is the first argument; the
    index is the 2nd argument.
void *_rdteb(void)
    Gets the TEB address. The TEB address is kept in r13 and maps to the mov r = tp
    instruction.
void __isrlz(void)
    Executes the serialize instruction. Maps to the srlz.i instruction.
void __dsrlz(void)
    Serializes the data. Maps to the srlz.d instruction.
unsigned __int64 __fetchadd4_acq(unsigned int *addend, const int increment)
    Maps to the fetchadd4.acq instruction.
unsigned __int64 __fetchadd4_rel(unsigned int *addend, const int increment)
    Maps to the fetchadd4.rel instruction.
unsigned __int64 __fetchadd8_acq(unsigned __int64 *addend, const int increment)
    Maps to the fetchadd8.acq instruction.
unsigned __int64 __fetchadd8_rel(unsigned __int64 *addend, const int increment)
    Maps to the fetchadd8.rel instruction.
void __fwb(void)
    Flushes the write buffers. Maps to the fwb instruction.
void __ldfs(const int whichFloatReg, void *src)
    Maps to the ldfs instruction. Loads a single precision value into the specified register.
void __ldfd(const int whichFloatReg, void *src)
    Maps to the ldfd instruction. Loads a double precision value into the specified register.
void __ldfe(const int whichFloatReg, void *src)
    Maps to the ldfe instruction. Loads an extended precision value into the specified
    register.
void __ldf8(const int whichFloatReg, void *src)
    Maps to the ldf8 instruction.
void __ldf_fill(const int whichFloatReg, void *src)
    Maps to the ldf.fill instruction.
void __stfs(void *dst, const int whichFloatReg)
    Maps to the stfs instruction.
void __stfd(void *dst, const int whichFloatReg)
    Maps to the stfd instruction.
void __stfe(void *dst, const int whichFloatReg)
    Maps to the stfe instruction.
void __stf8(void *dst, const int whichFloatReg)
    Maps to the stf8 instruction.
void __stf_spill(void *dst, const int whichFloatReg)
    Maps to the stf.spill instruction.
void __mf(void)
    Executes a memory fence instruction. Maps to the mf instruction.
void __mfa(void)
    Executes a memory fence, acceptance form instruction. Maps to the mf.a instruction.
void __synci(void)
    Enables memory synchronization. Maps to the sync.i instruction.
void __thash(__int64)
    Generates a translation hash entry address. Maps to the thash r = r instruction.
void __ttag(__int64)
    Generates a translation hash entry tag. Maps to the ttag r = r instruction.
void __itcd(__int64 pa)
    Inserts an entry into the data translation cache. Maps to the itc.d instruction.
void __itci(__int64 pa)
    Inserts an entry into the instruction translation cache. Maps to the itc.i instruction.
void __itrd(__int64 whichTransReg, __int64 pa)
    Maps to the itr.d instruction.
void __itri(__int64 whichTransReg, __int64 pa)
    Maps to the itr.i instruction.
void __ptce(__int64 va)
    Maps to the ptc.e instruction.
void __ptcl(__int64 va, __int64 pagesz)
    Purges the local translation cache. Maps to the ptc.l r, r instruction.
void __ptcg(__int64 va, __int64 pagesz)
    Purges the global translation cache. Maps to the ptc.g r, r instruction.
void __ptcga(__int64 va, __int64 pagesz)
    Purges the global translation cache and ALAT. Maps to the ptc.ga r, r instruction.
void __ptri(__int64 va, __int64 pagesz)
    Purges the translation register. Maps to the ptr.i r, r instruction.
void __ptrd(__int64 va, __int64 pagesz)
    Purges the translation register. Maps to the ptr.d r, r instruction.
__int64 __tpa(__int64 va)
    Maps to the tpa instruction.
void __invalat(void)
    Invalidates the ALAT. Maps to the invala instruction.
void __invala(void)
    Same as void __invalat(void).
void __invala_gr(const int whichGeneralReg)
    whichGeneralReg = 0-127
void __invala_fr(const int whichFloatReg)
    whichFloatReg = 0-127
void __break(const int)
    Generates a break instruction with an immediate.
void __nop(const int)
    Generates a nop instruction.
void __debugbreak(void)
    Generates a Debug Break Instruction fault.
void __fc(__int64)
    Flushes a cache line associated with the address given by the argument. Maps to the
    fc r instruction.
void __sum(int mask)
    Sets the user mask bits of PSR. Maps to the sum imm24 instruction.
void __rum(int mask)
    Resets the user mask.
void __ssm(int mask)
    Sets the system mask.
void __rsm(int mask)
    Resets the system mask bits of PSR. Maps to the rsm imm24 instruction.
__int64 _ReturnAddress(void)
    Gets the caller's address.
void __lfetch(int lfhint, void *y)
    Generates the lfetch.lfhint instruction. The value of the first argument specifies the
    hint type.
void __lfetch_fault(int lfhint, void *y)
    Generates the lfetch.fault.lfhint instruction. The value of the first argument specifies
    the hint type.
unsigned int __cacheSize(unsigned int cacheLevel)
    __cacheSize(n) returns the size in bytes of the cache at level n. 1 represents the
    first-level cache. 0 is returned for a non-existent cache level. For example, an
    application may query the cache size and use it to select block sizes in algorithms that
    operate on matrices.
void __memory_barrier(void)
    Creates a barrier across which the compiler will not schedule any data access
    instruction. The compiler may allocate local data in registers across a memory barrier,
    but not global data.
Intrinsic / Description

__int64 _m_to_int64(__m64 a)
    Converts a of type __m64 to type __int64. Translates to a nop since both types reside in
    the same register on Itanium-based systems.
__m64 _m_from_int64(__int64 a)
    Converts a of type __int64 to type __m64. Translates to a nop since both types reside in
    the same register on Itanium-based systems.
__int64 __round_double_to_int64(double d)
    Converts its double precision argument to a signed integer.
unsigned __int64 __getf_exp(double d)
    Maps to the getf.exp instruction and returns the 16-bit exponent and the sign of its
    operand.
Name whichReg
_IA64_REG_IP 1016
_IA64_REG_PSR 1019
_IA64_REG_PSR_L 1019
Name whichReg
_IA64_REG_GP 1025
_IA64_REG_SP 1036
_IA64_REG_TP 1037
Application Registers
Name whichReg
_IA64_REG_AR_KR0 3072
_IA64_REG_AR_KR1 3073
_IA64_REG_AR_KR2 3074
_IA64_REG_AR_KR3 3075
_IA64_REG_AR_KR4 3076
_IA64_REG_AR_KR5 3077
_IA64_REG_AR_KR6 3078
_IA64_REG_AR_KR7 3079
_IA64_REG_AR_RSC 3088
_IA64_REG_AR_BSP 3089
_IA64_REG_AR_BSPSTORE 3090
_IA64_REG_AR_RNAT 3091
_IA64_REG_AR_FCR 3093
_IA64_REG_AR_EFLAG 3096
_IA64_REG_AR_CSD 3097
_IA64_REG_AR_SSD 3098
_IA64_REG_AR_CFLAG 3099
_IA64_REG_AR_FSR 3100
_IA64_REG_AR_FIR 3101
_IA64_REG_AR_FDR 3102
_IA64_REG_AR_CCV 3104
_IA64_REG_AR_UNAT 3108
_IA64_REG_AR_FPSR 3112
_IA64_REG_AR_ITC 3116
_IA64_REG_AR_PFS 3136
_IA64_REG_AR_LC 3137
_IA64_REG_AR_EC 3138
Control Registers
Name whichReg
_IA64_REG_CR_DCR 4096
_IA64_REG_CR_ITM 4097
_IA64_REG_CR_IVA 4098
_IA64_REG_CR_PTA 4104
_IA64_REG_CR_IPSR 4112
_IA64_REG_CR_ISR 4113
_IA64_REG_CR_IIP 4115
_IA64_REG_CR_IFA 4116
_IA64_REG_CR_ITIR 4117
_IA64_REG_CR_IIPA 4118
_IA64_REG_CR_IFS 4119
_IA64_REG_CR_IIM 4120
_IA64_REG_CR_IHA 4121
_IA64_REG_CR_LID 4160
_IA64_REG_CR_IVR 4161 *
_IA64_REG_CR_TPR 4162
_IA64_REG_CR_EOI 4163
_IA64_REG_CR_IRR0 4164 *
_IA64_REG_CR_IRR1 4165 *
_IA64_REG_CR_IRR2 4166 *
_IA64_REG_CR_IRR3 4167 *
_IA64_REG_CR_ITV 4168
_IA64_REG_CR_PMV 4169
_IA64_REG_CR_CMCV 4170
_IA64_REG_CR_LRR0 4176
_IA64_REG_CR_LRR1 4177
* getReg only
Name whichReg
_IA64_REG_INDR_CPUID 9000 *
_IA64_REG_INDR_DBR 9001
_IA64_REG_INDR_IBR 9002
_IA64_REG_INDR_PKR 9003
_IA64_REG_INDR_PMC 9004
_IA64_REG_INDR_PMD 9005
_IA64_REG_INDR_RR 9006
_IA64_REG_INDR_RESERVED 9007
* getIndReg only
__int64 _m64_czx1l(__m64 a)
The 64-bit value a is scanned for a zero element from the most significant element to the least
significant element, and the index of the first zero element is returned. The element width is 8
bits, so the range of the result is from 0 - 7. If no zero element is found, the default result is 8.
__int64 _m64_czx1r(__m64 a)
The 64-bit value a is scanned for a zero element from the least significant element to the most
significant element, and the index of the first zero element is returned. The element width is 8
bits, so the range of the result is from 0 - 7. If no zero element is found, the default result is 8.
__int64 _m64_czx2l(__m64 a)
The 64-bit value a is scanned for a zero element from the most significant element to the least
significant element, and the index of the first zero element is returned. The element width is 16
bits, so the range of the result is from 0 - 3. If no zero element is found, the default result is 4.
__int64 _m64_czx2r(__m64 a)
The 64-bit value a is scanned for a zero element from the least significant element to the most
significant element, and the index of the first zero element is returned. The element width is 16
bits, so the range of the result is from 0 - 3. If no zero element is found, the default result is 4.
Interleave 64-bit quantities a and b in 1-byte groups, starting from the left, as shown in Figure 1,
and return the result.
Interleave 64-bit quantities a and b in 1-byte groups, starting from the right, as shown in Figure
2, and return the result.
Interleave 64-bit quantities a and b in 2-byte groups, starting from the left, as shown in Figure 3,
and return the result.
Interleave 64-bit quantities a and b in 2-byte groups, starting from the right, as shown in Figure
4, and return the result.
Interleave 64-bit quantities a and b in 4-byte groups, starting from the left, as shown in Figure 5,
and return the result.
Interleave 64-bit quantities a and b in 4-byte groups, starting from the right, as shown in Figure
6, and return the result.
Based on the value of n, a permutation is performed on a as shown in Figure 7, and the result is
returned. Table 1 shows the possible values of n.
n
@brcst 0
@mix 8
@shuf 9
@alt 0xA
@rev 0xB
Based on the value of n, a permutation is performed on a as shown in Figure 8, and the result is
returned.
The unsigned data elements (bytes) of b are subtracted from the unsigned data elements
(bytes) of a and the results of the subtraction are then each independently shifted to the right by
one position. The high-order bits of each element are filled with the borrow bits of the
subtraction.
The unsigned data elements (double bytes) of b are subtracted from the unsigned data
elements (double bytes) of a and the results of the subtraction are then each independently
shifted to the right by one position. The high-order bits of each element are filled with the borrow
bits of the subtraction.
Two signed 16-bit data elements of a, starting with the most significant data element, are
multiplied by the corresponding two signed 16-bit data elements of b, and the two 32-bit results
are returned as shown in Figure 9.
Two signed 16-bit data elements of a, starting with the least significant data element, are
multiplied by the corresponding two signed 16-bit data elements of b, and the two 32-bit results
are returned as shown in Figure 10.
The four signed 16-bit data elements of a are multiplied by the corresponding signed 16-bit data
elements of b, yielding four 32-bit products. Each product is then shifted to the right count bits
and the least significant 16 bits of each shifted product form 4 16-bit results, which are returned
as one 64-bit word.
The four unsigned 16-bit data elements of a are multiplied by the corresponding unsigned 16-bit
data elements of b, yielding four 32-bit products. Each product is then shifted to the right count
bits and the least significant 16 bits of each shifted product form 4 16-bit results, which are
returned as one 64-bit word.
a is shifted to the left by count bits and then is added to b. The upper 32 bits of the result are
forced to 0, and then bits [31:30] of b are copied to bits [62:61] of the result. The result is
returned.
The four signed 16-bit data elements of a are each independently shifted to the right by count
bits (the high order bits of each element are filled with the initial value of the sign bits of the data
elements in a); they are then added to the four signed 16-bit data elements of b. The result is
returned.
a is added to b as four separate 16-bit wide elements. The elements of a are treated as
unsigned, while the elements of b are treated as signed. The results are treated as unsigned
and are returned as one 64-bit word.
a is subtracted from b as eight separate byte-wide elements. The elements of a are treated as
unsigned, while the elements of b are treated as signed. The results are treated as unsigned
and are returned as one 64-bit word.
a is subtracted from b as four separate 16-bit wide elements. The elements of a are treated as
unsigned, while the elements of b are treated as signed. The results are treated as unsigned
and are returned as one 64-bit word.
The unsigned byte-wide data elements of a are added to the unsigned byte-wide data elements
of b and the results of each add are then independently shifted to the right by one position. The
high-order bits of each element are filled with the carry bits of the sums.
The unsigned 16-bit wide data elements of a are added to the unsigned 16-bit wide data
elements of b and the results of each add are then independently shifted to the right by one
position. The high-order bits of each element are filled with the carry bits of the sums.
- Alignment Support
- Allocating and Freeing Aligned Memory Blocks
- Inline Assembly
Alignment Support
To improve intrinsics performance, you need to align data. For example, when you are using the
Streaming SIMD Extensions, you should align data to 16 bytes in memory operations to improve
performance. Specifically, you must align __m128 objects as addresses passed to the _mm_load and
_mm_store intrinsics. If you want to declare arrays of floats and treat them as __m128 objects by
casting, you need to ensure that the float arrays are properly aligned.
Use __declspec(align) to direct the compiler to align data more strictly than it otherwise does on
both IA-32 and Itanium®-based systems. For example, a data object of type int is allocated at a byte
address which is a multiple of 4 by default (the size of an int). By using __declspec(align),
however, you can direct the compiler to instead use an address which is a multiple of 8, 16, or 32,
subject to restrictions on IA-32.
You can use this data alignment support as an advantage in optimizing cache line usage. By clustering
small objects that are commonly used together into a struct, and forcing the struct to be allocated
at the beginning of a cache line, you can effectively guarantee that each object is loaded into the cache
as soon as any one is accessed, resulting in a significant performance benefit.
align(n)
where n is an integral power of 2, less than or equal to 32. The value specified is the requested
alignment.
Note
If a value is specified that is less than the alignment of the affected data type, it has no effect. In other
words, data is aligned to the maximum of its own alignment or the alignment specified with
__declspec(align).
You can request alignments for individual variables, whether of static or automatic storage duration.
(Global and static variables have static storage duration; local variables have automatic storage
duration by default.) You cannot adjust the alignment of a parameter, nor a field of a struct or
class. You can, however, increase the alignment of a struct (or union or class ), in which case
every object of that type is affected.
As an example, suppose that a function uses local variables i and j as subscripts into a 2-
dimensional array. They might be declared as follows:
int i, j;
These variables are commonly used together, but they can fall in different cache lines, which could be
detrimental to performance. You can instead declare them as follows:
__declspec(align(16)) struct { int i, j; } sub;
The compiler now ensures that they are allocated in the same cache line. In C++, you can omit the
struct variable name (written as sub in the above example). In C, however, it is required, and you
must write references to i and j as sub.i and sub.j.
If you use many functions with such subscript pairs, it is more convenient to declare and use a struct
type for them, as in the following example:
By placing the __declspec(align) after the keyword struct, you are requesting the appropriate
alignment for all objects of that type. However, the allocation of parameters is unaffected by
__declspec(align). (If necessary, you can assign the value of a parameter to a local variable with
the appropriate alignment.)
The _mm_malloc routine takes an extra parameter, which is the alignment constraint. This constraint
must be a power of two. The pointer that is returned from _mm_malloc is guaranteed to be aligned on
the specified boundary.
Note
Memory that is allocated using _mm_malloc must be freed using _mm_free. Calling free on
memory allocated with _mm_malloc or calling _mm_free on memory allocated with malloc will
cause unpredictable behavior.
Inline Assembly
By default, the compiler inlines a number of standard C, C++, and math library functions. This usually
results in faster execution of your program.
Sometimes inline expansion of library functions can cause unexpected results. The inlined library
functions do not set the errno variable. So, in code that relies upon the setting of the errno variable,
you should use the -nolib_inline option, which turns off inline expansion of library functions. Also,
if one of your functions has the same name as one of the compiler's supplied library functions, the
compiler assumes that it is one of the latter and replaces the call with the inlined version.
Consequently, if the program defines a function with the same name as one of the known library
routines, you must use the -nolib_inline option to ensure that the program's function is the one
used.
Note
Automatic inline expansion of library functions is not related to the inline expansion that the compiler
does during interprocedural optimizations. For example, the following command compiles the program
sum.c without expanding the library functions, but with inline expansion from interprocedural
optimizations (IPO):
prompt>icc -ip -nolib_inline sum.c
The Intel® C++ Compiler supports MASM style inline assembly with the -use_msasm option. See your
MASM documentation for the proper syntax.
The Intel® C++ Compiler supports GNU-like style inline assembly. The syntax is as follows:
Syntax Element     Description
asm-keyword        asm statements begin with the keyword asm. Alternatively, either __asm or __asm__ may be used for compatibility.
volatile-keyword   If the optional keyword volatile is given, the asm is volatile. Two volatile asm statements will never be moved past each other, and a reference to a volatile variable will not be moved relative to a volatile asm. The alternate keywords __volatile and __volatile__ may be used for compatibility.
asm-template       The asm-template is a C language ASCII string which specifies how to output the assembly code for an instruction. Most of the template is a fixed string; everything but the substitution-directives, if any, is passed through to the assembler. The syntax for a substitution directive is a % followed by one or two characters. The supported substitution directives are specified in a subsequent section.
asm-interface      The asm-interface consists of three parts:
                   1. an optional output-list
                   2. an optional input-list
                   3. an optional clobber-list
                   These are separated by colon (:) characters. If the output-list is missing, but an input-list is given, the input-list may be preceded by two colons (::) to take the place of the missing output-list. If the asm-interface is omitted altogether, the asm statement is considered volatile regardless of whether a volatile-keyword was specified.
output-list        An output-list consists of one or more output-specs separated by commas. For the purposes of substitution in the asm-template, each output-spec is numbered. The first operand in the output-list is numbered 0, the second is 1, and so on. Numbering is continuous through the output-list and into the input-list. The total number of operands is limited to 10 (i.e. 0-9).
input-list         Similar to an output-list, an input-list consists of one or more input-specs separated by commas. For the purposes of substitution in the asm-template, each input-spec is numbered, with the numbers continuing from those in the output-list.
clobber-list       A clobber-list tells the compiler that the asm uses or changes a specific machine register that is either coded directly into the asm or is changed implicitly by the assembly instruction. The clobber-list is a comma-separated list of clobber-specs.
input-spec         The input-specs tell the compiler about expressions whose values may be needed by the inserted assembly instruction. In order to describe fully the input requirements of the asm, you can list input-specs that are not actually referenced in the asm-template.
clobber-spec       Each clobber-spec specifies the name of a single machine register that is clobbered. The register name may optionally be preceded by a %. The following are the valid register names: eax, ebx, ecx, edx, esi, edi, ebp, esp, ax, bx, cx, dx, si, di, bp, sp, al, bl, cl, dl, ah, bh, ch, dh, st, st(1) - st(7), mm0 - mm7, xmm0 - xmm7, and cc. It is also legal to specify "memory" in a clobber-spec. This prevents the compiler from keeping data cached in registers across the asm statement.
- Intrinsics may generate code that does not run on all IA processors. The programmer is therefore
  responsible for using CPUID to detect the processor and for generating the appropriate code.
- Implement intrinsics by processor family, not by specific processor. The guiding principle for
  deciding which family (IA-32 or Itanium® processors) an intrinsic is implemented on is performance,
  not compatibility. Where there is added performance on both families, the intrinsic is identical.
float powf(float, float) A A A A A
double sin(double) A A A A A
float sinf(float) A A A A A
double cos(double) A A A A A
float cosf(float) A A A A A
double tan(double) A A A A A
float tanf(float) A A A A A
double acos(double) A A A A A
float acosf(float) A A A A A
double acosh(double) A A A A A
float acoshf(float) A A A A A
double asin(double) A A A A A
float asinf(float) A A A A A
double asinh(double) A A A A A
float asinhf(float) A A A A A
double atan(double) A A A A A
float atanf(float) A A A A A
double atanh(double) A A A A A
float atanhf(float) A A A A A
float cabs(double)* A A A A A
double ceil(double) A A A A A
float ceilf(float) A A A A A
double cosh(double) A A A A A
float coshf(float) A A A A A
float fabsf(float) A A A A A
double floor(double) A A A A A
float floorf(float) A A A A A
double fmod(double, double) A A A A A
float fmodf(float, float) A A A A A
double hypot(double, double) A A A A A
float hypotf(float, float) A A A A A
double rint(double) A A A A A
float rintf(float) A A A A A
double sinh(double) A A A A A
float sinhf(float) A A A A A
float sqrtf(float) A A A A A
double tanh(double) A A A A A
float tanhf(float) A A A A A
char *_strset(char *, _int32) A A A A A
int memcmp(const void *cs, const void *ct, size_t n) A A A A A
void *memcpy(void *s, const void *ct, size_t n) A A A A A
void *memset(void *s, int c, size_t n) A A A A A
char *strcat(char *s, const char *ct) A A A A A
int strcmp(const char *, const char *) A A A A A
char *strcpy(char *s, const char *ct) A A A A A
size_t strlen(const char *cs) A A A A A
int strncmp(char *, char *, int) A A A A A
char *strncpy(char *, char *, int) A A A A A
void *__alloca(int) A A A A A
int _setjmp(jmp_buf) A A A A A
_exception_code(void) A A A A A
_exception_info(void) A A A A A
_abnormal_termination(void) A A A A A
void _enable() A A A A A
void _disable() A A A A A
int _bswap(int) A A A A A
int _in_byte(int) A A A A A
int _in_dword(int) A A A A A
int _in_word(int) A A A A A
int _inp(int) A A A A A
int _inpd(int) A A A A A
int _inpw(int) A A A A A
int _out_byte(int, int) A A A A A
int _out_dword(int, int) A A A A A
int _out_word(int, int) A A A A A
int _outp(int, int) A A A A A
int _outpd(int, int) A A A A A
int _outpw(int, int) A A A A A
You can find the definitions for these operations in three header files: ivec.h, fvec.h, and dvec.h.
The classes themselves are not partitioned like this. The classes are named according to the
underlying type of operation. The header files are partitioned according to architecture:
Streaming SIMD Extensions 2 intrinsics cannot be used on Itanium®-based systems. The mmclass.h
header file includes the classes that are usable on the Itanium architecture.
This documentation is intended for programmers writing code for the Intel architecture, particularly
code that would benefit from the use of SIMD instructions. You should be familiar with C++ and the use
of C++ classes.
Performing four operations with a single instruction improves efficiency by a factor of four for that
particular instruction.
These new processor instructions can be implemented using assembly inlining, intrinsics, or the C++
SIMD classes. Compare the coding required to add four 32-bit floating-point values, using each of the
available interfaces:
The table above shows an addition of two single-precision floating-point values using assembly
inlining, intrinsics, and the class libraries. You can see how much easier it is to code with the Intel C++
SIMD Class Libraries. Besides requiring fewer keystrokes and fewer lines of code, the notation follows
standard C++ notation, making it much easier to use than the other methods.
The following example shows the same results using one operation with Ivec Classes.
Available Classes
The Intel® C++ SIMD classes provide parallelism, which is not easily implemented using typical
mechanisms of C++. The following table shows how the Intel C++ SIMD classes use the classes and
libraries.
Most classes contain similar functionality for all data types and are represented by all available
intrinsics. However, some capabilities do not translate from one data type to another without suffering
from poor performance, and are therefore excluded from individual classes.
Note
Intrinsics that take immediate values and cannot be expressed easily in classes are not implemented.
(For example, _mm_shuffle_ps, _mm_shuffle_pi16, _mm_extract_pi16,
_mm_insert_pi16).
Each succeeding file from the top down includes the preceding class. You only need to include
fvec.h if you want to use both the Ivec and Fvec classes. Similarly, to use all the classes including
those for the Streaming SIMD Extensions 2, you need only to include the dvec.h file.
Usage Precautions
When using the C++ classes, you should follow some general guidelines. More detailed usage rules
for each class are listed in Integer Vector Classes, and Floating-point Vector Classes.
If you use both the Ivec and Fvec classes at the same time, your program could mix MMX
instructions, called by Ivec classes, with Intel x87 architecture floating-point instructions, called by
Fvec classes. Floating-point instructions exist in the following Fvec functions:
- fvec constructors
- debug functions (cout and element access)
- rsqrt_nr
Note
MMX registers are aliased on the floating-point registers, so you should clear the MMX state with the
EMMS instruction intrinsic before issuing an x87 floating-point instruction, as in the following example.
ivecA = ivecA & ivecB; /* Ivec logical operation that uses MMX instructions */
empty();               /* clear state */
cout << f32vec4a;      /* F32vec4 operation that uses x87 floating-point instructions */
Caution
Failure to clear the MMX registers can result in incorrect execution or poor performance due to an
incorrect register state.
Intel strongly recommends that you follow the guidelines for using the EMMS instruction. Refer to this
topic before coding with the Ivec classes.
Capabilities
The fundamental capabilities of each C++ SIMD class include:
- Computation
- Horizontal data motion
- Branch compression/elimination
- Caching hints
Understanding each of these capabilities and how they interact is crucial to achieving desired results.
Computation
The SIMD C++ classes contain vertical operator support for most arithmetic operations, including
shifting and saturation.
Computation operations include: +, -, *, /, reciprocal ( rcp and rcp_nr ), square root (sqrt),
reciprocal square root ( rsqrt and rsqrt_nr ).
Operations rcp and rsqrt are new approximating instructions with very short latencies that produce
results with at least 12 bits of accuracy. Operations rcp_nr and rsqrt_nr use software refining
techniques to enhance the accuracy of the approximations, with a minimal impact on performance.
(The "nr" stands for Newton-Raphson, a mathematical technique for improving performance using an
approximate result.)
The add_horizontal, unpack_low and pack_sat functions are examples of horizontal data
support. This support enables certain algorithms that cannot exploit the full potential of SIMD
instructions.
Shuffle intrinsics are another example of horizontal data flow. Shuffle intrinsics are not expressed in
the C++ classes due to their immediate arguments. However, the C++ class implementation enables
you to mix shuffle intrinsics with the other C++ functions. For example:
Typically every instruction with horizontal data flow contains some inefficiency in the implementation. If
possible, implement your algorithms without using the horizontal capabilities.
Branch Compression/Elimination
Branching in SIMD architectures can be complicated and expensive, possibly resulting in poor
predictability and code expansion. The SIMD C++ classes provide functions to eliminate branches,
using logical operations, max and min functions, conditional selects, and compares. Consider the
following example:
This operation is independent of the value of i. For each i, the result could be either A or B depending
on the actual values. A simple way of removing the branch altogether is to use the select_gt
function, as follows:
Is16vec4 a, b, c;
c = select_gt(a, b, a, b);
Caching Hints
Streaming SIMD Extensions provide prefetching and streaming hints. Prefetching data can minimize
the effects of memory latency. Streaming hints allow you to indicate that certain data should not be
cached. This results in higher performance for data that should be cached.
The M64 and M128 classes define the __m64 and __m128i data types from which the rest of the
Ivec classes are derived. The first generation of child classes are derived based solely on bit sizes of
128, 64, 32, 16, and 8 for the I128vec1, I64vec1, I64vec2, I32vec2, I32vec4,
I16vec4, I16vec8, I8vec16, and I8vec8 classes. The latter seven of these classes require
specification of signedness and saturation.
Caution
Do not intermix the M64 and M128 data types. You will get unexpected behavior if you do.
Is64vec2
Iu64vec2
Is32vec4
Iu32vec4
Is16vec8
Iu16vec8
Is8vec16
Iu8vec16
Is32vec2
Iu32vec2
Is16vec4
Iu16vec4
Is8vec8
Iu8vec8
<type><signedness><bits>vec<elements>
{ F | I } { s | u } { 64 | 32 | 16 | 8 } vec { 8 | 4 | 2 | 1 }
where
- Nearest Common Ancestor -- This is the intermediate or parent class of two classes of the
  same size. For example, the nearest common ancestor of Iu8vec8 and Is8vec8 is I8vec8.
  Also, the nearest common ancestor between Iu8vec8 and I16vec4 is M64.
- Casting -- Changes the data type from one class to another. When an operation uses different
  data types as operands, the return value of the operation must be assigned to a single data
  type. Therefore, one or more of the data types must be converted to a required data type. This
  conversion is known as a typecast. Sometimes, typecasting is automatic; other times you must
  use special syntax to explicitly typecast it yourself.
- Operator Overloading -- This is the ability to use various operators on the same user-defined
  data type of a given class. Once you declare a variable, you can add, subtract, multiply, and
  perform a range of operations. Each family of classes accepts a specified range of operators,
  and must comply with rules and restrictions regarding typecasting and operator overloading as
  defined in the header files. The following table shows the notation used in this documentation to
  address typecasting, operator overloading, and other rules.
The table that follows shows automatic and explicit sign and size typecasting. "Explicit" means that it is
illegal to mix different types without an explicit typecasting. "Automatic" means that you can mix types
freely and the compiler will do the typecasting for you.
Assignment Operator
Any Ivec object can be assigned to any other Ivec object; conversion on assignment from one Ivec
object to another is automatic.
Is16vec4 A;
Is8vec8 B;
I64vec1 C;
A = B; /* assign an Is8vec8 to an Is16vec4 */
B = C; /* assign an I64vec1 to an Is8vec8 */
Logical Operators
The logical operators use the symbols and intrinsics listed in the following table.
I64vec1 A;
Is8vec8 B;
Iu8vec8 C;
C = A & B;
/* Same size and signedness operators return the nearest common ancestor.*/
C = Iu8vec8(A&B) + C;
When A and B are of the same class, they return the same type. When A and B are of different
classes, the return value is the return type of the nearest common ancestor.
The logical operator return values for combinations of classes, listed in the following tables, apply
when A and B are of different classes.
For logical operators with assignment, the return value of R is always the same data type as the pre-
declared value of R as listed in the table that follows.
Return Type Left Side (R) AND OR XOR Right Side (Any Ivec Type)
I128vec1 I128vec1 R &= |= ^= I[s|u][N]vec[N] A;
Is16vec4 A;
Iu16vec4 B;
I16vec4 C;
C = A + B;
Is16vec4 A;
Iu16vec4 B;
A += B;
B -= A;
Is16vec4 A,C;
Iu32vec2 B;
C = A + C;
C = A + (Is16vec4)B;
Operation    Symbols  Syntax     Corresponding Intrinsics
Addition     +        R = A + B  _mm_add_epi64, _mm_add_epi32, _mm_add_epi16, _mm_add_epi8,
             +=       R += A     _mm_add_pi32, _mm_add_pi16, _mm_add_pi8
Subtraction  -        R = A - B  _mm_sub_epi64, _mm_sub_epi32, _mm_sub_epi16, _mm_sub_epi8,
             -=       R -= A     _mm_sub_pi32, _mm_sub_pi16, _mm_sub_pi8
The following table lists addition and subtraction return values for combinations of classes when the
right side operands are of different signedness. The two operands must be the same size, otherwise
you must explicitly indicate the typecasting.
R Add Sub A B
I64vec2 R + - I[s|u]64vec2 A I[s|u]64vec2 B
The following table shows the return data type values for operands of the addition and subtraction
operators with assignment. The left side operand determines the size and signedness of the return
value. The right side operand must be the same size as the left operand; otherwise, you must use an
explicit typecast.
Return Value (R) Left Side (R) Add Sub Right Side (A)
I[x]32vec4 I[x]32vec4 R += -= I[s|u]32vec4 A;
I[x]32vec2 I[x]32vec2 R += -= I[s|u]32vec2 A;
Multiplication Operators
The multiplication operators can only accept and return data types from the I[s|u]16vec4 or I
[s|u]16vec8 classes, as shown in the following example.
Is16vec4 A,C;
Iu32vec2 B;
C = A * C;
C = A * (Is16vec4)B;
Is16vec4 A,B,C,D;
C = mul_high(A,B);
D = mul_add(A,B);
* *= R = A * B _mm_mullo_pi16
     R *= A    _mm_mullo_epi16
The multiplication return operators always return the nearest common ancestor as listed in the table
that follows. The two operands must be 16 bits in size, otherwise you must explicitly indicate
typecasting.
R Mul A B
I16vec4 R * I[s|u]16vec4 A I[s|u]16vec4 B
The following table shows the return values and data type assignments for operands of the
multiplication operators with assignment. All operands must be 16 bits in size. If the operands are not
the right size, you must use an explicit typecast.
Return Value (R) Left Side (R) Mul Right Side (A)
I[x]16vec8 I[x]16vec8 *= I[s|u]16vec8 A;
Shift Operators
The right shift argument can be any integer or Ivec value, and is implicitly converted to an M64 data
type. The first or left operand of a << can be of any type except I[s|u]8vec[8|16].
Is16vec4 A, C;
Iu32vec2 B;
C = A >> B;
Is16vec4 A, C;
Iu16vec4 B, R;
R = (Iu16vec4)(A & B) >> C; /* logical shift */
R = (Is16vec4)(A & B) >> C; /* arithmetic shift */
Right shift operations with signed data types use arithmetic shifts. All unsigned and intermediate
classes correspond to logical shifts. The table below shows how the return type is determined by the
first argument type.
Comparison Operators
The equality and inequality comparison operands can have mixed signedness, but they must be of the
same size. The comparison operators for less-than and greater-than must be of the same sign and
size.
Iu8vec8 A;
Is8vec8 B;
I8vec8 C;
C = cmpneq(A,B); /* mixed signedness is allowed for equal/not-equal comparisons */
Iu8vec8 A, C;
Is16vec4 B;
C = cmpeq(A,(Iu8vec8)B);
Iu16vec4 A;
Is16vec4 B, C;
C = cmpge((Is16vec4)A,B);
C = cmpgt(B,C);
Comparison operators have the restriction that the operands must be the same size and sign as listed
in the Compare Operator Overloading table.
R Comparison A B
I32vec2 R cmpeq I[s|u]32vec2 A I[s|u]32vec2 B
cmpne
I16vec4 R I[s|u]16vec4 A I[s|u]16vec4 B
I8vec8 R I[s|u]8vec8 A I[s|u]8vec8 B
I32vec2 R cmpgt Is32vec2 A Is32vec2 B
cmpge
I16vec4 R cmplt Is16vec4 A Is16vec4 B
cmple
I8vec8 R Is8vec8 A Is8vec8 B
/* Return the nearest common ancestor data type if third and fourth
operands are of the same size, but different signs */
All conditional select operands must be of the same size. The return data type is the nearest common
ancestor of operands C and D. For conditional select operations using greater-than or less-than
operations, the first and second operands must be signed as listed in the table that follows.
R Comparison A and B C D
I32vec2 R select_eq I[s|u]32vec2 I[s|u]32vec2 I[s|u]32vec2
select_ne
I16vec4 R I[s|u]16vec4 I[s|u]16vec4 I[s|u]16vec4
I8vec8 R I[s|u]8vec8 I[s|u]8vec8 I[s|u]8vec8
The table below shows the mapping of return values from R0 to R7 for any number of elements. The
same return value mappings also apply when there are fewer than four return values.
Debug
The debug operations do not map to any compiler intrinsics for MMX(TM) instructions. They are
provided for debugging programs only. Use of these operations may result in loss of performance, so
you should not use them outside of debugging.
Output
The four 32-bit values of A are placed in the output buffer and printed in the following format (default in
decimal):
"[3]:A3 [2]:A2 [1]:A1 [0]:A0"
The two 32-bit values of A are placed in the output buffer and printed in the following format (default in
decimal):
"[1]:A1 [0]:A0"
Corresponding Intrinsics: none
The eight 16-bit values of A are placed in the output buffer and printed in the following format (default
in decimal):
The four 16-bit values of A are placed in the output buffer and printed in the following format (default in
decimal):
The sixteen 8-bit values of A are placed in the output buffer and printed in the following format (default
is decimal):
cout << Is8vec16 A;
cout << Iu8vec16 A;
cout << hex << Iu8vec16 A;
The eight 8-bit values of A are placed in the output buffer and printed in the following format (default is
decimal):
cout << Is8vec8 A;
cout << Iu8vec8 A;
cout << hex << Iu8vec8 A;
Access and read element i of A. If DEBUG is enabled and the user tries to access an element
outside of A, a diagnostic message is printed and the program aborts.
Assign R to element i of A. If DEBUG is enabled and the user tries to assign a value to an
element outside of A, a diagnostic message is printed and the program aborts.
Unpack Operators
Interleave the 64-bit value from the high half of A with the 64-bit value from the high half of B.
R0 = A1;
R1 = B1;
Corresponding intrinsic: _mm_unpackhi_epi64
Interleave the two 32-bit values from the high half of A with the two 32-bit values from the high half of
B.
R0 = A2;
R1 = B2;
R2 = A3;
R3 = B3;
Corresponding intrinsic: _mm_unpackhi_epi32
Interleave the 32-bit value from the high half of A with the 32-bit value from the high half of B.
R0 = A1;
R1 = B1;
Corresponding intrinsic: _mm_unpackhi_pi32
Interleave the four 16-bit values from the high half of A with the four 16-bit values from the high half of
B.
R0 = A4;
R1 = B4;
R2 = A5;
R3 = B5;
R4 = A6;
R5 = B6;
R6 = A7;
R7 = B7;
Corresponding intrinsic: _mm_unpackhi_epi16
Interleave the two 16-bit values from the high half of A with the two 16-bit values from the high half of
B.
R0 = A2;
R1 = B2;
R2 = A3;
R3 = B3;
Corresponding intrinsic: _mm_unpackhi_pi16
Interleave the four 8-bit values from the high half of A with the four 8-bit values from the high half of B.
R0 = A4;
R1 = B4;
R2 = A5;
R3 = B5;
R4 = A6;
R5 = B6;
R6 = A7;
R7 = B7;
Corresponding intrinsic: _mm_unpackhi_pi8
Interleave the eight 8-bit values from the high half of A with the eight 8-bit values from the high half of
B.
R0 = A8;
R1 = B8;
R2 = A9;
R3 = B9;
R4 = A10;
R5 = B10;
R6 = A11;
R7 = B11;
R8 = A12;
R9 = B12;
R10 = A13;
R11 = B13;
R12 = A14;
R13 = B14;
R14 = A15;
R15 = B15;
Corresponding intrinsic: _mm_unpackhi_epi8
Interleave the 64-bit value from the low half of A with the 64-bit value from the low half of B.
R0 = A0;
R1 = B0;
Corresponding intrinsic: _mm_unpacklo_epi64
Interleave the two 32-bit values from the low half of A with the two 32-bit values from the low half of B.
R0 = A0;
R1 = B0;
R2 = A1;
R3 = B1;
Corresponding intrinsic: _mm_unpacklo_epi32
Interleave the 32-bit value from the low half of A with the 32-bit value from the low half of B.
R0 = A0;
R1 = B0;
Corresponding intrinsic: _mm_unpacklo_pi32
Interleave the four 16-bit values from the low half of A with the four 16-bit values from the low half of B.
R0 = A0;
R1 = B0;
R2 = A1;
R3 = B1;
R4 = A2;
R5 = B2;
R6 = A3;
R7 = B3;
Corresponding intrinsic: _mm_unpacklo_epi16
Interleave the two 16-bit values from the low half of A with the two 16-bit values from the low half of B.
R0 = A0;
R1 = B0;
R2 = A1;
R3 = B1;
Corresponding intrinsic: _mm_unpacklo_pi16
Interleave the eight 8-bit values from the low half of A with the eight 8-bit values from the low half of B.
R0 = A0;
R1 = B0;
R2 = A1;
R3 = B1;
R4 = A2;
R5 = B2;
R6 = A3;
R7 = B3;
R8 = A4;
R9 = B4;
R10 = A5;
R11 = B5;
R12 = A6;
R13 = B6;
R14 = A7;
R15 = B7;
Corresponding intrinsic: _mm_unpacklo_epi8
Interleave the four 8-bit values from the low half of A with the four 8-bit values from the low half of B.
R0 = A0;
R1 = B0;
R2 = A1;
R3 = B1;
R4 = A2;
R5 = B2;
R6 = A3;
R7 = B3;
Corresponding intrinsic: _mm_unpacklo_pi8
Pack Operators
Pack the eight 32-bit values found in A and B into eight 16-bit values with signed saturation.
Pack the four 32-bit values found in A and B into four 16-bit values with signed saturation.
Pack the sixteen 16-bit values found in A and B into sixteen 8-bit values with signed saturation.
Pack the eight 16-bit values found in A and B into eight 8-bit values with signed saturation.
Pack the sixteen 16-bit values found in A and B into sixteen 8-bit values with unsigned saturation.
Pack the eight 16-bit values found in A and B into eight 8-bit values with unsigned saturation.
Empty the MMX state. Call empty() after Ivec (MMX technology) operations and before floating-point
operations.
void empty(void);
Corresponding intrinsic: _mm_empty
You must include the fvec.h header file for the following functionality.
Compute the element-wise maximum of the respective signed integer words in A and B.
Compute the element-wise minimum of the respective signed integer words in A and B.
Create an 8-bit mask from the most significant bits of the bytes in A.
Conditionally store byte elements of A to address p. The high bit of each byte in the selector B
determines whether the corresponding byte in A will be stored.
Store the data in A to the address p without polluting the caches. A can be any Ivec type.
Compute the element-wise average of the respective unsigned 8-bit integers in A and B.
Compute the element-wise average of the respective unsigned 16-bit integers in A and B.
Convert the least significant double-precision floating-point value of A to a 32-bit integer with
truncation.
r := (int)A0;
Convert the two least significant single-precision floating-point values of A to two double-precision
floating-point values.
r0 := (double)A0;
r1 := (double)A1;
Convert the two double-precision floating-point values of A to two single-precision floating-point values.
r0 := (float)A0;
r1 := (float)A1;
Convert the signed int in B to a double-precision floating-point value and pass the upper double-
precision value from A through to the result.
r0 := (double)B;
r1 := A1;
Convert the least significant single-precision floating-point value of A to a 32-bit integer with
truncation.
r := (int)A0;
Convert the two lower floating-point values of A to two 32-bit integers with truncation, returning the
integers in packed form.
r0 := (int)A0;
r1 := (int)A1;
Convert the 32-bit integer value B to a floating-point value; the upper three floating-point values are
passed through from A.
r0 := (float)B;
r1 := A1;
r2 := A2;
r3 := A3;
Convert the two 32-bit integer values in packed form in B to two floating-point values; the upper two
floating-point values are passed through from A.
r0 := (float)B0;
r1 := (float)B1;
r2 := A2;
r3 := A3;
The packed floating-point input values are represented with the right-most value lowest as shown in
the following table.
Fvec classes use the syntax conventions shown in the following examples:
Data Alignment
Memory operations using the Streaming SIMD Extensions should be performed on 16-byte-aligned
data whenever possible.
F32vec4 and F64vec2 object variables are properly aligned by default. Note that floating-point arrays
are not automatically aligned. To get 16-byte alignment, you can use the __declspec(align(16))
alignment declaration:
Conversions
All Fvec object variables can be implicitly converted to __m128 data types. For example, the results of
computations performed on F32vec4 or F32vec1 object variables can be assigned to __m128 data
types:
__m128d mm = A & B; /* where A,B are F64vec2 object variables */
Constructor Declaration   Intrinsic   Operation
F64vec2 A;                N/A         N/A
F32vec4 B;                N/A         N/A
F32vec1 C;                N/A         N/A
Double Initialization
/* Initializes two doubles. */
F64vec2 A(double d0, double d1);
F64vec2 A = F64vec2(double d0, double d1);
Intrinsic: _mm_set_pd
Operation: A0 := d0; A1 := d1;
/* Initializes both return values with the same double-precision value. */
F64vec2 A(double d0);
Intrinsic: _mm_set1_pd
Operation: A0 := d0; A1 := d0;
Float Initialization
F32vec4 A(float f3, float f2, float f1, float f0);
F32vec4 A = F32vec4(float f3, float f2, float f1, float f0);
Intrinsic: _mm_set_ps
Operation: A0 := f0; A1 := f1; A2 := f2; A3 := f3;
/* Initializes all return values with the same floating-point value. */
F32vec4 A(float f0);
Intrinsic: _mm_set1_ps
Operation: A0 := f0; A1 := f0; A2 := f0; A3 := f0;
/* Initializes all return values with the same double-precision value. */
F32vec4 A(double d0);
Intrinsic: _mm_set1_ps(d)
Operation: A0 := d0; A1 := d0; A2 := d0; A3 := d0;
/* Initializes the lowest value of A with d0 and the other values with 0. */
F32vec1 A(double d0);
Intrinsic: _mm_set_ss(d)
Operation: A0 := d0; A1 := 0; A2 := 0; A3 := 0;
/* Initializes the lowest value of B with f0 and the other values with 0. */
F32vec1 B(float f0);
Intrinsic: _mm_set_ss
Operation: B0 := f0; B1 := 0; B2 := 0; B3 := 0;
Arithmetic Operators
The following table lists the arithmetic operators of the Fvec classes and generic syntax. The
operators have been divided into standard and advanced operations, which are described in more
detail later in this section.
Standard Addition + R = A + B;
+= R += A;
Subtraction - R = A - B;
-= R -= A;
Multiplication * R = A * B;
*= R *= A;
Division / R = A / B;
/= R /= A;
For the binary operators (+, -, *, /), each element of the result is computed from the corresponding
elements of the operands:
R0 := A0 [+ - * /] B0
R1 := A1 [+ - * /] B1   (N/A for F32vec1)
R2 := A2 [+ - * /] B2   (F32vec4 only)
R3 := A3 [+ - * /] B3   (F32vec4 only)
For the assignment operators (+=, -=, *=, /=), each element of R is combined in place with the
corresponding element of A:
R0 [+= -= *= /=] A0
R1 [+= -= *= /=] A1   (N/A for F32vec1)
R2 [+= -= *= /=] A2   (F32vec4 only)
R3 [+= -= *= /=] A3   (F32vec4 only)
The table below lists standard arithmetic operator syntax and intrinsics.
Square Root
Reciprocal
Horizontal Add
Compute the minimums of the two double precision floating-point values of A and B.
R0 := min(A0,B0);
R1 := min(A1,B1);
Corresponding intrinsic: _mm_min_pd
Compute the minimums of the four single precision floating-point values of A and B.
R0 := min(A0,B0);
R1 := min(A1,B1);
R2 := min(A2,B2);
R3 := min(A3,B3);
Corresponding intrinsic: _mm_min_ps
Compute the minimum of the lowest single precision floating-point values of A and B.
R0 := min(A0,B0);
Corresponding intrinsic: _mm_min_ss
Compute the maximums of the two double precision floating-point values of A and B.
R0 := max(A0,B0);
R1 := max(A1,B1);
Corresponding intrinsic: _mm_max_pd
Compute the maximums of the four single precision floating-point values of A and B.
R0 := max(A0,B0);
R1 := max(A1,B1);
R2 := max(A2,B2);
R3 := max(A3,B3);
Corresponding intrinsic: _mm_max_ps
Compute the maximum of the lowest single precision floating-point values of A and B.
R0 := max(A0,B0);
Corresponding intrinsic: _mm_max_ss
Logical Operators
The table below lists the logical operators of the Fvec classes and generic syntax. The logical
operators for F32vec1 classes use only the lower 32 bits.
AND & R = A & B;
&= R &= A;
OR | R = A | B;
|= R |= A;
XOR ^ R = A ^ B;
^= R ^= A;
The following table lists standard logical operator syntax and corresponding intrinsics. Note that there
is no corresponding scalar intrinsic for the F32vec1 class; its operators use the packed vector
intrinsics and access only the lower 32 bits.
Compare Operators
The operators described in this section compare the floating-point values of A and B. Comparison
between objects of any Fvec class returns an object of the same class as the operands.
The following table lists the compare operators for the Fvec classes.
The mask is set to 0xffffffff for each floating-point value where the comparison is true and
0x00000000 where the comparison is false. The table below shows the return values for each class
of the compare operators, which use the syntax described earlier in the Return Value Notation section.
The table below shows examples for arithmetic operators and intrinsics.
The following table shows examples for conditional select operations and corresponding intrinsics.
Stores (non-temporal) the four single-precision, floating-point values of A. Requires a 16-byte aligned
address.
Debugging
The debug operations do not map to any compiler intrinsics for MMX(TM) technology or Streaming
SIMD Extensions. They are provided for debugging programs only. Use of these operations may result
in loss of performance, so you should not use them outside of debugging.
Output Operations
The two, double-precision floating-point values of A are placed in the output buffer and printed in
decimal format as follows:
The four, single-precision floating-point values of A are placed in the output buffer and printed in
decimal format as follows:
The lowest, single-precision floating-point value of A is placed in the output buffer and printed.
Read one of the two, double-precision floating-point values of A without modifying the corresponding
floating-point value. Permitted values of i are 0 and 1. For example:
If DEBUG is enabled and i is not one of the permitted values (0 or 1), a diagnostic message is printed
and the program aborts.
Read one of the four, single-precision floating-point values of A without modifying the corresponding
floating point value. Permitted values of i are 0, 1, 2, and 3. For example:
If DEBUG is enabled and i is not one of the permitted values (0-3), a diagnostic message is printed
and the program aborts.
Modify one of the two, double-precision floating-point values of A. Permitted values of int i are 0 and
1. For example:
Modify one of the four, single-precision floating-point values of A. Permitted values of int i are 0, 1,
2, and 3. For example:
If DEBUG is enabled and int i is not one of the permitted values (0-3), a diagnostic message
is printed and the program aborts.
Stores the two, double-precision floating-point values of A. No assumption is made for alignment.
Loads four, single-precision floating-point values, copying them into the four floating-point values of A.
No assumption is made for alignment.
Stores the four, single-precision floating-point values of A. No assumption is made for alignment.
Selects and interleaves the higher, double-precision floating-point values from A and B.
Selects and interleaves the lower two, single-precision floating-point values from A and B.
Selects and interleaves the higher two, single-precision floating-point values from A and B.
Creates a 2-bit mask from the most significant bits of the two, double-precision floating-point values of
A, as follows:
int i = move_mask(F64vec2 A)
i := sign(a1)<<1 | sign(a0)<<0
Corresponding intrinsic: _mm_movemask_pd
Creates a 4-bit mask from the most significant bits of the four, single-precision floating-point values of
A, as follows:
int i = move_mask(F32vec4 A)
i := sign(a3)<<3 | sign(a2)<<2 | sign(a1)<<1 | sign(a0)<<0
Corresponding intrinsic: _mm_movemask_ps
Operators                Corresponding Intrinsic
F64vec2ToInt _mm_cvttsd_si32
F32vec4ToF64vec2 _mm_cvtps_pd
F64vec2ToF32vec4 _mm_cvtpd_ps
IntToF64vec2 _mm_cvtsi32_sd
F32vec4ToInt _mm_cvtt_ss2si
F32vec4ToIs32vec2 _mm_cvttps_pi32
IntToF32vec4 _mm_cvtsi32_ss
Is32vec2ToF32vec4 _mm_cvtpi32_ps
Programming Example
This sample program uses the F32vec4 class to average the elements of a 20-element floating-point
array. This code is also provided as a sample in the file AvgClass.cpp.
//*****************************************************************
// Function: Add20ArrayElements
// Add all the elements of a 20 element array
//*****************************************************************
void Add20ArrayElements (F32vec4 *array, float *result)
{
F32vec4 vec0, vec1;
vec0 = _mm_load_ps ((float *) array); // Load elements 1-4
//*****************************************************
// Add all elements of the array, 4 elements at a time
//******************************************************
vec0 += array[1]; // Add elements 5-8
vec0 += array[2]; // Add elements 9-12
vec0 += array[3]; // Add elements 13-16
vec0 += array[4]; // Add elements 17-20
//*****************************************************************
// There are now 4 partial sums. Add the 2 lower sums to the
// 2 upper sums, then add those 2 results together
//*****************************************************************