Technology Guide
Intel® AVX-512 Instruction Set for Packet Processing
Intel Corporation
Authors: Ray Kinsella, Chris MacNamara, Georgii Tkachuk
Reviewer: Konstantin Ananyev

1 Introduction
The Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set is a powerful addition to the packet processing toolkit. As Intel's latest generation of SIMD instruction set, Intel® AVX-512 (also known as AVX-512) is a game changer, doubling register width, doubling the number of available registers, and generally offering a more flexible instruction set compared to its predecessors. Intel® AVX-512 has been available since the 1st Generation Intel® Xeon® Scalable processors and is now optimized in the latest 3rd Generation processors with compelling performance benefits.
This paper is the first in a series of white papers focusing on how to write packet processing software using the Intel® AVX-512 instruction set. It provides a brief overview of the Intel® AVX-512 instruction set and describes the microarchitecture optimizations for the instruction set in the latest 3rd Generation Intel® Xeon® Scalable processors.
The next paper in the series describes examples of where Intel® AVX-512 is being used in packet processing, including Intel® AVX-512 optimizations that demonstrate performance gains in excess of 300% in microbenchmarks 1. An executive summary of these papers is also available.
This technology guide is intended for organizations developing or deploying packet
processing software on the latest 3rd Generation Intel® Xeon® Scalable processors.
It is a part of the Network Transformation Experience Kit, which is available at https://networkbuilders.intel.com/network-technologies/network-transformation-exp-kits.
1 See section 4.4.1 for details. See backup for workloads and configurations or visit www.Intel.com/PerformanceIndex. Results may vary.
Table of Contents
1 Introduction
1.1 Terminology
1.2 Reference Documentation
2 Overview
3 Intel® AVX-512 Instruction Set
4 Intel® AVX-512 and Intel® Xeon® Scalable Processor Microarchitecture
5 Summary

Figures
Figure 1. Vector Addition with Intel® SSE, Intel® AVX2, and Intel® AVX-512 Instruction Sets
Figure 2. Masked Vector Addition, Mask, and Source Register
Figure 3. Masked Vector Addition and Mask Register
Figure 4. Generating 8-bit Masks from 64-bit Packed Values
Figure 5. Packed 64-bit Addition with Vector Length Extensions
Figure 6. Ternary Bitwise Operations
Figure 7. Ternary Instruction Truth Table
Figure 8. Ternary Operation with Truth Table
Figure 9. Processor Core Pipeline Functionality of the 1st Generation Intel® Xeon® Scalable Processor Microarchitecture
Figure 10. Workload Mix and Core Frequency
Figure 11. Processor Core Pipeline Functionality of the 1st Generation Intel® Xeon® Scalable Processor Microarchitecture
Figure 12. Processor Core Pipeline Functionality of the 1st Generation Intel® Xeon® Scalable Processor Microarchitecture
Figure 13. Timeline of SIMD Instruction Mix and Power Licensing
Figure 14. Power License Level Transitions Over Time by SIMD Implementation
Figure 15. Single Core, Single Thread, 64-byte Packet Performance with DPDK L3FWD-ACL and DPDK-TEST-ACL Example Applications, 4096 Flows, 4096 ACL Rules on 3rd Generation Intel® Xeon® Scalable Processors

Tables
Table 1. 64-bit Vector Addition
Table 2. Packed Data Type Vector Addition
Table 3. 64-bit Masked Vector Addition
Table 4. Generation of 8-bit Mask from 64-bit Packed Values
Table 5. 64-bit Vector Length Extensions and Vector Addition
Table 6. 64-bit Ternary Logic
Table 7. Maximum Intel® Turbo Boost Technology Core Frequency Levels
Table 8. DPDK L3fwd ACL Power Transitions Test Configuration
1.1 Terminology
ABBREVIATION DESCRIPTION
AVX Advanced Vector Extensions
DPDK Data Plane Development Kit (dpdk.org)
FD.io Fast Data I/O, an umbrella project for open source network projects
FMA Fused multiply-add
HPC High Performance Computing
IDS/IPS Intrusion Detection Systems/Intrusion Prevention Systems
ISA Instruction Set Architecture
SIMD Single Instruction Multiple Data: a term used to describe vector instruction sets such as Intel® SSE and Intel® AVX
SSE Streaming SIMD Extensions (predecessor to AVX)
TDP Thermal Design Power
VPP FD.io Vector Packet Processing, an Open Source networking stack (part of FD.io)
2 Overview
While SIMD instructions are perhaps best known for their use in domains such as HPC, image processing, and artificial intelligence,
what may be less appreciated is SIMD’s applicability to packet processing. In fact, SIMD instructions are already commonly used in
packet processing applications to speed up compute-bound algorithms. Intel® AVX-512, Intel's very latest SIMD instruction set, is a
richer and more flexible instruction set compared to its predecessors, introducing new concepts such as masked operations,
broadcasts, instruction set extensions, register modes, and bitwise ternary instructions.
This technology guide has two sections that are structured as follows:
• The first section is a high-level instruction set overview, describing some of the new concepts implemented in the Intel® AVX-
512 instruction set.
• The second section details how the latest Intel® Xeon® Scalable processors have been optimized for the Intel® AVX-512 instruction set.
3 Intel® AVX-512 Instruction Set
Table 1 lists the 64-bit vector addition intrinsic in each instruction set generation.
Table 1. 64-bit Vector Addition
INSTRUCTION SET    C INTRINSIC FORM OF INSTRUCTION    PACKED 64-BIT INTEGERS    ASSEMBLER FORM
Intel® SSE2        _mm_add_epi64                      2                         paddq
Intel® AVX2        _mm256_add_epi64                   4                         vpaddq
Intel® AVX-512     _mm512_add_epi64                   8                         vpaddq
The Intel® Streaming SIMD Extensions 2 (Intel® SSE2) instruction _mm_add_epi64 (paddq) adds two 128-bit registers a and b and
returns the result in a single 128-bit integer register. Each register contains two packed 64-bit integers, thereby completing two
distinct 64-bit additions in a single instruction.
Intel® AVX2 added the instruction _mm256_add_epi64 (vpaddq), which expands integer vector addition to 256-bit registers. This expansion doubles the work accomplished in a single instruction when compared to SSE2 instructions, completing four distinct 64-bit additions in a single instruction.
Intel® AVX-512 again doubles the number of packed-integer operations performed in a single instruction. It adds a 512-bit variant in the form of the _mm512_add_epi64 (vpaddq) instruction, expanding packed-integer addition to 512-bit registers and allowing completion of eight distinct 64-bit additions in a single instruction.
Figure 1 is a graphical representation of the vector addition instructions described above. The figure shows both the register size
and the amount of work performed doubling with each instruction set generation, represented by the packed integers a0 to a7.
Figure 1. Vector Addition with Intel® SSE, Intel® AVX2, and Intel® AVX-512 Instruction Sets
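For illustration, a minimal sketch of these three widths of packed 64-bit addition, using hypothetical values and the C intrinsics named above, might look as follows (compile with SSE2, AVX2, and AVX-512F support):

#include <immintrin.h>
#include <stdint.h>

/* Minimal sketch: the same eight 64-bit additions expressed with SSE2,
 * AVX2, and AVX-512 intrinsics. Values are illustrative only. */
void add_epi64_widths(void)
{
    int64_t a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    int64_t b[8] = {7, 6, 5, 4, 3, 2, 1, 0};
    int64_t r[8];

    /* Intel SSE2 (paddq): two 64-bit additions per instruction. */
    __m128i x = _mm_add_epi64(_mm_loadu_si128((const __m128i *)a),
                              _mm_loadu_si128((const __m128i *)b));
    _mm_storeu_si128((__m128i *)r, x);

    /* Intel AVX2 (vpaddq): four 64-bit additions per instruction. */
    __m256i y = _mm256_add_epi64(_mm256_loadu_si256((const __m256i *)a),
                                 _mm256_loadu_si256((const __m256i *)b));
    _mm256_storeu_si256((__m256i *)r, y);

    /* Intel AVX-512 (vpaddq): eight 64-bit additions per instruction. */
    __m512i z = _mm512_add_epi64(_mm512_loadu_si512((const void *)a),
                                 _mm512_loadu_si512((const void *)b));
    _mm512_storeu_si512((void *)r, z);
}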
It is worth mentioning here that the Intel® AVX2 instruction set (AVX2) also introduced three-operand, non-destructive instructions, featuring a destination operand in addition to the two source operands. This was an improvement in AVX2 over Intel® SSE
instructions in that it was no longer necessary for the operation to overwrite one of the source operands. Intel® AVX-512
instructions are also non-destructive in this way.
AVX-512 instructions also expand the number of available SIMD registers from 16 to 32, doubling the number of values that can be
concurrently held in registers.
Intel® AVX-512 intrinsics use a suffix to indicate how their operands are treated: the epi suffix indicates packed signed integer operands, the epu suffix indicates packed unsigned integer operands, the pd suffix indicates packed double-precision floating-point operands, and the ps suffix indicates packed single-precision floating-point operands.
The C data type of the intrinsic operands is declared as either __m512i, indicating packed integers, or __m512d, indicating packed
double precision floating points, or __m512, indicating packed single precision floating points. It follows that AVX-512 intrinsics
with a ps suffix would therefore have operands with a data type of __m512.
To understand this concept in detail, consider the example of vector addition shown in Table 2.
Table 2. Packed Data Type Vector Addition
SUFFIX   C DATA TYPE   INTRINSIC                                PACKED OPERANDS
_pd      __m512d       _mm512_add_pd (__m512d a, __m512d b)     8 x 64-bit double precision floating point
_ps      __m512        _mm512_add_ps (__m512 a, __m512 b)       16 x 32-bit single precision floating point
In Table 2, all the intrinsics are variants of the 512-bit vector addition, with each intrinsic suffix indicating how the operands are
treated during the addition. For example, the epi64 suffix indicates that the intrinsic _mm512_add_epi64 will treat operands a and b
as packed 64-bit integers, with eight 64-bit integers packing into a 512-bit register. Similarly, the epi32 suffix indicates that the
intrinsic _mm512_add_epi32 will treat operands a and b as packed 32-bit integers, with sixteen 32-bit integers packing into a single
512-bit register, and so on.
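As a brief illustration of how the suffix selects the operand type, a minimal sketch using hypothetical constant values might look as follows:

#include <immintrin.h>

/* Minimal sketch: the suffix selects how the 512-bit operands are treated. */
void add_suffix_examples(void)
{
    __m512i vi = _mm512_set1_epi32(1);      /* 16 x 32-bit packed integers */
    __m512i ri = _mm512_add_epi32(vi, vi);  /* sixteen parallel additions  */

    __m512d vd = _mm512_set1_pd(1.0);       /* 8 x 64-bit packed doubles   */
    __m512d rd = _mm512_add_pd(vd, vd);     /* eight parallel additions    */

    (void)ri;
    (void)rd;
}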
In Table 3, notice that the AVX-512 instructions described in the previous section are included along with two new variants that have no equivalents in previous instruction set generations:
• _mm512_mask_add_epi64: The mask register k controls the generation of the packed return value. For each packed 64-bit integer, it indicates whether the value is read from the src register or is the result of the packed 64-bit addition of a and b.
• _mm512_maskz_add_epi64: Similarly, here the mask register k indicates whether each packed 64-bit integer of the return value is zero or the result of the packed addition of a and b.
Figure 2 is a graphical representation of the intrinsic _mm512_mask_add_epi64. It shows a mask register indicating, via single-bit
flags, whether each packed 64-bit integer of the return value is read from corresponding 64-bit values in a source register or is the
result of the addition of 64-bit values in a and b.
Figure 2. Masked Vector Addition, Mask, and Source Register
Figure 3. Masked Vector Addition and Mask Register
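A minimal sketch of both masked variants, assuming the illustrative mask value 0xAA (binary 10101010) shown in the figures, might look as follows:

#include <immintrin.h>

/* Minimal sketch of the masked and zero-masked 64-bit additions. */
void masked_add_examples(__m512i a, __m512i b, __m512i src,
                         __m512i *dst, __m512i *dstz)
{
    __mmask8 k = 0xAA;  /* select the odd-indexed 64-bit lanes */

    /* Lanes with a 0 mask bit are copied unchanged from src. */
    *dst = _mm512_mask_add_epi64(src, k, a, b);

    /* Lanes with a 0 mask bit are set to zero. */
    *dstz = _mm512_maskz_add_epi64(k, a, b);
}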
Details of the two AVX-512 intrinsics listed in Table 4 are given below:
• _mm512_movepi64_mask: This instruction sets each bit of the returned 8-bit mask based on the most significant bit of the corresponding packed 64-bit integer in the 512-bit register a.
• _mm512_cmp_epu64_mask: This instruction compares the packed unsigned 64-bit integers in a and b based on the
comparison operand specified in imm8 and stores the results in an 8-bit mask register.
Figure 4 is a graphical representation of the _mm512_cmp_epu64_mask intrinsic. It generates a mask based on a comparison of
two packed 64-bit values a and b. The type of comparison is controlled by the immediate operator imm8 with the range of possible
values as shown in the table within the figure.
Figure 4. Generating 8-bit Masks from 64-bit Packed Values
CONSTANT OPERATOR
_MM_CMPINT_EQ Equals
_MM_CMPINT_LT Less than
_MM_CMPINT_LE Less than or Equal
_MM_CMPINT_FALSE False
_MM_CMPINT_NE Not Equal
_MM_CMPINT_NLT Greater than or Equal
_MM_CMPINT_NLE Greater than
_MM_CMPINT_TRUE True
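A minimal sketch of both mask-generating intrinsics described above, with illustrative function names, might look as follows:

#include <immintrin.h>

/* Minimal sketch of mask generation from packed 64-bit values. */
__mmask8 sign_bit_mask(__m512i a)
{
    /* One mask bit per 64-bit lane, taken from the lane's most
     * significant bit (requires AVX512DQ). */
    return _mm512_movepi64_mask(a);
}

__mmask8 less_than_mask(__m512i a, __m512i b)
{
    /* One mask bit per unsigned 64-bit lane where a < b. */
    return _mm512_cmp_epu64_mask(a, b, _MM_CMPINT_LT);
}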
3.3 Extensions
Intel® AVX-512 is a family of instruction set extensions whose first member is AVX-512 Foundation (AVX512F). AVX512F extended
AVX/AVX2 instructions to support 512-bit operands. It also added masked variants of these instructions, as well as instructions for
creating and manipulating masks. A series of extensions follows thereafter on top of AVX512F, each of which added a distinct set of
new functionalities.
For example, four of the five AVX-512 intrinsics discussed up to this point (_mm512_add_epi64, _mm512_mask_add_epi64, _mm512_maskz_add_epi64, and _mm512_cmp_epu64_mask) were added as part of AVX512F. The remaining intrinsic, _mm512_movepi64_mask, was added as part of the AVX512DQ extension. It is important to note that AVX-512 is therefore not a single instruction set; it is, instead, improved and extended with each new generation of Intel® Xeon® Scalable processors.
• The 1st Generation Intel® Xeon® Scalable processor for servers introduced the following extensions:
− AVX-512 Foundation (AVX512F)
− Conflict Detection (AVX512CD)
− Vector Length (AVX512VL)
− Doubleword and Quadword (AVX512DQ)
The AVX-512 vector length (AVX512VL) extensions add four new intrinsics that have no equivalent in previous instruction set generations:
• _mm_mask_add_epi64 and _mm_maskz_add_epi64: These intrinsics are similar to _mm512_mask_add_epi64 and
_mm512_maskz_add_epi64 intrinsics described in section 3.2; they are masked variants of the 128-bit SSE _mm_add_epi64
intrinsic.
• _mm256_mask_add_epi64 and _mm256_maskz_add_epi64: These intrinsics are similar to the _mm_mask_add_epi64 and _mm_maskz_add_epi64 intrinsics described above; they are masked variants of the 256-bit AVX2 _mm256_add_epi64 intrinsic.
Figure 5 is a graphical representation of the _mm_maskz_add_epi64, _mm256_maskz_add_epi64, and _mm512_maskz_add_epi64
intrinsics. It shows masked vector addition extended to support 128-bit, 256-bit, and 512-bit register sizes.
Figure 5. Packed 64-bit Addition with Vector Length Extensions
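A minimal sketch of the 128-bit and 256-bit zero-masked variants, assuming a processor with AVX512VL support, might look as follows:

#include <immintrin.h>

/* Minimal sketch of the vector length extensions: the same zero-masked
 * addition expressed on 128-bit and 256-bit registers. */
__m128i maskz_add_128(__mmask8 k, __m128i a, __m128i b)
{
    return _mm_maskz_add_epi64(k, a, b);    /* 2 x 64-bit lanes */
}

__m256i maskz_add_256(__mmask8 k, __m256i a, __m256i b)
{
    return _mm256_maskz_add_epi64(k, a, b); /* 4 x 64-bit lanes */
}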
Intel® AVX-512 also introduces bitwise ternary logic instructions, which compute a bitwise function of three source operands in a single instruction. The scalar C equivalent of one such operation is:
uint64_t dst, a, b, c;
…
dst = (a ^ b) & c;
Table 6 gives the equivalent AVX-512 intrinsic, shown here in its masked form:
AVX-512 __m512i _mm512_mask_ternarylogic_epi64 (__m512i src, __mmask8 k, __m512i a, __m512i b, int imm8)
The _mm512_ternarylogic_epi64 intrinsic uses the argument imm8 to indicate how the input vectors a, b, and c must be treated,
that is, which bitwise operations must be performed. The value for imm8 is constructed using a truth table, an example of which is
shown in Figure 7.
To complete the truth table, substitute the symbol ⦻ with whichever bitwise operations you want to perform on the operands. Record the value generated for each combination of the inputs A, B, and C in the final column. Repeat this for all eight rows; the resulting column of bits is then combined to form the imm8 value.
10
Technology Guide | Intel® AVX-512 - Instruction Set for Packet Processing
BIT   A   B   C   (A ^ B) & C
0     0   0   0   0
1     0   0   1   0
2     0   1   0   0
3     0   1   1   1
4     1   0   0   0
5     1   0   1   1
6     1   1   0   0
7     1   1   1   0

Figure 7. Ternary Instruction Truth Table

Reading the final column from bit 7 down to bit 0 gives the binary value 00101000, that is, imm8 = 0x28 for the operation (A ^ B) & C.

Figure 8. Ternary Operation with Truth Table
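A minimal sketch of the resulting intrinsic call, using the imm8 value 0x28 derived above, might look as follows:

#include <immintrin.h>

/* Minimal sketch: imm8 = 0x28 selects the bitwise function (a ^ b) & c,
 * applied across all 512 bits in a single instruction. */
__m512i xor_then_and(__m512i a, __m512i b, __m512i c)
{
    return _mm512_ternarylogic_epi64(a, b, c, 0x28);
}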
REFERENCE CHAPTER
Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 1: Basic Architecture 15
Two execution ports have the potential to retire Intel® AVX-512 instructions concurrently, yielding a maximum throughput of 1024 bits; that is, two AVX-512 instructions retiring in parallel, each producing 512 bits of data.
The green stars in Figure 9 represent new features in the 1st Generation Intel® Xeon® Scalable processor microarchitecture for servers compared to the client microarchitecture, including a 1 MB L2 cache and an additional AVX-512 execution port. As shown in Figure 9, when executing AVX-512 instructions, Port 0 and Port 1 fuse into a single execution port. Intel® AVX-512 instructions may also run on Port 5, depending on the type of instruction. Note that shuffles, for example, execute only on Port 5.
The 1st Generation Intel® Xeon® Scalable processor microarchitecture for servers features two 512-bit wide load ports and a single 512-bit store port, enabling two 512-bit load operations and a single 512-bit store operation to complete concurrently.
In comparison, Ports 0, 1, and 5 may be used to run Intel® AVX2 or Intel® SSE instructions. The maximum throughput of AVX2 is therefore 768 bits; that is, three AVX2 instructions retiring concurrently, each yielding 256 bits of data. AVX2 instructions have the advantage of an additional execution port compared to AVX-512, as Port 0 and Port 1 are unfused. However, AVX-512 still has the potential, under ideal conditions, to retire an additional 256 bits of data concurrently compared to AVX2.
The load and store ports on the 1st Generation Intel® Xeon® Scalable processor microarchitecture for servers may also be used for 256-bit loads and stores, with no adverse effect beyond the reduced load and store throughput of AVX2 compared to AVX-512.
Figure 9. Processor Core Pipeline Functionality of the 1st Generation Intel® Xeon® Scalable Processor Microarchitecture
For more information on the 1st Generation Intel® Xeon® Scalable processor microarchitecture for servers, see section 4.5.
4.2 Power
Power is an important consideration when developing software with Intel® AVX2 and Intel® AVX-512 instructions on Intel® Xeon® Scalable processors. The wider vector SIMD instructions may consume more power per core, with the power increase typically proportional to the performance increase; that is, the additional power drawn generally buys a corresponding gain in performance.
The 1st Generation Intel® Xeon® Scalable processor microarchitecture for servers dynamically manages the voltage based on the instruction set in use, which may result in the CPU selecting an appropriate frequency at which a given core runs. The selected frequency depends on several factors, including the instruction mix executing on the core and the number of other cores executing a similar instruction mix. The instruction mix comprises the type, width, and number of instructions, including vector instructions, that run over a given period of time.
The maximum Intel® Turbo Boost core frequency (P0n) achievable is influenced by the category of instructions executing on that
core and the number of cores sharing that category. The more active cores in the shared category, the lower the frequency of those
cores. These categories are called power levels and are described in Table 7.
There are three power levels in the Intel® Xeon® Scalable processor that the CPU may apply in non-power limited scenarios. When
power is limited, the microprocessor may also choose to adjust the frequency of the cores to stay within its stated TDP (Thermal
Design Power) limit.
Table 7. Maximum Intel® Turbo Boost Technology Core Frequency Levels
Power Level 0 (Intel® AVX2 Light): Maximum frequency. Scalar, SSE integer and floating-point, and AVX2 integer instructions, except for AVX2 integer multiplication and fused multiply-add (FMA) instructions.
Power Level 1 (Intel® AVX2 Heavy / AVX-512 Light): Generation and SKU dependent. AVX2 integer multiplication, FMA, and floating-point instructions; AVX-512 integer instructions, except for AVX-512 integer multiplication and FMA instructions.
Power Level 2 (Intel® AVX-512 Heavy): Reduced P0n (Turbo) frequency compared to Power Level 1. AVX-512 integer multiplication, FMA, and floating-point instructions.
Figure 11. Processor Core Pipeline Functionality of the 1st Generation Intel® Xeon® Scalable Processor Microarchitecture
For example, if no additional instructions execute on a given core requiring the AVX-512 Light (Level 1) power level, the core
transitions back to the original power level of AVX2 Light (Level 0). However, if additional instructions that do require the AVX-512
Light (Level 1) power level run during this time period, the power level is maintained, and no transition occurs.
Figure 13. Timeline of SIMD Instruction Mix and Power Licensing (over successive intervals of roughly 1000 microseconds, the power license moves from AVX2 Light to AVX2 Heavy / AVX-512 Light and back to AVX2 Light)
4.3.1 Power Level Update for 3rd Generation Intel® Xeon® Scalable Processors
Frequency transitions are significantly improved in the 3rd Generation Intel® Xeon® Scalable processors where the block time is
reduced to ~0 µsec, allowing for transitions without a latency cost.
2 See backup for workloads and configurations or visit www.Intel.com/PerformanceIndex. Results may vary.
The DPDK l3fwd-acl sample application is configured with 4096 forwarding rules.
Note: The DPDK ACL library is not the only user of Intel® AVX-512 in the packet processing pipeline. Poll mode drivers and memory copy routines may also use Intel® AVX-512 optimizations. Therefore, be careful of possible misconfigurations, such as enabling Intel® AVX-512 optimization in the poll mode driver and memory copies (with --force-max-simd-bitwidth) while disabling Intel® AVX-512 optimization in the DPDK ACL library (with --alg=avx2), and so on.
After the DPDK l3fwd-acl sample application is up and running and forwarding packets, the power level of the core on which it is
running can be measured using Linux perf and the core_power.* events as follows.
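The exact perf events available depend on the processor and kernel; on recent Intel® Xeon® Scalable processors the license-level residency counters are typically exposed as core_power.lvl0_turbo_license, core_power.lvl1_turbo_license, and core_power.lvl2_turbo_license. A sketch of such a measurement, assuming the l3fwd-acl worker thread is pinned to core 1, might look like:

perf stat -e core_power.lvl0_turbo_license,core_power.lvl1_turbo_license,core_power.lvl2_turbo_license -C 1 -- sleep 10

Counting these events while the application forwards traffic shows the following: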
a. The DPDK l3fwd-acl sample application never transitions between power states. When configured to use Intel® AVX-512 optimizations, it stays continually in the AVX-512 Light (Level 1) power level, and performance is not adversely impacted by the power level transitions described in section 4.3.
b. Similarly, when configured to use Intel® AVX2 optimizations instead of Intel® AVX-512, the sample application stays
consistently at the AVX2 Light (Level 0) power level.
3 See backup for workloads and configurations or visit www.Intel.com/PerformanceIndex. Results may vary.
c. The DPDK l3fwd-acl sample application exclusively uses AVX-512 Light (Level 1) power level instructions and entirely avoids the more impactful frequency shifts associated with the AVX-512 Heavy (Level 2) power level.
Figure 14. Power License Level Transitions Over Time by SIMD Implementation
As shown in Figure 15, the Intel® AVX-512 instruction set improves the flow search performance of the DPDK ACL library by up to 300% compared to scalar lookups when tested with an ACL flow lookup microbenchmark 4. When the same DPDK ACL library is used within a Layer 3 forwarding sample application (l3fwd), the Intel® AVX-512 instruction set improves packet processing performance by up to 1.35x.
Figure 15. Single Core, Single Thread, 64-byte Packet Performance with DPDK L3FWD-ACL and DPDK-TEST-ACL Example
Applications, 4096 Flows, 4096 ACL Rules on 3rd Generation Intel® Xeon® Scalable Processors
The test configuration used for the DPDK l3fwd ACL power transitions tests is listed in Table 8.
Table 8. DPDK L3fwd ACL Power Transitions Test Configuration
PARAMETERS DESCRIPTION
Test by Intel
Test date 28 Mar 2021
Platform Intel Corporation Reference Platform*
# Nodes 1
# Sockets 2
CPU Intel® Xeon® Gold 6338N CPU
Cores/socket, Threads/socket 32, 64
CPU Frequency 2.2 GHz
L3 Cache 48MiB
microcode 0x8d055260
BIOS version WLYDCRB1.SYS.0020.P86.2103050636
System DDR Mem Config: slots/capacity/run-speed 16/16384 GB/3200 MT/s
Turbo, P-states OFF
Hyper Threads Enabled
NIC Intel Corporation Ethernet Controller E810-2CQDA2 for QSFP
(rev 02)
OS Ubuntu 20.04.1 LTS (Focal Fossa)
Kernel 5.4.0-40-generic
DPDK v21.02
Compiler GCC 9.3.0-10 ubuntu2
* Intel® Reference Platform (RP) for 3rd Generation Intel® Xeon® Scalable Processor
4 See backup for workloads and configurations or visit www.Intel.com/PerformanceIndex. Results may vary.
REFERENCE CHAPTER
Intel® 64 and IA-32 Architectures Optimization Reference Manual (Ice Lake Client microarchitecture) 2.1
Intel® 64 and IA-32 Architectures Optimization Reference Manual (Skylake Server microarchitecture) 2.2
Intel® 64 and IA-32 Architectures Optimization Reference Manual (Skylake Server Power Management) 2.2.3
Intel® Xeon® processor Scalable family specification update (see Technical Documents) NA
New 3rd Gen Intel® Xeon® Scalable Processor Hot Chips 2020
5 Summary
If you are a network software engineer motivated to solve compute-bound problems in your packet processing software, this document's intention is to clearly establish that AVX-512 development skills are a worthwhile addition to your skill set and to provide enough pointers to get you started.
To this end, the first section, Intel® AVX-512 Instruction Set, describes improvements in throughput and flexibility in the Intel® AVX-512 instruction set. It covers the basics of the improvements with AVX-512, explaining how it doubles the throughput of each instruction compared to its predecessors. It then describes the improvements in flexibility, introducing new concepts such as masked and ternary operations and instruction set extensions.
The second section, Intel® AVX-512 and Intel® Xeon® Scalable Processor Microarchitecture, describes how the latest generations of
Intel® Xeon® Scalable processors are optimized to run AVX-512 instructions. It describes how the microprocessors’ execution ports
have been widened to 512 bits, and how AVX-512 execution compares with that of the Intel® AVX2 instruction set. The section also
describes power licensing, with examples using familiar packet processing applications.
The next white paper in this series describes Intel® AVX-512 usage in packet processing, concluding the series with some simple
examples of how Intel® AVX-512 has been used to create significant performance improvements 5 in some well-known packet
processing software. It describes the different development approaches to vector intrinsics and literals for writing AVX-512 code,
and how Intel® AVX-512 support is detected and implemented in the software. It then walks through simple examples of
optimizations achieved with Intel® AVX-512 instructions and the performance gains they generated.
5 See backup for workloads and configurations or visit www.Intel.com/PerformanceIndex. Results may vary.
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for
configuration details. No product or component can be absolutely secure.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular
purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
Intel technologies may require enabled hardware, software or service activation.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications.
Current characterized errata are available on request.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. *Other names and brands may
be claimed as the property of others.
0421/DN/WIT/PDF 633930-002US