Infinity Fabric
Infinity Fabric (IF) is a proprietary architecture designed by AMD that connects all components and facilitates data and
control transfer between them. IF is implemented in most of AMD's recent microarchitectures, including the
EPYC processors and other products.
In the AMD EPYC processor, the Core Complex Dies (CCDs) connect to memory, I/O, and each other
through an updated I/O die (Figure 1). Each CCD connects to the I/O die via a dedicated high-speed
Global Memory Interconnect (GMI) link. The I/O die helps maintain cache coherency and additionally
provides the interface that extends the Infinity Fabric to a potential second processor via its xGMI links. AMD EPYC
9004 Series processors support up to 4 xGMI links with speeds up to 32 Gbps.
Figure 1. AMD 4th Gen EPYC processor I/O die function logical view (source: AMD)
Access to the tool: AMD XIO is only available under NDA from AMD. If you would like to use this tool,
please contact AMD directly.
Table 2. Single Socket Theoretical Memory BW with different numbers of DIMMs installed
Memory Config    Theoretical Memory BW
12 x 4800MHz     12 x 4800 x 64bit / 8 = 460800 MB/s = 460.8 GB/s
10 x 4800MHz     10 x 4800 x 64bit / 8 = 384000 MB/s = 384.0 GB/s
8 x 4800MHz      8 x 4800 x 64bit / 8 = 307200 MB/s = 307.2 GB/s
6 x 4800MHz      6 x 4800 x 64bit / 8 = 230400 MB/s = 230.4 GB/s
4 x 4800MHz      4 x 4800 x 64bit / 8 = 153600 MB/s = 153.6 GB/s
2 x 4800MHz      2 x 4800 x 64bit / 8 = 76800 MB/s = 76.8 GB/s
1 x 4800MHz      1 x 4800 x 64bit / 8 = 38400 MB/s = 38.4 GB/s
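As a quick cross-check of the Table 2 figures, the following short Python sketch (not part of the measurement setup; the function name and defaults are illustrative) applies the same per-channel formula, assuming one DDR5-4800 DIMM per 64-bit channel:

# Minimal sketch (illustrative): theoretical memory bandwidth per Table 2's formula.
# Assumes one DDR5-4800 DIMM per channel and a 64-bit data bus per channel.
def theoretical_mem_bw_gbs(num_dimms, transfer_rate_mts=4800, bus_width_bits=64):
    bytes_per_transfer = bus_width_bits / 8       # 64-bit channel -> 8 bytes per transfer
    mb_per_s = num_dimms * transfer_rate_mts * bytes_per_transfer
    return mb_per_s / 1000                        # MB/s -> GB/s

for dimms in (12, 10, 8, 6, 4, 2, 1):
    print(f"{dimms:2d} x 4800MHz -> {theoretical_mem_bw_gbs(dimms):.1f} GB/s")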
Four xGMI links can support a maximum theoretical bandwidth of 512 GB/s between sockets, which more
than matches the maximum single socket theoretical memory bandwidth of 460.8 GB/s. This means remote
memory access can flow nearly at maximum bandwidth from one CPU to another.
The maximum theoretical bandwidth of three xGMI links is 384 GB/s, the same as the maximum
theoretical memory bandwidth of 10 channels. This means remote memory access can flow nearly at maximum bandwidth
from one CPU to another when 10 or fewer DIMMs are installed per socket.
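The same kind of arithmetic recovers the 512 GB/s and 384 GB/s xGMI figures. The sketch below is ours and assumes the quoted totals count both directions of each x16 link at 32 Gbps per lane:

# Minimal sketch (illustrative): theoretical xGMI bandwidth from link count, width, and speed.
# Assumption: the quoted totals count both directions of each x16 link at 32 Gbps per lane.
def xgmi_theoretical_bw_gbs(num_links, lanes=16, gbps_per_lane=32, directions=2):
    gbps_per_link = lanes * gbps_per_lane * directions   # 16 * 32 * 2 = 1024 Gbps per link
    return num_links * gbps_per_link / 8                 # Gbps -> GB/s

print(xgmi_theoretical_bw_gbs(4))   # 512.0 GB/s, matches the 4-link figure
print(xgmi_theoretical_bw_gbs(3))   # 384.0 GB/s, matches the 3-link figure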
The following table shows the STREAM memory bandwidth results with NPS0. NPS0 effectively means one
NUMA node for the entire system; it is only available on a 2-socket system, and firmware will attempt to
interleave all memory channels in the system. Since there are no local nodes for the application to
leverage, far more data crosses between the sockets on the xGMI links. The number, speed, and width
of the xGMI links all limit this bandwidth.
The STREAM Triad test results at NPS0 show the impact of limiting those variables, as the results are close
to the xGMI Theoretical BW value in Table 1. Note that xGMI Maximum Link Width = x16 and xGMI Max Speed
= 32 Gbps in Maximum Performance Mode; the Operating Mode must be changed to Custom Mode to change
these values.
Intel Memory Latency Checker (MLC) is a tool used to measure memory latency and bandwidth. It also
provides options for local and cross-socket memory latency and bandwidth checks.
For more information about MLC, see the following web page:
https://www.intel.com/content/www/us/en/download/736633/763324/intel-memory-latency-checker-intel-mlc.html
We use the following commands to print the local and cross-socket memory latency and bandwidth matrices:
mlc --latency_matrix
mlc --bandwidth_matrix
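When many configurations need to be measured, the two commands can be scripted. The following Python sketch (ours, not part of MLC) simply invokes the mlc binary, assuming it is on the PATH and is run with sufficient privileges:

# Minimal sketch (illustrative): run both MLC matrix tests and capture their output.
import subprocess

for option in ("--latency_matrix", "--bandwidth_matrix"):
    result = subprocess.run(["mlc", option], capture_output=True, text=True, check=True)
    print(f"=== mlc {option} ===")
    print(result.stdout)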
The following table shows the local and cross-socket memory latency and bandwidth on the 4 xGMI link
system with different xGMI link widths and speeds. Local node latency and bandwidth are not affected by
the xGMI link settings, but for the remote node, faster speed and larger width result in lower
latency and higher bandwidth.
Table 5. 4 xGMI Links System Local and Cross-socket Memory Latencies/Bandwidth with NPS1
xGMI Width/Speed    Local Node Latency (ns)    Remote Node Latency (ns)    Local Node Bandwidth (GB/s)    Remote Node Bandwidth (GB/s)
X4 / 16 Gbps 110.5 351.8 369.6 24.8
X4 / 32 Gbps 110.8 250.7 369.8 50.1
X16 / 16 Gbps 110.4 244.6 369.8 90.6
X16 / 32 Gbps 110.4 199.1 370.2 152.2
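To make the scaling in Table 5 explicit, the small sketch below (ours) takes the measured remote-node bandwidth values from the table and computes the ratios between configurations:

# Minimal sketch (illustrative): remote-node bandwidth scaling from the Table 5 measurements.
remote_bw_gbs = {            # (link width, speed in Gbps) -> remote node bandwidth (GB/s)
    ("x4", 16): 24.8,
    ("x4", 32): 50.1,
    ("x16", 16): 90.6,
    ("x16", 32): 152.2,
}

# Doubling the link speed at x4 width roughly doubles remote bandwidth (~2.0x).
print(remote_bw_gbs[("x4", 32)] / remote_bw_gbs[("x4", 16)])
# Widening from x4 to x16 at 32 Gbps gives roughly a 3x gain.
print(remote_bw_gbs[("x16", 32)] / remote_bw_gbs[("x4", 32)])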
The following table shows the results of the 3 xGMI link system, which lead to the same conclusion.
Figures 8 and 9 compare remote node latency and bandwidth between the 3 xGMI link system and the 4 xGMI
link system. The remote node latencies are very close, but 4 xGMI links deliver much higher remote node
bandwidth than 3 xGMI links, with a ratio close to 4:3, the same as the ratio of the number of links.
Besides the number of xGMI links, the remote node bandwidth also scales with xGMI width and
speed.
Figure 8. Remote Node Latency comparison between 3 xGMI links and 4 xGMI links
Summary
The ThinkSystem SR665 V3 has flexible xGMI inter-processor links allowing one link to be converted to two
x16 PCIe 5.0 connections, which can provide more PCIe connections for greater PCIe/NVMe support.
The maximum theoretical bandwidth of four xGMI links is greater than the bandwidth of 12 channels of 4800 MHz DDR5, which
means remote memory access can flow nearly at maximum bandwidth from one CPU to another. Three
xGMI links may be acceptable for NUMA-aware workloads or a reduced memory population.
xGMI link speed and width are configurable in the UEFI. For NUMA-aware workloads, reduced link speed
and width can save uncore power to reduce overall power consumption and divert more power to the cores
for increased core frequency.
For NUMA-unaware workloads, when CPU 1 accesses memory attached directly to CPU 0, the request must
cross the xGMI links between the two sockets. This access is "non-uniform": CPU 0 accesses this memory
faster than CPU 1 because of the distance between the two sockets. In that case, the number, speed, and width
of the xGMI links impact the overall performance.
Authors
Peter Xu is a Systems Performance Verification Engineer in the Lenovo Infrastructure Solutions Group
Performance Laboratory in Morrisville, NC, USA. His current role includes CPU, Memory, and PCIe
subsystem analysis and performance validation against functional specifications and vendor targets. Peter
holds a Bachelor of Electronic and Information Engineering and a Master of Electronic Science and
Technology, both from Hangzhou Dianzi University.
Redwan Rahman is a Systems Performance Verification Engineer in the Lenovo Infrastructure Solutions
Group Performance Laboratory in Morrisville, NC, USA. His current role includes CPU, Memory, and PCIe
subsystem analysis and performance validation against functional specifications and vendor targets.
Redwan holds a Bachelor of Science in Computer Engineering from University of Massachusetts Amherst.
LENOVO PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made to the
information herein; these changes will be incorporated in new editions of the publication. Lenovo may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
The products described in this document are not intended for use in implantation or other life support applications
where malfunction may result in injury or death to persons. The information contained in this document does not
affect or change Lenovo product specifications or warranties. Nothing in this document shall operate as an express
or implied license or indemnity under the intellectual property rights of Lenovo or third parties. All information
contained in this document was obtained in specific environments and is presented as an illustration. The result
obtained in other operating environments may vary. Lenovo may use or distribute any of the information you supply
in any way it believes appropriate without incurring any obligation to you.
Any references in this publication to non-Lenovo Web sites are provided for convenience only and do not in any
manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials
for this Lenovo product, and use of those Web sites is at your own risk. Any performance data contained herein was
determined in a controlled environment. Therefore, the result obtained in other operating environments may vary
significantly. Some measurements may have been made on development-level systems and there is no guarantee
that these measurements will be the same on generally available systems. Furthermore, some measurements may
have been estimated through extrapolation. Actual results may vary. Users of this document should verify the
applicable data for their specific environment.