Abstract—Intel provides Hyper-Threading (HT) in processors based on its Pentium and Nehalem micro-architecture such as the Westmere-EP. HT enables two threads to execute on each core in order to hide latencies related to data access. These two threads can execute simultaneously, filling unused stages in the functional unit pipelines. To aid better understanding of HT-related issues, we collect Performance Monitoring Unit (PMU) data (instructions retired; unhalted core cycles; L2 and L3 cache hits and misses; vector and scalar floating-point operations, etc.). We then use the PMU data to calculate a new metric of efficiency in order to quantify processor resource utilization and make comparisons of that utilization between single-threading (ST) and HT modes. We also study performance gain using unhalted core cycles, the code's efficiency in using the processor's vector units, and the impact of HT mode on various shared resources like L2 and L3 cache. Results using four full-scale, production-quality scientific applications from computational fluid dynamics (CFD) used by NASA scientists indicate that HT generally improves processor resource utilization efficiency, but does not necessarily translate into overall application performance gain.

Keywords: Simultaneous Multi-Threading (SMT), Hyper-Threading (HT), Intel's Nehalem micro-architecture, Intel Westmere-EP, Computational Fluid Dynamics (CFD), SGI Altix ICE 8400EX, Performance Tools, Benchmarking, Performance Evaluation

I. INTRODUCTION

Current trends in microprocessor design have made high resource utilization a key requirement for achieving good performance. For example, while deeper pipelines have led to 3 GHz processors, each new generation of micro-architecture technology comes with increased memory latency and a decrease in relative memory speed. As a result, the processor spends a significant amount of time waiting for the memory system to fetch data. This "memory wall" problem remains a major bottleneck, and consequently the sustained performance of most real-world applications is less than 10% of peak.

Over the years, a number of multithreading techniques have been employed to hide this memory latency. One approach is simultaneous multi-threading (SMT), which exposes more parallelism to the processor by fetching and retiring instructions from multiple instruction streams, thereby increasing processor utilization. SMT requires only some extra hardware instead of replicating the entire core. Price and performance benefits make it a common design choice as, for example, in Intel's Nehalem micro-architecture, where it is called Hyper-Threading (HT).

As is the case with other forms of on-chip parallelism, such as multiple cores and instruction-level parallelism, SMT uses resource sharing to make the parallel implementation economical. With SMT, this sharing has the potential to improve utilization of resources such as the floating-point unit by hiding latency in the memory hierarchy. When one thread is waiting for a load instruction to complete, the core can execute instructions from another thread without stalling.

The purpose of this paper is to measure the impact of HT on processor utilization. We accomplish this by computing processor efficiency and investigating how various shared resources affect the performance of scientific applications in HT mode. Specifically, we present a new metric for processor efficiency to characterize utilization in single-threading (ST) and HT modes for the hex-core Westmere-EP processor used in the SGI Altix ICE 8400EX supercomputer. We also investigate the effect of the memory hierarchy on the performance of scientific applications in both modes. We use four production computational fluid dynamics (CFD) applications—OVERFLOW, USM3D, Cart3D, and NCC—that are used extensively by scientists and engineers at NASA and throughout the aerospace industry.

In order to better understand the performance characteristics of these codes, we collect Performance Monitoring Unit (PMU) data (instructions retired; L2 and L3 cache hits and misses; vector and scalar floating-point operations, etc.) in both ST and HT modes. We analyze the results to understand the factors influencing the performance of the codes in HT mode.

The remainder of this paper is organized as follows. We present background and related work in the next section. Section III discusses HT in the context of the Nehalem micro-architecture and its Westmere-EP processor. In Section IV, we detail the architecture of the platform used in this study—the SGI Altix ICE 8400EX, based on the Westmere-EP processor. Section V discusses the experimental setup, including the hardware performance counters. In Section VI, we describe the benchmarks and applications used in our study. In Section VII, we discuss the metrics used to measure the effectiveness of HT and the utilization of processor resources in both ST and HT modes. Section VIII presents and analyzes the performance results of the four applications.
P = (C_ST – C_HT) / C_ST,

where C_ST and C_HT are UNHALTED_CORE_CYCLES in ST and HT modes, respectively.

From the application point of view, the number of unhalted core cycles is an important metric for code optimization, as it reflects the total execution time. The goal of any code optimization is to minimize stalled cycles by (a) improving code and data locality, (b) reducing branches or using more predictable branching, and (c) using vector instructions and/or faster, more efficient algorithms.

In order to measure the core-level effects of HT, we define a quantity called efficiency that reflects the utilization level of the core's execution units. In particular, we calculate the fraction of available micro-operation slots that are being used to completely execute an instruction. If the fraction is high, the execution units are being kept busy doing useful work.

The total number of available micro-operation slots during an execution on a single core is

S = µ ⋅ C,

where C is the number of cycles executed on the core and µ is the number of micro-operation slots available per cycle, e.g., µ = 4 in the Nehalem micro-architecture. If I is the number of instructions retired by the core during execution, then the efficiency is

E = I / S.

The theoretical maximum value of E is unity and reflects the case where each micro-operation slot is being used to retire an instruction. In practice, however, some instructions will result in multiple micro-operations being issued. In addition, there will often be empty slots because values needed for a micro-operation are not yet available. Thus, typical efficiencies will be less than one.

With our experimental setup, we can use the PMU counters, which are per-thread counters, to calculate efficiency during a single-threaded run as:

E_ST = INSTRUCTIONS_RETIRED / (4 ⋅ UNHALTED_CORE_CYCLES),

and during a run in HT mode as:

E_HT = INSTRUCTIONS_RETIRED / (2 ⋅ UNHALTED_CORE_CYCLES),

because the instructions retired counter value reflects the average across all threads in the computation.

VIII. RESULTS

In this section we present results for performance gain and efficiency, and then explain those results in terms of vectorization and memory hierarchy effects.

A. Efficiency and Performance Gain

To begin our analysis of hyper-threading, we examine four metrics for the four applications in the study. Each of the graphs in this section shows four plots:

• HT efficiency (E_HT as defined in the last section),
• ST efficiency (E_ST),
• efficiency difference (E_HT – E_ST), labeled "Efficiency difference (HT–ST)", and
• performance gain (P).

A worked sketch of how these quantities are computed from the raw counters follows the NCC discussion below.

1) NCC

Figure 3 shows the plots for NCC. Efficiency in HT mode is always higher than with ST. Efficiency increases from 35.3% to 40.5% and from 39.3% to 44.3% in ST and HT modes, respectively, across the core counts. The difference between the HT and ST efficiency increases from 4% at 24 cores to 5.3% at 96 cores and then decreases to 3.8% at 384 cores. The HT efficiency correlates with the performance gain, which increases with larger core count because data starts fitting in L1 data cache. NCC shows super-linear scaling in ST mode and enhanced super-linear scaling in HT mode, as data for both threads fits into L2 cache.
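To make the metric definitions above concrete, the following Python sketch computes E_ST, E_HT, and P from per-thread PMU counter values. All counter numbers are hypothetical and serve only to illustrate the arithmetic; they are not measurements from this study.

# Sketch: computing the Section VII metrics from per-thread PMU counters.
# All counter values below are hypothetical, for illustration only.

MICRO_OP_SLOTS_PER_CYCLE = 4  # µ = 4 on the Nehalem micro-architecture

def efficiency(instructions_retired, unhalted_core_cycles, threads_per_core=1):
    """E = I / S with S = µ·C. In HT mode the instructions-retired value
    is the average across the two hardware threads, so the per-core total
    is threads_per_core times larger."""
    slots = MICRO_OP_SLOTS_PER_CYCLE * unhalted_core_cycles
    return threads_per_core * instructions_retired / slots

def performance_gain(cycles_st, cycles_ht):
    """P = (C_ST - C_HT) / C_ST."""
    return (cycles_st - cycles_ht) / cycles_st

# Hypothetical counter readings for one application run:
e_st = efficiency(instructions_retired=1.41e12, unhalted_core_cycles=1.0e12)
e_ht = efficiency(instructions_retired=0.72e12, unhalted_core_cycles=0.90e12,
                  threads_per_core=2)
p = performance_gain(cycles_st=1.0e12, cycles_ht=0.90e12)

print(f"E_ST = {e_st:.3f}, E_HT = {e_ht:.3f}, gain P = {p:.1%}")

Multiplying the averaged instructions-retired value by the thread count recovers the per-core total, which is what the slot-based definition of E requires.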
3) Cart3D
Figure 5 shows the graphs for Cart3D. Again, efficiency
in HT mode is always higher than in ST mode. Performance
gain is 14%, 7%, 22%, and 15% for 24, 48, 96, and 192
cores, respectively. There is excellent anti-correlation
between HT efficiency and performance gain.
The main benefit of HT comes from the ability of execution units in the core, such as the floating-point unit (FPU), to handle instructions from more than one thread simultaneously. The FPU is a shared resource that is unaware of the multiple threads. From its perspective, it is merely handling a stream of instructions organized in a pipeline of the six execution units—during each cycle, it can start executing the micro-operation in the next stage. Note that this will often lead to gaps (as shown earlier in Figure 1) where there is no micro-operation to execute. This could be, for example, due to a wait for a load instruction to complete. With HT, such gaps in the FPU's pipeline can be filled with micro-operations from a second thread—thus making for better utilization of the FPU.

Pitted against any potential benefit due to HT is the additional cost of executing with multiple threads. There is almost certainly going to be a time penalty due to increased contention in the memory hierarchy. The bottom line is that we will only see an overall benefit from HT if the time saved by utilizing the idle resources in the pipeline is greater than the extra time needed due to memory hierarchy contention. With a high level of vectorization, the number of execution gaps is very small and there is possibly insufficient opportunity to make up any penalty due to increased contention in HT. With a low level of vectorization there is potential for benefit. Thus, the level of increased memory contention, relative to the level of vectorization, determines whether HT yields a net benefit.

Difference of % data from XX = (% data from XX in ST mode) – (% data from XX in HT mode),

where XX = L2, L3, or MM.

In the above formulas, L2H (L2 cache hit), L2M (L2 cache miss), L3R (L3 cache reference), and L3M (L3 cache miss) correspond to the following measured counter data, respectively: L2_RQSTS:LD_HIT, L2_RQSTS:LD_MISS, LLC_REFERENCES, and LLC_MISSES. We calculate L2R (L2 cache reference) as L2H + L2M and L3H (L3 cache hit) as L3R – L3M. We assume that all L3 cache misses are satisfied from main memory.

We use these graphs to explain how using HT impacts the four applications. For each application, the first graph allows us to identify the primary source of the data. If the second graph shows that the difference in the ST and HT percentages for the primary data source is high, it implies that in HT mode the two threads have to go to the next level of the memory hierarchy more often, thus incurring extra latency costs. Our proposition is that a low value for this difference should result in a performance gain for HT mode. Our overall proposition is that a code benefits from HT if the primary source of data can accommodate the requests in HT mode as well.
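The following Python sketch shows how the data-source percentages and the ST–HT differences can be derived from the four measured counters named above. The counter values are hypothetical placeholders, and the normalization over L2 references is our assumption, since the paper's percentage formulas are only partially reproduced here.

# Sketch: deriving data-source percentages from the measured counters
# L2_RQSTS:LD_HIT (L2H), L2_RQSTS:LD_MISS (L2M),
# LLC_REFERENCES (L3R), LLC_MISSES (L3M).
# Counter values are hypothetical, for illustration only.

def data_source_percentages(l2h, l2m, l3r, l3m):
    l2r = l2h + l2m  # L2 references, as defined in the text
    l3h = l3r - l3m  # L3 hits, as defined in the text
    # Assumption (ours): percentages are taken over L2R; with L3R ≈ L2M
    # this decomposes the traffic into L2H + L3H + L3M, and every L3 miss
    # is assumed to be served from main memory (MM).
    return {"L2": 100.0 * l2h / l2r,
            "L3": 100.0 * l3h / l2r,
            "MM": 100.0 * l3m / l2r}

# Hypothetical ST- and HT-mode counter readings:
st = data_source_percentages(l2h=7.0e9, l2m=3.0e9, l3r=3.0e9, l3m=2.0e9)
ht = data_source_percentages(l2h=6.0e9, l2m=4.0e9, l3r=4.0e9, l3m=2.5e9)

# Difference of % data from XX = (% in ST mode) - (% in HT mode):
diff = {xx: st[xx] - ht[xx] for xx in ("L2", "L3", "MM")}
print("ST:", st, "HT:", ht, "Difference (ST - HT):", diff)

A large positive difference for the primary data source signals that, under HT, requests are spilling to the next level of the hierarchy, which is exactly the condition the propositions above associate with lost performance.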
1) NCC

Figure 9 shows the percentage of data from each source for NCC in ST mode. The amount of data coming from L2 cache, L3 cache, and main memory is 69–79%, 7–13%, and 15–20%, respectively. The percentage of data coming from L2 cache decreases with increasing number of cores because a larger portion of the process's data fits into L1 cache.

Figure 10. Performance gain and percentage difference between ST and HT for NCC.

2) Cart3D

Figure 11 shows the percentage of data from each source for Cart3D. In ST mode, 61% to 65%, 5% to 8%, and 27% to 34% of the data comes from L2 cache, L3 cache, and main memory, respectively. Hence, the primary source of data is L2 cache.

3) USM3D

Figure 13 shows that main memory (MM) is the source for almost 80% of the data for USM3D across the whole range of cores tested. Note that USM3D is an unstructured code with tetrahedral meshes involving indirect addressing. It usually cannot reuse the data from L2 or L3 cache and thus has to fetch data from main memory. Thus, the various cache levels do not seem to play any significant role for this application.
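To illustrate why indirect addressing defeats cache reuse, the following Python sketch contrasts a unit-stride (structured) sweep with the gather pattern typical of unstructured tetrahedral-mesh codes such as USM3D. The array names and sizes are hypothetical and are not taken from USM3D itself; the point is the access pattern, not the absolute timing.

# Sketch: structured (unit-stride) access vs. the indirect (gather) access
# pattern of unstructured-mesh codes. Names and sizes are hypothetical.
import random

N = 1_000_000
values = [float(i) for i in range(N)]

# Structured sweep: consecutive elements, so each cache line brought in
# from memory is fully reused by the following iterations.
total_structured = 0.0
for i in range(N):
    total_structured += values[i]

# Unstructured sweep: an index list (standing in for a tetrahedral
# connectivity list) scatters accesses across the array, so most loads
# touch a cache line that is no longer resident and must be refetched.
index = list(range(N))
random.shuffle(index)
total_gather = 0.0
for i in range(N):
    total_gather += values[index[i]]

With the gather pattern, consecutive iterations touch widely separated cache lines, so in HT mode both hardware threads end up streaming from main memory, consistent with the USM3D behavior described above.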
Figure 15. Percentage of data from each source for OVERFLOW.