
The Impact of Hyper-Threading on Processor Resource Utilization in Production Applications


Subhash Saini, Haoqiang Jin, Robert Hood, David Barker, Piyush Mehrotra and Rupak Biswas
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035-1000, USA
{subhash.saini, haoqiang.jin, robert.hood, david.p.barker, piyush.mehrotra, rupak.biswas}@nasa.gov

Abstract—Intel provides Hyper-Threading (HT) in processors based on its Pentium and Nehalem micro-architectures, such as the Westmere-EP. HT enables two threads to execute on each core in order to hide latencies related to data access. These two threads can execute simultaneously, filling unused stages in the functional unit pipelines. To aid better understanding of HT-related issues, we collect Performance Monitoring Unit (PMU) data (instructions retired; unhalted core cycles; L2 and L3 cache hits and misses; vector and scalar floating-point operations, etc.). We then use the PMU data to calculate a new metric of efficiency in order to quantify processor resource utilization and make comparisons of that utilization between single-threading (ST) and HT modes. We also study performance gain using unhalted core cycles, code efficiency of using vector units of the processor, and the impact of HT mode on various shared resources like L2 and L3 cache. Results using four full-scale, production-quality scientific applications from computational fluid dynamics (CFD) used by NASA scientists indicate that HT generally improves processor resource utilization efficiency, but does not necessarily translate into overall application performance gain.

Keywords: Simultaneous Multi-Threading (SMT), Hyper-Threading (HT), Intel's Nehalem micro-architecture, Intel Westmere-EP, Computational Fluid Dynamics (CFD), SGI Altix ICE 8400EX, Performance Tools, Benchmarking, Performance Evaluation

I. INTRODUCTION

Current trends in microprocessor design have made high resource utilization a key requirement for achieving good performance. For example, while deeper pipelines have led to 3 GHz processors, each new generation of micro-architecture technology comes with increased memory latency and a decrease in relative memory speed. This results in the processor spending a significant amount of time waiting for the memory system to fetch data. This "memory wall" problem continues to remain a major bottleneck and, as a result, sustained performance of most real-world applications is less than 10% of peak.

Over the years, a number of multithreading techniques have been employed to hide this memory latency. One approach is simultaneous multi-threading (SMT), which exposes more parallelism to the processor by fetching and retiring instructions from multiple instruction streams, thereby increasing processor utilization. SMT requires only some extra hardware instead of replicating the entire core. Price and performance benefits make it a common design choice as, for example, in Intel's Nehalem micro-architecture, where it is called Hyper-Threading (HT).

As is the case with other forms of on-chip parallelism, such as multiple cores and instruction-level parallelism, SMT uses resource sharing to make the parallel implementation economical. With SMT, this sharing has the potential for improving utilization of resources such as that of the floating-point unit through the hiding of latency in the memory hierarchy. When one thread is waiting for a load instruction to complete, the core can execute instructions from another thread without stalling.

The purpose of this paper is to measure the impact of HT on processor utilization. We accomplish this by computing processor efficiency and investigating how various shared resources affect the performance of scientific applications in HT mode. Specifically, we present a new metric for processor efficiency to characterize its utilization in single-threading (ST) and HT modes for the hex-core Westmere-EP processor used in the SGI Altix ICE 8400EX supercomputer. We also investigate the effect of the memory hierarchy on the performance of scientific applications in both modes. We use four production computational fluid dynamics (CFD) applications—OVERFLOW, USM3D, Cart3D, and NCC—that are used extensively by scientists and engineers at NASA and throughout the aerospace industry.

In order to better understand the performance characteristics of these codes, we collect Performance Monitoring Unit (PMU) data (instructions retired; L2 and L3 cache hits and misses; vector and scalar floating-point operations, etc.) in both ST and HT modes. We analyze the results to understand the factors influencing the performance of the codes in HT mode.

The remainder of this paper is organized as follows. We present background and related work in the next section. Section III discusses HT in the context of the Nehalem micro-architecture and its Westmere-EP processor. In Section IV, we detail the architecture of the platform used in this study—the SGI Altix ICE 8400EX, based on the Westmere-EP processor. Section V discusses the experimental setup, including the hardware performance counters. In Section VI, we describe the benchmarks and applications used in our study. In Section VII, we discuss metrics used to measure the effectiveness of HT and the utilization of processor resources in both ST and HT modes. Section VIII presents and analyzes the performance results of our experiments. We discuss other factors that influenced the results of this study in Section IX, and end with some conclusions from this work in Section X.
II. BACKGROUND AND RELATED WORK

Intel introduced SMT, called Hyper-Threading (HT), into its product line in 2002 with new models of its Pentium 4 processors [1-3]. The advantage of HT is its ability to better utilize processor resources and to hide memory latency. There have been a few efforts studying the effectiveness of HT on application performance [4-6]. Boisseau et al. conducted a performance evaluation of HT on a Dell 2650 dual-processor server based on the Pentium 4 using matrix-matrix multiplication and a 256-particle molecular dynamics benchmark written in OpenMP [4]. Huang et al. characterized the performance of Java applications on Pentium 4 processors with HT [5]. Blackburn et al. studied the performance of garbage collection in HT mode by using some of the Pentium 4 performance counters [6]. A key finding of these investigations was that the Pentium 4's implementation of HT was not very advantageous, as the processor had very limited memory bandwidth and issued only two instructions per cycle.

Recently, HT was extended to processors that use Intel's Nehalem micro-architecture [7]. In these processors, memory bandwidth was enhanced significantly by overcoming the front-side bus memory bandwidth bottleneck and by increasing instruction issuance from two to four per cycle. Saini et al. conducted a performance evaluation of HT on small numbers of Nehalem nodes using NPB [8]. Results showed that for one node, HT provided a slight advantage only for LU. BT, SP, MG, and LU achieved the greatest benefit from HT at 4 nodes: factors of 1.54, 1.43, 1.14, and 1.14, respectively, while FT did not achieve any benefit independent of the number of nodes. Later, Saini et al. extended their work on HT to measure the relative efficiency E of the processor in terms of cycles per instruction using the formula

E = 100 ⋅ (2 ⋅ CPI_ST / CPI_HT) – 100,

where CPI_ST and CPI_HT are the cycles per instruction in ST and HT modes, respectively [9].
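For concreteness, the following small function (our illustration, not code from [9]) evaluates this relative efficiency. Since CPI_HT is measured per thread and two threads share the core, E is the percentage change in aggregate instruction throughput; for example, CPI_ST = 1.2 and CPI_HT = 2.0 give E = 20, i.e., a 20% gain from HT:

    /* Relative efficiency E = 100*(2*CPI_ST/CPI_HT) - 100 from [9].
       cpi_st and cpi_ht are the measured cycles per instruction in ST and
       HT modes; the factor of 2 accounts for the two hardware threads.
       Example: relative_efficiency(1.2, 2.0) returns 20.0 (percent). */
    double relative_efficiency(double cpi_st, double cpi_ht)
    {
        return 100.0 * (2.0 * cpi_st / cpi_ht) - 100.0;
    }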
In this study we focus on the Westmere-EP Xeon processor, which is based on the Nehalem micro-architecture.

The contributions of this paper are as follows:

• We present efficiency, a new performance metric in terms of instructions per cycle, to quantify the utilization of the processor, by collecting PMU data in both ST and HT modes using a range of core counts.
• We analyze the PMU data to identify the factors that influence the performance of the codes, in particular focusing on the impact of shared resources, such as execution units and the memory hierarchy, when executing in HT mode.

III. HYPER-THREADING IN NEHALEM MICRO-ARCHITECTURE

Hyper-Threading (HT) allows instructions from multiple threads to run on the same core. When one thread stalls, a second thread is allowed to proceed. To support HT, the Nehalem micro-architecture has several advantages over the Pentium 4. First, the newer design has much more memory bandwidth and larger caches, giving it the ability to get data to the core faster. Second, Nehalem is a much wider architecture than the Pentium 4. It supports two threads per core, presenting the abstraction of two independent logical cores. The physical core contains a mixture of resources, some of which are shared between threads [2]:

• replicated resources for each thread, such as register state, the return stack buffer (RSB), and the instruction queue;
• partitioned resources tagged by the thread number, such as the load buffer, store buffer, and reorder buffer;
• shared resources, such as the L1, L2, and L3 caches; and
• shared resources unaware of the presence of threads, such as the execution units.

The RSB is an improved branch target prediction mechanism. Each thread has a dedicated RSB to avoid any cross-contamination. Such replicated resources should not have an impact on HT performance. Partitioned resources are statically allocated between the threads and reduce the resources available to each thread; however, there is no competition for these resources. On the other hand, the two threads do compete for shared resources, and the performance depends on the dynamic behavior of the threads. Some of the shared resources are unaware of HT. For example, the scheduling of instructions to execution units is independent of threads, but there are limits on the number of instructions from each thread that can be queued.

Figure 1 is a schematic description of HT for the Nehalem micro-architecture. In the diagram, the rows depict each of the Westmere-EP processor's six execution units—two floating-point units (FP0 and FP1), one load unit (LD0), one store unit (ST0), one load address unit (LA0), and one branch unit (BR0). It is a sixteen-stage pipeline. Each box represents a single micro-operation running on an execution unit.

Figure 1(a) shows the ST mode (no HT) in a core, where the core is executing only one thread (Thread 0, shown in green) and white space denotes unfilled stages in the pipeline. The peak execution bandwidth of the Nehalem micro-architecture is four micro-operations per cycle. Often ST does not utilize the execution units optimally and operates at less than peak bandwidth, as indicated by the large number of idle (white) execution units.

Figure 1. Hyper-Threading on the sixteen-stage pipeline Nehalem architecture with six execution units.
Figure 1(b) shows the HT feature in one of the processor cores. This core in HT mode executes the micro-operations from both threads (Thread 0 and Thread 1, shown in green and blue, respectively). This arrangement can operate closer to peak bandwidth, as indicated by the smaller number of idle (white) execution units. In HT mode, the processor can utilize execution units more efficiently.

IV. COMPUTING PLATFORM

This study was conducted using NASA's Pleiades supercomputer, an SGI Altix ICE 8400EX system located at NASA Ames Research Center. Pleiades comprises 10,752 nodes interconnected with an InfiniBand (IB) network in a hypercube topology. The nodes are based on three different Intel Xeon processors: Harpertown, Nehalem-EP, and Westmere-EP. In this study, we used the Westmere-EP based nodes [10]. This subset of Pleiades is interconnected via 4X Quad Data Rate (QDR) IB switches. As shown in Figure 2, the Westmere-EP based nodes have two Xeon X5670 processors, each with six cores. Each processor is clocked at 2.93 GHz, with a peak performance of 70.32 Gflop/s. The total peak performance of the node is therefore 140.64 Gflop/s.

Each Westmere-EP processor has two parts: "core" and "uncore". The core part consists of six cores with per-core L1 and L2 caches. The uncore part has a shared L3 cache, an integrated memory controller, and QuickPath Interconnect (QPI). Each core has 64 KB of L1 cache (32 KB data and 32 KB instruction) and 256 KB of L2 cache. All six cores share 12 MB of L3 cache. The on-chip memory controller supports three DDR3 channels running at 1333 MHz, with a peak memory bandwidth per socket of 32 GB/s (and twice that per node). Each processor has two QPI links: one connects the two processors of a node to form a non-uniform-memory-access (NUMA) architecture, while the other connects to the I/O hub. Each QPI link runs at 6.4 GT/s ("T" for transactions), at which rate 2 bytes can be transferred in each direction, for an aggregate of 25.6 GB/s. HT was enabled on each processor for our experiments. Pleiades utilizes SUSE Linux Enterprise Server (SLES) based on the 2.6.32 Linux kernel and SGI overlays as its operating system and has a Lustre file system for I/O.

Figure 2. Configuration of an Intel Westmere-EP node.

V. EXPERIMENTAL SETUP AND COUNTERS

In this section we give a brief description of the experimental setup for collecting and analyzing the data based on the hardware performance counters. We also describe the performance counters used in our study.

A. Experimental Setup

In this work, we used the SGI Message Passing Toolkit (MPT) version 1.25 and the Intel compiler version 11.1 [12]. We used op_scope, a tool developed by Supersmith, to collect low-level performance data, e.g., floating-point operations, instruction counts, clock cycles, cache misses/hits, etc. [18]. The tool relies on the Performance Application Programming Interface (PAPI) [13] to access hardware performance counters. In the present study, op_scope was built with PAPI version 4.1.0.
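Although our measurements were gathered with op_scope rather than by instrumenting the applications directly, the following C sketch shows the kind of PAPI calls involved in reading two of the native events listed later in Table I. The instrumented region and the minimal error handling are illustrative, and the native event names must be available on the platform:

    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        int evset = PAPI_NULL;
        int code;
        long long counts[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        PAPI_create_eventset(&evset);

        /* Translate native event names (see Table I) into event codes. */
        PAPI_event_name_to_code("INSTRUCTIONS_RETIRED", &code);
        PAPI_add_event(evset, code);
        PAPI_event_name_to_code("UNHALTED_CORE_CYCLES", &code);
        PAPI_add_event(evset, code);

        PAPI_start(evset);
        /* ... region of interest, e.g., the solver's iteration loop ... */
        PAPI_stop(evset, counts);

        printf("instructions retired: %lld, unhalted cycles: %lld\n",
               counts[0], counts[1]);
        return 0;
    }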
the Westmere-EP based nodes have two Xeon X5670 The experiments were performed using the same number
processors, each with six cores. Each processor is clocked at of physical resources for both ST and HT modes; that is, for
2.93 GHz, with a peak performance of 70.32 Gflop/s. The a given number of physical cores, say n, we used n MPI
total peak performance of the node is therefore 140.64 processes in ST mode but doubled it to 2n MPI processes in
Gflop/s. HT mode. The main reason for keeping the number of cores
Each Westmere-EP processor has two parts: “core” and used constant while toggling HT was to approximate the
“uncore”. The core part consists of six cores with per-core situation faced by users: whether or not they should use HT
L1 and L2 caches. The uncore part has a shared L3 cache, an when running an application on a given set of resources.
integrated memory controller, and QuickPath Interconnect Note that this approach raises two issues for the
(QPI). Each core has 64 KB of L1 cache (32 KB data and 32 remainder of the analysis. First, we are making an implicit
KB instruction) and 256 KB of L2 cache. All six cores share assumption that the codes scale perfectly, i.e., there is no
12 MB of L3 cache. The on-chip memory controller supports performance loss or gain in going from n MPI processes (in
three DDR3 channels running at 1333 MHz, with a peak- ST mode) to 2n MPI processes (in HT mode). Second, this
memory bandwidth per socket of 32 GB/s (and twice that per approach also changes the amount of work done per MPI
node). Each processor has two QPI links: one connects the process and the overall communication pattern, which we are
two processors of a node to form a non-uniform-memory not considering. With ST, there are a maximum of 12 MPI
access (NUMA) architecture, while the other connects to the processes communicating with other nodes while the number
I/O hub. Each QPI link runs at 6.4 GT/s (“T” for doubles in HT mode, thereby potentially creating a
transactions), at which rate 2 bytes can be transferred in each bottleneck at the Host Channel Adaptor (HCA), a physical
direction, for an aggregate of 25.6 GB/s. HT was enabled on network card that connects a node to the IB network fabric.
each processor for our experiments. Pleiades utilizes SUSE A similar communication bottleneck can also occur at the
Linux Enterprise Server (SLES) based on the 2.6.32 Linux inter IRU links. This effect is more pronounced at a higher
number of cores especially for MPI collectives such as
MPI_Allreduce and MPI_Bcast. We will not be able to fully VI. APPLICATIONS USED IN THE STUDY
address the network-dependent effects in this study since we Here is a brief overview of the codes that we used in this
did not gather any network data for MPI communication. We study.
are currently examining mechanisms for differentiating
hardware counter data from inside and outside Cart3D is an unstructured high fidelity, inviscid CFD
communication routines. application that solves the Euler equations of fluid dynamics
[14]. It includes a solver called Flowcart, which uses a
A tool called dplace from SGI was used to bind a related second-order, cell-centered, finite-volume upwind spatial
set of processes to specific cores to prevent process discretization scheme, in conjunction with a multi-grid
migration. In addition, if only a part of the node was used in accelerated Runge-Kutta method for steady-state cases. We
a run, it was ensured that both ST and HT modes used the used the geometry of the Space Shuttle Launch Vehicle
same set of cores. Also, in order to reduce the impact of the (SSLV) for the simulations in this work. The SSLV uses 24
initialization phase of an application (reading the input data, million cells for computation, and the input dataset is 1.8
setting up the computational grid, etc.) on the results, each GB. The application (in this case, the MPI version) requires
case was run twice—once for the first iteration only and 16 GB of memory to run.
another for all iterations. Results from the first run were then
subtracted from the second run for both timing and hardware OVERFLOW is a general-purpose Navier-Stokes solver
counter data. for CFD problems [15]. The Fortran90 MPI version has
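The subtraction step amounts to a per-counter difference; here is a minimal sketch (names are ours), assuming each run yields one value per counter:

    /* The first run covers initialization plus one iteration; the second
       covers initialization plus all iterations. Their difference isolates
       the iterations beyond the first, removing initialization effects.
       Applied identically to timings and to each hardware counter. */
    void subtract_first_run(const long long *run_first_iter,
                            const long long *run_all_iters,
                            long long *iterations_only, int ncounters)
    {
        for (int i = 0; i < ncounters; i++)
            iterations_only[i] = run_all_iters[i] - run_first_iter[i];
    }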
B. Westmere-EP Performance Counter Events

Hardware counter data was collected from the Performance Monitoring Unit (PMU) of the Westmere-EP processor. The PMU provides seven counters per core: three fixed and four general-purpose [10]. PAPI users can access 117 native events. We narrowed these 117 down to 8 events that were appropriate for the present study; Table I shows the names and descriptions of the events we used. The counter data presented in the later sections is an average of the values collected for all the MPI processes.

TABLE I. Intel Westmere-EP events.

Name                               Description
UNHALTED_CORE_CYCLES               Clock cycles when not halted
INSTRUCTIONS_RETIRED               Number of instructions retired
FP_COMP_OPS_EXE:SSE_FP_PACKED      Number of packed FP uops executed
FP_COMP_OPS_EXE:SSE_FP_SCALAR      Number of scalar FP uops executed
L2_RQSTS:LD_HIT                    Number of loads that hit the L2 cache
L2_RQSTS:LD_MISS                   Number of loads that miss the L2 cache
LLC_REFERENCES                     Last-level cache demand requests from this core
LLC_MISSES                         Last-level cache demand requests from this core that missed the LLC

VI. APPLICATIONS USED IN THE STUDY

Here is a brief overview of the codes that we used in this study.

Cart3D is an unstructured, high-fidelity, inviscid CFD application that solves the Euler equations of fluid dynamics [14]. It includes a solver called Flowcart, which uses a second-order, cell-centered, finite-volume upwind spatial discretization scheme, in conjunction with a multi-grid accelerated Runge-Kutta method for steady-state cases. We used the geometry of the Space Shuttle Launch Vehicle (SSLV) for the simulations in this work. The SSLV uses 24 million cells for computation, and the input dataset is 1.8 GB. The application (in this case, the MPI version) requires 16 GB of memory to run.

OVERFLOW is a general-purpose Navier-Stokes solver for CFD problems [15]. The Fortran90 MPI version has 130,000 lines of code. The application uses an overset grid methodology to perform high-fidelity viscous simulations around realistic aerospace configurations. The main computational logic of the sequential code consists of a time loop and a nested grid loop. The code also uses finite differences in space with implicit time stepping, and overset structured grids to accommodate arbitrarily complex moving geometries. The dataset used here is a wing-body-nacelle-pylon geometry (DLRF6), with 23 zones and 36 million grid points. The input dataset is 1.6 GB in size, and the solution file is 2 GB.

NCC, the National Combustion Code, is an unstructured-grid Navier-Stokes CFD application used to develop new physical models for turbulence, chemistry, spray, and turbulence-chemistry, as well as turbulence-spray interactions [16]. It employs a cell-centered finite-volume spatial discretization and pseudo-time preconditioning. An explicit four-stage Runge-Kutta scheme is used to advance the solution in pseudo-time for steady-state or time-accurate simulations. Domain decomposition to divide the total computational domain into spatial zones is performed using the METIS partitioner. Each zone is solved on a separate core, and MPI is used for inter-core communication. The test case is an H2C4 fuel injector geometry consisting of seven individual injectors in a radial array, each with four gaseous hydrogen injection ports. A 3.49 million element tetrahedral grid is used to model the injector and the cylindrical duct used in the experiment. Each NCC run consisted of 350 pseudo-time iterations.

USM3D is a 3-D unstructured tetrahedral, cell-centered, finite-volume Euler and Navier-Stokes flow solver [17]. Spatial discretization is accomplished using an analytical reconstruction process for computing solution gradients within tetrahedral cells. The solution is advanced in time to a steady-state condition by an implicit Euler time-stepping scheme. A single-block, tetrahedral, unstructured grid is partitioned into a user-specified number of contiguous partitions, each containing nearly the same number of grid cells. Grid partitioning is again accomplished using METIS.
The test case used 10 million tetrahedral elements, requiring about 16 GB of memory and 10 GB of disk space.

VII. PERFORMANCE METRICS

From the user's point of view, the impact of HT is typically measured by calculating the relative speedup attained, e.g., the code sped up by x% using HT. Using PMU counters, we calculate this performance gain as:

P = (C_ST – C_HT) / C_ST,

where C_ST and C_HT are UNHALTED_CORE_CYCLES in ST and HT modes, respectively.

From the application point of view, the number of unhalted core cycles is an important metric for code optimization, as it reflects the total execution time. The goal of any code optimization is to minimize cycles by (a) reducing stalls through improved code and data locality, (b) minimizing branches or using more predictable branching, and (c) using vector instructions and/or faster and more efficient algorithms.

In order to measure the core-level effects of HT, we define a quantity called efficiency that reflects the utilization level of the core's execution units. In particular, we calculate the fraction of available micro-operation slots that are being used to completely execute an instruction. If the fraction is high, the execution units are being kept busy doing useful work.

The total number of available micro-operation slots during an execution on a single core is

S = µ ⋅ C,

where C is the number of cycles executed on the core and µ is the number of micro-operation slots available per cycle, e.g., µ = 4 in the Nehalem micro-architecture. If I is the number of instructions retired by the core during execution, then the efficiency is

E = I / S.

The theoretical maximum value of E is unity and reflects the case where each micro-operation slot is being used to retire an instruction. In practice, however, some instructions will result in multiple micro-operations being issued. In addition, there will often be empty slots because values needed for a micro-operation are not yet available. Thus, typical efficiencies will be less than one.

With our experimental setup, we can use the PMU counters, which are per-thread counters, to calculate efficiency during a single-threaded run as:

E_ST = INSTRUCTIONS_RETIRED / (4 ⋅ UNHALTED_CORE_CYCLES).

When running with HT, we note that since the core is retiring instructions from two threads, we need to add the per-thread INSTRUCTIONS_RETIRED hardware counter for each thread. The two UNHALTED_CORE_CYCLES counters, on the other hand, are usually both incremented for each cycle, as the core is halted relatively infrequently. Thus, either counter can be used to reflect the number of cycles executed on the core. We calculate the efficiency for the whole core in HT mode as:

E_HT = (2 ⋅ INSTRUCTIONS_RETIRED) / (4 ⋅ UNHALTED_CORE_CYCLES),

because the instructions-retired counter value reflects the average across all threads in the computation.
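In terms of the Table I counters, the three metrics above reduce to a few lines of arithmetic; the function names in this sketch are ours:

    /* Performance gain P = (C_ST - C_HT) / C_ST, where the arguments are
       UNHALTED_CORE_CYCLES measured in ST and HT modes. */
    double perf_gain(long long cycles_st, long long cycles_ht)
    {
        return (double)(cycles_st - cycles_ht) / (double)cycles_st;
    }

    /* E_ST = I / (4*C): one thread's retired instructions over the four
       micro-operation slots available per cycle. */
    double efficiency_st(long long instr_retired, long long cycles)
    {
        return (double)instr_retired / (4.0 * (double)cycles);
    }

    /* E_HT = 2*I / (4*C): instr_retired is the per-thread average, so the
       factor of 2 restores the whole-core total for the two threads. */
    double efficiency_ht(long long instr_retired, long long cycles)
    {
        return 2.0 * (double)instr_retired / (4.0 * (double)cycles);
    }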
VIII. RESULTS

In this section we present results for performance gain and efficiency, and then explain those results in terms of vectorization and memory hierarchy effects.

A. Efficiency and Performance Gain

To begin our analysis of hyper-threading, we examine four metrics for the four applications in the study. Each of the graphs in this section shows four plots:

• HT efficiency (E_HT as defined in the last section),
• ST efficiency (E_ST),
• the efficiency difference (E_HT – E_ST), which is labeled "Efficiency difference (HT–ST)", and
• performance gain (P).

1) NCC

Figure 3 shows the plots for NCC. Efficiency in HT mode is always higher than with ST. Efficiency increases from 35.3% to 40.5% and from 39.3% to 44.3% in ST and HT modes, respectively, across the core counts. The difference between the HT and ST efficiency increases from 4% at 24 cores to 5.3% at 96 cores and then decreases to 3.8% at 384 cores. The HT efficiency correlates with the performance gain, which increases with larger core count because data starts fitting in the L1 data cache. NCC shows super-linear scaling in ST mode and has enhanced super-linear scaling in HT mode, as the data for both threads fits into L2 cache.

Figure 3. Efficiency and percentage performance gain for NCC.

2) USM3D

Figure 4 presents the data for USM3D. We do not show results for 256 cores since we do not have a grid for 512 processes. (That data point would require the HT run to use 512 processes on 256 cores.) As with NCC, efficiency in HT mode is always higher than with ST. Efficiency in both ST and HT modes decreases from 32 to 64 cores and then increases to 128 cores. The difference between the HT and ST efficiency decreases from 1.9% at 32 cores to 0.9% at 64 cores and then increases to 3.3% at 128 cores. Performance gain is 11% and remains almost constant.

Figure 4. Efficiency and performance gain for USM3D.

3) Cart3D

Figure 5 shows the graphs for Cart3D. Again, efficiency in HT mode is always higher than in ST mode. Performance gain is 14%, 7%, 22%, and 15% for 24, 48, 96, and 192 cores, respectively. There is excellent anti-correlation between HT efficiency and performance gain.

Figure 5. Efficiency and percentage performance gain for Cart3D.

4) OVERFLOW

The results for OVERFLOW are plotted in Figure 6. As with the other applications, efficiency in HT mode is always higher than in ST mode. Efficiency in ST and HT modes, as well as the difference between the two, increases as the number of cores increases. The reason for this improvement is that more data fits into L2 cache with increasing core count. There is a good correlation between efficiency and performance gain, but efficiency does not explain the magnitude of the performance gain.

Figure 6. Efficiency and percentage performance gain for OVERFLOW.

Figure 7 recaps the performance gain in HT mode for the four applications—NCC, USM3D, OVERFLOW, and Cart3D. OVERFLOW is the only application that does not benefit from HT; the other three applications do.

Figure 7. Performance gain in HT mode for applications.

B. Effect of Code Vectorization

In our observations, one of the factors influencing the performance of HT is the degree of vectorization in the application code. We compute the percentage of vectorization from two different hardware counters (a sketch follows the list):

• FP_COMP_OPS_EXE:SSE_FP_PACKED, which gives the number of vector (packed) micro-operations executed, and
• FP_COMP_OPS_EXE:SSE_FP_SCALAR, which gives the number of scalar micro-operations executed.
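The text does not spell out the formula, but given these two counters, the percentage plotted in Figure 8 is presumably the packed share of all floating-point micro-operations, as in this sketch:

    /* Vectorization percentage from the two counters above: packed uops
       as a share of all (packed + scalar) FP uops. */
    double vectorization_pct(long long packed_uops, long long scalar_uops)
    {
        return 100.0 * (double)packed_uops
               / (double)(packed_uops + scalar_uops);
    }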
Figure 8 shows the percentage of vector instructions (over all floating-point instructions) for the codes used in the study. For each code, this percentage was fairly consistent (within 1%) across the range of core counts. In comparing the vectorization percentage with performance gain, we observe that high vectorization correlates with a negative HT impact, as in the case of OVERFLOW shown in Figure 7. However, it does not necessarily follow that at lower percentages of vectorization, there is a correlation between the degree of vectorization and performance gain. In particular, NCC shows the best overall gain, but lies between USM3D and Cart3D in vectorization percentage. In order to understand this behavior, we need to explore the implementation of HT more closely.

Figure 8. Percentage of vectorization of codes.

The main benefit of HT comes from the ability of execution units in the core, such as the floating-point units (FPU), to handle instructions from more than one thread simultaneously. The FPU is a shared resource that is unaware of the multiple threads. From its perspective, it is merely handling a stream of instructions organized in a pipeline of the six execution units—during each cycle, it can start executing the micro-operation in the next stage. Note that this will often lead to gaps (as shown earlier in Figure 1) where there is no micro-operation to execute. This could be, for example, due to a wait for a load instruction to complete. With HT, such gaps in the FPU's pipeline can be filled with micro-operations from a second thread—thus making for better utilization of the FPU.

Pitted against any potential benefit due to HT is the additional cost of executing with multiple threads. There is almost certainly going to be a time penalty due to increased contention in the memory hierarchy. The bottom line is that we will only see an overall benefit from HT if the time saved by utilizing the idle resources in the pipeline is greater than the extra time incurred through memory hierarchy contention. With a high level of vectorization, the number of execution gaps is very small and there is possibly insufficient opportunity to make up any penalty due to increased contention in HT. With a low level of vectorization, there is potential for benefit. Thus, the level of increased memory hierarchy and network contention will determine whether there is any HT benefit.

C. Effect of Memory Hierarchy

In this subsection we focus on analyzing the effect of the memory hierarchy on performance when running in HT mode. If sufficient resources (cache and memory bandwidth) are available, sharing them across multiple threads in HT mode will result in better performance than with ST. However, we expect that if such sharing increases contention between the threads to the extent that data needs to be accessed from the next level of cache for each of the threads, there will, in general, be no performance benefit of running in HT mode. Below, we analyze hardware-counter-based data related to cache and memory accesses in order to identify application characteristics that can provide a performance boost with HT.

We present two kinds of graphs for each application, namely:

1. The percentage of data from each source (L2 cache, L3 cache, and main memory – MM) in ST mode:

% data from L2 = L2H / L2R ⋅ 100,
% data from L3 = L3H / L3R ⋅ L2M / L2R ⋅ 100,
% data from MM = L3M / L3R ⋅ L2M / L2R ⋅ 100.

2. The percentage difference between ST and HT modes (ST–HT) for each data source listed above:

Difference of % data from XX = (% data from XX in ST mode) – (% data from XX in HT mode),

where XX = L2, L3, or MM.

In the above formulas, L2H (L2 cache hits), L2M (L2 cache misses), L3R (L3 cache references), and L3M (L3 cache misses) correspond to the following measured counter data, respectively: L2_RQSTS:LD_HIT, L2_RQSTS:LD_MISS, LLC_REFERENCES, and LLC_MISSES. We calculate L2R (L2 cache references) as L2H + L2M, and L3H (L3 cache hits) as L3R – L3M. We assume all L3 cache misses are satisfied by main memory.
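Translated into the measured counters, these percentages and the ST–HT differences reduce to the following sketch; the struct and function names are ours:

    /* Data-source percentages per the formulas above, using
       L2R = L2H + L2M and L3H = L3R - L3M; L3 misses are assumed to be
       satisfied by main memory (MM). */
    typedef struct {
        long long l2_hit, l2_miss;  /* L2_RQSTS:LD_HIT, L2_RQSTS:LD_MISS */
        long long l3_ref, l3_miss;  /* LLC_REFERENCES, LLC_MISSES */
    } cache_counters;

    void data_source_pct(const cache_counters *c,
                         double *pct_l2, double *pct_l3, double *pct_mm)
    {
        double l2r = (double)(c->l2_hit + c->l2_miss);
        double l3h = (double)(c->l3_ref - c->l3_miss);
        double l2_miss_frac = (double)c->l2_miss / l2r;

        *pct_l2 = 100.0 * (double)c->l2_hit / l2r;
        *pct_l3 = 100.0 * (l3h / (double)c->l3_ref) * l2_miss_frac;
        *pct_mm = 100.0 * ((double)c->l3_miss / (double)c->l3_ref)
                  * l2_miss_frac;
    }

    /* ST-HT difference for one source, e.g., L2: pct_l2_st - pct_l2_ht. */
    double st_ht_diff(double pct_st, double pct_ht)
    {
        return pct_st - pct_ht;
    }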
We use these graphs to explain how using HT impacts the four applications. For each application, the first graph allows us to identify the primary source of the data. If the second graph shows that the difference between the ST and HT percentages for the primary data source is high, it implies that in HT mode the two threads have to go to the next level of the memory hierarchy more often, thus incurring extra latency costs. Our proposition is that a low value for this difference should result in a performance gain for HT mode. Our overall proposition is that a code benefits from HT if the primary source of data can accommodate the requests in HT mode as well.

1) NCC

Figure 9 shows the percentage of data from each source for NCC in ST mode. The amount of data coming from L2 cache, L3 cache, and main memory is 69–79%, 7–13%, and 15–20%, respectively. The percentage of data coming from L2 cache decreases with increasing number of cores because a larger portion of each process's data fits into L1 cache.

Figure 9. Percentage of data from each source for NCC.

Figure 10 shows the percentage difference between ST and HT modes for the three data sources for NCC, along with the performance gain for HT over ST. The percentage difference between ST and HT for NCC from L2 cache, L3 cache, and main memory is 20% to 50%, -44% to -22%, and -6% to 2%, respectively. In particular, the percentage difference for the primary source of data, L2 cache, steadily decreases from 50% to 20% across the range of cores. There is an anti-correlation between the performance gain and the percentage difference between ST and HT for L2 cache. That is, with increasing cores, the L2 cache can better accommodate both threads in HT mode, resulting in more performance gain in this mode.

Figure 10. Performance gain and percentage difference between ST and HT for NCC.

2) Cart3D

Figure 11 shows the percentage of data from each source for Cart3D. In ST mode, 61% to 65%, 5% to 8%, and 27% to 34% of the data comes from L2 cache, L3 cache, and main memory, respectively. Hence, the primary source of data is L2 cache.

Figure 11. Percentage of data from each source for Cart3D.

Figure 12 shows the percentage difference between ST and HT modes for the three data sources for Cart3D, along with the performance gain of HT over ST. As was the case with NCC, there is an anti-correlation between the performance gain and the percentage difference between ST and HT for the primary data source—L2 cache.

Figure 12. Performance gain and percentage difference between ST and HT for Cart3D.

3) USM3D

Figure 13 shows that main memory (MM) is the source of almost 80% of the data for USM3D across the whole range of cores tested. Note that USM3D is an unstructured code with tetrahedral meshes involving indirect addressing. It usually cannot reuse the data from L2 or L3 cache and thus has to fetch data from main memory. Thus, the various cache levels do not seem to play any significant role for this application.

Figure 13. Percentage of data from each source for USM3D.

Figure 14 shows the percentage difference between ST and HT modes for L2 cache, L3 cache, and main memory for USM3D. The differences are small, indicating that, as we expected, even in HT mode most of the data comes from main memory. Since the code is only 20% vectorized and most of the data (80%) comes from main memory, there is an opportunity to hide the memory latency of one thread while the second thread utilizes the floating-point units in HT mode. This results in a better performance gain. As was the case with the two previous applications, there is an anti-correlation between the performance gain and the percentage difference between ST and HT for the primary data source—in this case main memory.

Figure 14. Performance gain and percentage difference between ST and HT for USM3D.

4) OVERFLOW

Figure 15 shows the percentage of data from each source for OVERFLOW in ST mode. For ST, the amount of data coming from L2 cache, L3 cache, and main memory is 44–71%, 18–31%, and 12–25%, respectively. As the number of cores increases, more data fits into L2 cache. As a result, a higher percentage of data comes from it.

Figure 15. Percentage of data from each source for OVERFLOW.

Figure 16 shows the percentage difference between ST and HT modes for L2 cache, L3 cache, and main memory for OVERFLOW. As was seen in the other applications, there is an anti-correlation between performance gain and the percentage difference between ST and HT for the primary data source—in this case L2 cache. Note also that the secondary data source (L3 cache) has a negative difference between ST and HT. This means that HT causes more L3 requests than ST. Since the L3 latency is higher, this degrades the overall performance.

Figure 16. Performance gain and percentage difference between ST and HT for OVERFLOW.

IX. IMPACT OF APPLICATION CHARACTERISTICS

The applications we investigated in this paper fall into two broad classes at the highest level: those utilizing structured meshes (OVERFLOW) and those utilizing unstructured meshes (NCC, Cart3D, and USM3D). The three unstructured applications use a similar strategy for sub-domain partitioning of the grids, except that Cart3D uses unstructured Cartesian grids and load balancing is done via space-filling curves, whereas USM3D and NCC use tetrahedral grids and partitioning is accomplished using METIS. In this study we found that unstructured-grid applications benefit from running in HT mode whereas structured-grid applications do not. In order to understand this performance behavior, we briefly describe the characteristics of these two classes of applications.

Structured applications access adjacent elements of the underlying data structures, and this spatial locality of data allows them to be optimized for cache by the compiler. Such codes also tend to be associated with a high degree of vectorization. The success of vectorization puts increased pressure on the memory hierarchy and can result in stalls that lower our measure of efficiency. Adding a second thread to that core with HT increases the demands on memory and communication resources, and does not result in any performance benefits.

Unstructured applications, on the other hand, involve indirect addressing, and adjacent elements are often not accessed in sequence. Also, the compiler is usually unable to vectorize the codes, which results in sub-optimal utilization of floating-point execution units and gives opportunities for HT to utilize the resources. Thus, we expect hyper-threading to provide a boost in performance as long as there is no significant increase in the contention for the memory hierarchy or communication resources.

X. CONCLUSIONS

In this paper we have studied the effect of hyper-threading on four applications of interest to NASA: Cart3D, NCC, USM3D, and OVERFLOW. While the first three showed performance boosts from using HT, OVERFLOW did not.

In order to explain the differences in performance that we saw, we introduced an efficiency metric to quantify processor resource utilization. Using the metric, we find that efficiency in hyper-threaded mode is higher than in single-threaded mode across all core counts for all four applications. Since OVERFLOW did not see any improvement from HT, there must be other factors influencing performance. In particular, vectorization plays a key role, as OVERFLOW was by far the most highly vectorized of the codes in the study.

HT increases competition for resources in the memory hierarchy, such as memory bandwidth. Moreover, HT performance is affected by increased communication pressure as additional processes compete for network resources such as HCA chips and IB switches. One factor that affects the results of our experiments is that we conducted a strong scaling study. Also, in the analysis we have assumed that the applications scale perfectly from n to 2n ranks, and thus that the entire performance impact in going from ST to HT mode is from the use of hyper-threading. We have not taken into account the changes in communication in our analysis of the results.

We found that unstructured-grid applications like NCC, Cart3D, and USM3D benefit from HT whereas the structured-grid application (OVERFLOW) did not. The unstructured codes usually have a low percentage of vectorization and can get a performance boost from HT provided competition from an additional thread does not cause load instructions to go deeper into the memory hierarchy to be satisfied. We also found an anti-correlation between the performance gain in HT mode and the ST–HT percentage difference for the primary data source for the four applications used in the present study.

As future work, we propose to quantify the impact of scaling and communication in HT mode. We also intend to investigate the impact of power and thermal efficiencies in HT mode.

ACKNOWLEDGMENT

We gratefully acknowledge Sharad Gavali's help with running NCC, and stimulating discussions with Johnny Chang, Jahed Djomehri, and Kenichi Taylor.

REFERENCES

[1] Intel Pentium 4 Processor Extreme Edition Supporting Hyper-Threading Technology, http://www.intel.com/products/processor/pentium4htxe/index.htm
[2] Intel Hyper-Threading Technology (Intel HT Technology), http://www.intel.com/technology/platform-technology/hyper-threading/
[3] D. Marr et al., "Hyper-Threading Technology Architecture and Microarchitecture," Intel Technology Journal, Volume 06, Issue 01, February 14, 2002. http://www.intel.com/technology/itj/archive/2002.htm
[4] J. Boisseau, K. Milfeld, and C. Guiang, "Exploring the Effects of Hyperthreading on Scientific Applications," technical session 7B, 45th Cray User Group Conference, Columbus, Ohio, May 2003. http://www.cug.org/7-archives/previous_conferences/2003/CUG2003/pages/1-program/final_program/20.tuesday.htm
[5] W. Huang, J. Lin, Z. Zhang, and J. M. Chang, "Performance Characterization of Java Applications on SMT Processors," International Symp. on Performance Analysis of Systems and Software (ISPASS), March 2005.
[6] S. Blackburn, P. Cheng, and K. McKinley, "Myths and Realities: The Performance Impact of Garbage Collection," Proc. SIGMETRICS '04, June 2004.
[7] Intel Microarchitecture (Nehalem), www.intel.com/technology/architecture-silicon/next-gen/
[8] S. Saini, A. Naraikin, R. Biswas, D. Barkai, and T. Sandstrom, "Early Performance Evaluation of a Nehalem Cluster Using Scientific and Engineering Applications," Proc. ACM/IEEE SC09, Portland, Oregon, Nov. 2009.
[9] S. Saini, P. Mehrotra, K. Taylor, M. Aftosmis, and R. Biswas, "Performance Analysis of CFD Application Cart3D Using MPInside and Performance Monitor Unit Data on Nehalem and Westmere Based Supercomputers," 13th IEEE Intl. Conf. on High Performance Computing and Communications, Banff, Canada, Sep. 2011.
[10] Intel Westmere, http://ark.intel.com/ProductCollection.aspx?codeName=33174
[11] SGI Altix ICE 8400, http://www.sgi.com/products/servers/altix/ice/
[12] Message Passing Toolkit (MPT) User's Guide, http://techpubs.sgi.com/library/manuals/3000/007-3773-003/pdf/007-3773-003.pdf
[13] PAPI 4.1.1 Release, http://icl.cs.utk.edu/papi/news/news.html?id=203
[14] D. J. Mavriplis, M. J. Aftosmis, and M. Berger, "High Resolution Aerospace Applications Using the NASA Columbia Supercomputer," Proc. ACM/IEEE SC05, Seattle, Washington, Nov. 2005.
[15] OVERFLOW, http://aaac.larc.nasa.gov/~buning/
[16] A. Quealy, R. Ryder, A. Norris, and N.-S. Liu, "National Combustion Code: Parallel Implementation and Performance," 38th AIAA Aerospace Sciences Mtg., Reno, Nevada, Jan. 2000.
[17] USM3D, http://aaac.larc.nasa.gov/tsab/usm3d/usm3d_52_man.html
[18] op_scope, Supersmith, Pebble Beach, CA, http://supersmith.com/op_scope
