3.2 Performance Monitor
Performance monitoring for an application begins when it is instantiated. MAPPER currently does not leverage information gathered in prior executions and instead adapts to application behavior for the current instantiation and input data. For each registered application instance, the performance monitor gets information about application behavior from low-level hardware performance counters (using the Linux performance event subsystem (perf [26])). For each thread of the application, the monitor specifies a set of hardware events that need to be monitored. perf periodically reads per-task hardware performance counters and gathers this information over time. This information is used to determine the characteristics of each thread and the application as a whole.
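As a concrete illustration of this per-task counting, the sketch below opens counters with the perf_event_open system call; the events shown (cycles and instructions) and the lack of error handling are simplifications, and the event set does not correspond exactly to the counters listed in Table 1.

```c
/* Minimal sketch of per-task counting with the Linux perf_event subsystem.
 * The events shown are illustrative; MAPPER's actual set is in Table 1. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>

static int open_counter(pid_t tid, uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    /* pid = tid, cpu = -1: follow this task wherever it is scheduled */
    return (int)syscall(__NR_perf_event_open, &attr, tid, -1, -1, 0);
}

/* For each application thread, the monitor opens one fd per event, enables
 * it, and periodically read()s and differences the 64-bit counts. */
void monitor_thread(pid_t tid)
{
    int cyc = open_counter(tid, PERF_COUNT_HW_CPU_CYCLES);
    int ins = open_counter(tid, PERF_COUNT_HW_INSTRUCTIONS);
    ioctl(cyc, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(ins, PERF_EVENT_IOC_ENABLE, 0);

    uint64_t cycles = 0, instructions = 0;
    read(cyc, &cycles, sizeof(cycles));
    read(ins, &instructions, sizeof(instructions));
    /* instructions / cycles gives the per-thread IPC used in Section 3.3 */
}
```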
MAPPER uses metrics derived from hardware performance counters commonly available on modern processors to assess application resource bottlenecks [44, 45]. On the Intel Xeon CPU E7-4820, we use five hardware performance counters and six performance events to compute four application per-thread performance metrics (Table 1).
The performance event counts of all threads of an application are aggregated to obtain per-application event counts (resource demand). Whether the application is memory-intensive or data-sharing-intensive (implying a high degree of cross-core data communication) is determined by comparing the application’s resource demand (from all of its associated tasks) to system capability (obtained via a one-time offline profiling). Table 2 lists the information currently maintained by the monitor in MAPPER, which includes IPC and the current number of active threads for each application.
In the event that a progress metric is not specified by the application, our runtime system uses IPC values obtained from perf (averaged over all active threads in the application) as a measure of progress. IPC as a measure of progress has the inherent flaw of capturing useless work, for example, spinning on a synchronization variable. Techniques to eliminate measurement while spinning (possible if all application synchronization occurs through the runtime) help mitigate these flaws. Our experiments with microbenchmarks and OpenMP show that aggregate IPC tracks performance fairly well if the runtime is modified to identify and eliminate spinning from the measurement.
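One way to realize this spin filtering — purely a sketch, assuming all synchronization passes through the runtime and that the per-thread counter descriptors are accessible to it — is to pause a thread’s counters around the runtime’s internal spin loops. Whether the filtering happens this way or by post-processing samples is an implementation detail not specified here.

```c
/* Sketch: exclude spin-waiting from the counted work.  The per-thread fd
 * array and the spin-wait hook are illustrative names, not MAPPER's code. */
#include <linux/perf_event.h>
#include <sys/ioctl.h>

#define MAX_EVENTS 8
extern __thread int my_counter_fd[MAX_EVENTS];   /* opened as in the sketch above */
extern __thread int my_counter_cnt;

static void counters_enable(int on)
{
    unsigned long req = on ? PERF_EVENT_IOC_ENABLE : PERF_EVENT_IOC_DISABLE;
    for (int i = 0; i < my_counter_cnt; i++)
        ioctl(my_counter_fd[i], req, 0);
}

/* Wrapped around the runtime's spin loops so that useless spinning does not
 * inflate the IPC-based progress measure. */
void runtime_spin_wait(volatile int *flag)
{
    counters_enable(0);
    while (!*flag)
        ;                                        /* spin */
    counters_enable(1);
}
```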
3.3 Resource Allocation
To allocate resources across applications in a fair manner, our implementation places MAPPER’s resource allocation and mapping outside the parallel runtime and inside a separate privileged daemon. The MAPPER daemon pools knowledge of individual co-running applications’ scaling characteristics to better allocate resources to applications that use them more efficiently. This process is carefully moderated to provide well-defined performance guarantees to each application.
The MAPPER daemon is responsible for allocating and mapping resources to registered and simultaneously executing applications. Applications that run in the absence of others will have access to all system resources. Ideally, in these cases, MAPPER will pick the mapping and resource allocation that results in the best performance. For applications that scale well, the best performance is attained when using all available resources. For others, a reduction in resource assignment could result in better performance.
MAPPER relies on information from perf to tune and control application parallelism and prevent saturation of resources critical to both application and system-wide performance. For example, applications that saturate memory bandwidth may benefit from a reduction in number of threads to keep queuing delays to a minimum. Applications without any perceived hardware bottlenecks may benefit from an increased number of threads. To avoid repeated bad decisions, the MAPPER daemon maintains a history of application performance at different degrees of parallelism.
In this article, we focus on hardware context allocation while taking resource bottlenecks into account. The MAPPER daemon maintains a per-application count of the number of worker threads. Applications are initially granted a fair share (FairCoreCnt) of hardware contexts, depending on the system load. Assuming equal priority, we divide the total available cores uniformly across all running applications; for example, four registered parallel applications with sufficient parallelism to utilize the whole machine will each be granted 25% of the hardware contexts.
Using data collected over the application execution, MAPPER calculates the parallel efficiency at different core counts (number of cores used). The performance that applications can fairly expect to achieve (MaxFairPerf) is calculated based on the application performance data available for the first \( n \) core counts, where \( n \) is the application’s fair share of CPU cores.
Each application is guaranteed a minimum performance (Equation 3), defined as being within a specified percentage (MinQoS) of MaxFairPerf (Equation 2). MAPPER can then determine the number of cores it can spare (to a better scaling application) based on this lower limit on performance. We use a value of 85% for MinQoS in our experiments (see Section 4 for a sensitivity analysis). To determine the number of spare cores, MAPPER first estimates the difference between the minimum guaranteed performance and the performance at the current core count. This extra performance is calculated as a fraction of the total performance achieved at the current core count. A fraction of cores equal to the fraction of extra to current performance is then estimated as spare cores (Equation 4). This calculation makes the conservative assumption that performance is linear with the number of cores, which is typically not the case, allowing us to ensure that MinQoS indeed places a lower bound on performance. It should be possible to adopt other application-specific definitions of progress and their associated notions of minimum QoS. We do not explore them in this article.
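Read together, the relations above can be summarized as follows; this is a compact restatement of the prose (with \( \mathit{Perf}(c) \) the measured performance at \( c \) cores and \( \mathit{CurCoreCnt} \) the current allocation), and the precise forms appear in Equations (2)–(4).

\begin{align*}
  \mathit{MaxFairPerf} &= \max_{1 \le c \le \mathit{FairCoreCnt}} \mathit{Perf}(c), \\
  \mathit{MinPerf} &= \mathit{MinQoS} \times \mathit{MaxFairPerf}, \\
  \mathit{SpareCoreCnt} &= \left\lfloor \frac{\mathit{Perf}(\mathit{CurCoreCnt}) - \mathit{MinPerf}}{\mathit{Perf}(\mathit{CurCoreCnt})} \times \mathit{CurCoreCnt} \right\rfloor .
\end{align*}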
In the absence of an application-specific progress metric, instructions per cycle (IPC) is used as a measure of progress. We define performance (Perf), a measure of progress, as the sum of the IPCs (or application-specific metric if defined) of each core used in a parallel execution.
An efficiency metric as shown in Equation (6) is calculated for each application. Parallel efficiency (a measure of performance relative to the resources used to achieve it) at \( n \) cores is usually defined as \( T_1/(T_{n}\times n) \), where \( T_1 \) is the execution time when using one core and \( T_n \) is the execution time at \( n \) cores: essentially the speedup divided by the core count. The expectation of this efficiency calculation is that single-core (sequential) execution efficiency is 1. Since our calculations are dynamic and in terms of progress toward the eventual goal of application completion, we slightly modify this definition. We define parallel efficiency at a particular core count as the execution’s progress metric (Perf) divided by the core count (shown in Equation (6a)). Since this parallel efficiency is defined in terms of an application-specific progress metric (even baseline IPC varies as a function of the computation), we then use a normalization factor (similar to the expectation that sequential execution has an efficiency of 1). We choose the maximum efficiency across all core counts as a normalizing factor (Equation (6b)). Using this normalizing factor, a normalized efficiency is calculated for every core count (Equation (6c)). A value of 1 at a particular core count indicates that the application is at its highest efficiency (as known to MAPPER at that point in time). If and when MAPPER identifies a more efficient core count, these values will be updated to reflect the change. The normalized efficiency (NormEff) allows cross-application comparison when improving overall system efficiency.
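In shorthand, with \( \mathit{Perf}(c) \) denoting the progress metric measured at \( c \) cores, the calculation described above corresponds to the following restatement of Equations (6a)–(6c):

\begin{align*}
  \mathit{Eff}(c) &= \frac{\mathit{Perf}(c)}{c}, \\
  \mathit{MaxEff} &= \max_{c\,\in\,\text{observed core counts}} \mathit{Eff}(c), \\
  \mathit{NormEff}(c) &= \frac{\mathit{Eff}(c)}{\mathit{MaxEff}} .
\end{align*}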
The normalized parallel efficiency is generally expected to be close to 1 for scalable applications. It is possible for the normalized efficiency to be significantly lower at smaller core counts if the application experiences super-linear speedup at higher core counts. Super-linear speedup can occur when the data used for computation by each thread fits within a faster and higher level of the memory hierarchy; a smaller degree of parallelism often increases the data handled on a per-thread basis, spilling over into slower but larger caches or memory. It is also possible for parallel efficiency to fall off with an increasing number of threads (due to excess synchronization, context switching overheads, higher contention, etc.).
Applications are prioritized according to their parallel efficiency, and any free hardware contexts or spare cores from applications that scale more poorly are reallocated to the higher-efficiency applications, so long as quality-of-service requirements are met for each application. If an application can benefit from more resources than it is currently using (for example, when it is in an exploratory phase), then free resources in the system are allocated to it. When free resources are absent, MAPPER attempts to allocate spare cores from an application whose parallel efficiency is lower than its own. If any application sees its performance drop (by more than 5%) below MinQoS as a result of granting its resources to other applications, then the reallocation is reverted.
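The reallocation pass just described can be sketched as the following greedy loop; the data layout, field names, and revert step are illustrative rather than MAPPER’s actual code, while the MinQoS and 5% thresholds follow the text.

```c
/* Sketch of the greedy spare-core reallocation described above. */
typedef struct {
    double norm_eff;   /* NormEff at the current allocation              */
    int    cores;      /* hardware contexts currently held               */
    int    spare;      /* spare cores, estimated as in Equation (4)      */
    int    wanted;     /* additional contexts the application could use  */
} app_t;

/* apps[] is sorted by norm_eff in descending order. */
void reallocate(app_t *apps, int napps, int free_cores)
{
    for (int i = 0; i < napps; i++) {
        /* free contexts first */
        while (apps[i].wanted > 0 && free_cores > 0) {
            apps[i].cores++; apps[i].wanted--; free_cores--;
        }
        /* then spare cores taken from less efficient applications */
        for (int j = napps - 1; j > i && apps[i].wanted > 0; j--) {
            while (apps[j].spare > 0 && apps[i].wanted > 0) {
                apps[j].spare--; apps[j].cores--;
                apps[i].cores++; apps[i].wanted--;
            }
        }
    }
    /* Any transfer that later drops a donor more than 5% below its MinQoS
     * performance bound is reverted (not shown). */
}
```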
An application running under MAPPER is typically in one of two states: steady or exploratory. Steady state is entered at the end of an exploratory phase, during which a configuration that meets QoS and system efficiency goals is chosen. The runtime system continually monitors performance. Periodically, if performance changes significantly or new hardware bottlenecks are detected (and occasionally even without performance change, to capture gradual changes), the runtime system transitions the application to the exploratory state. The period may be dynamically adjusted based on application behavior [4].
During an exploration phase, MAPPER uses an application’s parallel efficiency information (which reflects the existence of bottlenecks) to determine the direction of configuration change. The number of threads in the configuration explored is reduced for applications that saturate memory bandwidth and increased for applications without any perceived hardware bottlenecks. Information on application bottlenecks and behavior may also help reduce the number of unnecessary states explored. In particular, if the application is a source of heavy coherence traffic (inter- or intra-socket), then the degree of parallelism is changed in multiples of the cores within a socket to enable localization of coherence traffic within as few sockets as possible. If the new configuration results in an improvement (we use a 5% threshold) in performance, then the change is made permanent and exploration continues in the same direction (whether an increase or decrease in active parallelism). If the new configuration results in reduced performance, then the previous configuration is restored and the application is transitioned to the steady state.
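One exploration step can be sketched as follows; function and field names are illustrative, while the 5% improvement threshold and the socket-sized step for coherence-heavy applications follow the text.

```c
/* Sketch of one exploration step of the hill-climbing described above. */
enum bottleneck { BN_NONE, BN_COHERENCE, BN_MEM_BANDWIDTH };

/* Direction of exploration: -1 for memory-bandwidth-saturated applications,
 * +1 when no hardware bottleneck is perceived. */
int initial_direction(enum bottleneck b)
{
    return (b == BN_MEM_BANDWIDTH) ? -1 : +1;
}

int next_thread_count(int cur, int dir, enum bottleneck b, int cores_per_socket)
{
    /* Coherence-heavy applications step in whole sockets so that sharing
     * can be localized to as few sockets as possible. */
    int step = (b == BN_COHERENCE) ? cores_per_socket : 1;
    return cur + dir * step;
}

/* After measuring the new configuration: keep it and continue exploring in
 * the same direction if it improved performance by more than 5%; otherwise
 * revert and transition to the steady state. */
int keep_configuration(double new_perf, double old_perf)
{
    return new_perf > 1.05 * old_perf;
}
```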
3.4 Resource Mapping
In addition to decisions on the number of hardware contexts to allocate to each application, MAPPER also carefully maps application threads to specific hardware contexts based on CPU, memory, cache, and interconnect resource needs using a refinement of the heuristics used by the SAM system [44, 45], which includes grouping threads by application. Input to this decision-making strategy is a priority-sorted list of applications, each annotated with its critical bottleneck. Most modern multiprocessor systems have a number of hardware resources, not all of which are utilized by every application. To arrive at the application’s critical bottleneck, we develop microbenchmarks to perform a one-time characterization of the system’s capability by stressing hardware resources, one at a time. During this characterization phase, we monitor each microbenchmark using hardware performance counters. Data from the counters help us arrive at thresholds that can be used to detect bottlenecks in application performance.
The applications are sorted according to the priority of the hardware bottleneck exhibited: data sharing (coherence traffic) bottleneck \( \gt \) memory bandwidth bottleneck \( \gt \) compute (IPC). Within each category, applications are sorted based on how severely they contribute to the overall bottleneck. Applications with higher contributions are given higher priority for resource mapping. The resource mapper walks this sorted list and determines the mapping strategy using the application’s bottleneck information maintained as shown in Table 2. Mapping strategies decide whether to colocate threads on the same core (if sharing- but not CPU-intensive), colocate them on the same socket (if sharing- and CPU-intensive), spread them across sockets (if memory-intensive), or avoid hyperthreads (if CPU-intensive), based on core availability and the application’s critical bottleneck.
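As a sketch of this strategy selection from the bottleneck information in Table 2 (the enum, flags, and default case are illustrative, not MAPPER’s actual code):

```c
/* Sketch of mapping-strategy selection from per-application bottleneck flags. */
typedef enum {
    MAP_SAME_CORE,          /* colocate on one core: sharing- but not CPU-intensive */
    MAP_SAME_SOCKET,        /* colocate on one socket: sharing- and CPU-intensive   */
    MAP_SPREAD_SOCKETS,     /* spread across sockets: memory-intensive              */
    MAP_AVOID_HYPERTHREADS  /* one thread per physical core: CPU-intensive          */
} strategy_t;

strategy_t pick_strategy(int sharing, int memory, int cpu)
{
    if (sharing && !cpu) return MAP_SAME_CORE;
    if (sharing &&  cpu) return MAP_SAME_SOCKET;
    if (memory)          return MAP_SPREAD_SOCKETS;
    if (cpu)             return MAP_AVOID_HYPERTHREADS;
    return MAP_SAME_SOCKET;   /* assumption: keep unclassified applications together */
}
```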
The above mapping strategy can assign one of many possible topological configurations of hardware contexts to an application. To preserve locality and minimize migrations, MAPPER attempts to minimize differences between successive cpuset mappings. When the system has multiple sockets, this problem is magnified, and so MAPPER generates a sorted list of sockets for each application, reflecting a preference for application thread placement based on the chosen mapping strategy.
For strategies that require colocation, sockets are sorted in descending order of the application’s residency score, which is the total number of hardware contexts already allocated to that application within each socket. The residency score is used to minimize the data movement and cache warm-up costs induced by thread migration. When new hardware contexts are allocated to the application, MAPPER will attempt to accommodate them in the sockets with the highest scores. If the prior mapping decision spread the application threads across sockets, either due to unavailability of hardware contexts (for colocation) or due to a change in the type of hardware bottleneck, then MAPPER will prefer to colocate them on the highest-scored sockets in the current decision-making interval. Naturally, when applications need to relinquish resources, they are released from the lowest non-zero-scored sockets.
When the mapping strategy necessitates distributing hardware contexts across sockets, the sockets are sorted in ascending order of their residency scores. If additional hardware contexts are allocated to the application, then they are placed on the sockets with lowest scores (wherever possible). Conversely, if hardware contexts are taken away from an application, then those on the highest-scored sockets are freed.
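The socket preference ordering in both cases reduces to a sort on the residency score; the following sketch uses illustrative names and omits the treatment of zero-score sockets in the colocation case.

```c
/* Sketch: order sockets by residency score.  Colocation prefers high scores;
 * spreading prefers low scores.  New contexts come from the front of the
 * ordering; contexts to be released are freed from the back. */
#include <stdlib.h>

typedef struct {
    int socket_id;
    int residency;   /* contexts already held by this application here */
} sock_t;

static int by_residency_desc(const void *a, const void *b)
{
    return ((const sock_t *)b)->residency - ((const sock_t *)a)->residency;
}

static int by_residency_asc(const void *a, const void *b)
{
    return ((const sock_t *)a)->residency - ((const sock_t *)b)->residency;
}

void order_sockets(sock_t *s, int nsockets, int colocate)
{
    qsort(s, nsockets, sizeof(*s),
          colocate ? by_residency_desc : by_residency_asc);
}
```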
Within each selected socket, contexts that are already occupied by the application are retained when possible to further minimize migrations. Compute-intensive threads are preferably placed on physical cores that are occupied by hardware threads that are not compute bound.
When hardware contexts that satisfy the mapping strategy are not available on any socket, MAPPER attempts to pick hardware contexts that will result in the least amount of resource contention (whether CPU, cache, or interconnect). MAPPER’s resource allocation and mapping decisions constitute a complementary two-step process that, taken together, eliminates the need to explore the entire state space of \( ^{n}{P}_{c} \) possible mappings, where \( n \) is the number of threads and \( c \) is the number of contexts.
3.5 MAPPER Runtime Interface: Parallelism Regulation
As described in Section 3.1, applications participating in parallel efficiency regulation are launched by the MAPPER application launcher. If the application runtime so chooses, then it may use the launcher both to communicate additional application-specific progress metrics and to determine the current application-specific MAPPER resource allocation. Figure 1 shows the information flow between an application runtime and MAPPER, which takes place via a shared memory segment established between the daemon and the launcher. Runtime systems may choose to use the returned information on resource allocation to change task creation and granularity.
Example Use in OpenMP: Upon application initialization, the number of active threads in OpenMP is set as specified by the user or to any default value set in the system. If no such value is specified, then the runtime determines the number of hardware contexts available in the machine and sets the number of threads to this value. In OpenMP, active application parallelism can be controlled in two different ways. The most direct method to control the number of active threads is to change the size of the thread pool that performs the parallel computation. We modify OpenMP to control, at the beginning of a parallel region (defined by #OMP parallel), the thread pool used to execute that region.
However, once a parallel region is executing, threads may no longer be added or removed without changes to the application’s parallel structure. We can, however, change the number of active threads for each loop or other OMP task construct within the parallel region. OMP loops may be either statically or dynamically scheduled. To control the parallelism for static loops, we modify the GCC compilation code to introduce a parallelism_control function call at the beginning of these constructs. The parallelism_control function is called independently by every thread within the parallel region and is used to select the threads from the current worker pool that will participate in the current OMP computation. The remaining threads are queued at a barrier and may be used for the next OMP construct, depending on changes in resource allocation. OMP dynamic loops are implemented by having worker threads request tasks to execute upon completion of the previous task; if no additional tasks are found, the thread queues at a barrier. We modified the code within the OpenMP library so that excess worker threads (determined by calling parallelism_control) go directly to the barrier without looking for new tasks. Thus, the finest granularity of control within OMP is at the level of these parallel constructs. Our modifications do not restructure the computation, its granularity, or its load balance, and are compliant with OpenMP specifications.
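The gating performed by parallelism_control can be sketched as follows; the function name comes from the text, but the signature, the daemon query, and the barrier hook are illustrative and are not GCC’s actual internals.

```c
/* Sketch of gating participation in an OMP construct via parallelism_control. */
extern int mapper_allowed_threads(void);   /* hypothetical: reads the current
                                              allocation from the shared
                                              segment with the daemon        */
extern void construct_barrier_wait(void);  /* hypothetical barrier hook      */

/* Called independently by every thread at the top of a static loop or other
 * OMP construct; returns nonzero if this thread participates. */
int parallelism_control(int my_id, int team_size)
{
    int allowed = mapper_allowed_threads();
    if (allowed > team_size)
        allowed = team_size;
    return my_id < allowed;        /* lowest-numbered threads do the work */
}

/* Usage inside the modified runtime:
 *
 *   if (!parallelism_control(tid, nthreads)) {
 *       construct_barrier_wait();  // excess threads skip straight to the barrier
 *       return;
 *   }
 */
```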
We modified the OpenMP parallel constructs to interface with the launcher thread at these safe points to effect changes in the degree of parallelism and thereby reduce potential load imbalance and synchronization overheads. Additionally, when runtimes can change parallelism in response to MAPPER, synchronization and other runtime-related overheads can be reduced, making performance counter information more reflective of the actual computation performed by the worker threads. The alternative would be to modify the runtime to disable performance counters while executing synchronization and other thread creation/deletion code. This would be expensive and could misdirect MAPPER about real CPU contention or data sharing introduced by spinning synchronization or other runtime code.
We also added support for application-specific progress reporting by using a shared array with dedicated entries per application thread, padded to prevent false sharing. Application developers can use a newly introduced function call within OMP parallel constructs to update their thread’s dedicated entry and thereby report application progress. This information is then aggregated by the modified runtime and passed to MAPPER. We do not use this feature in our evaluation.
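A padded per-thread progress array of this form can be sketched as follows; the slot layout, array size, and reporting function name are illustrative, and in MAPPER the array would live in the shared segment established with the launcher rather than in static storage.

```c
/* Sketch of a per-thread progress array padded to cache-line size to avoid
 * false sharing between reporting threads. */
#include <stdint.h>

#define CACHE_LINE  64
#define MAX_THREADS 256

struct progress_slot {
    volatile uint64_t units_done;                 /* application-defined units */
    char pad[CACHE_LINE - sizeof(uint64_t)];      /* keep one slot per line    */
};

static struct progress_slot progress[MAX_THREADS];

/* Called by application code from within OMP parallel constructs; the
 * modified runtime aggregates these entries and passes them to MAPPER. */
void mapper_report_progress(int thread_id, uint64_t units)
{
    progress[thread_id].units_done += units;
}
```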