E-Mapper: Energy-Efficient Resource Allocation for Traditional Operating Systems on Heterogeneous Processors

Till Smejkal (till.smejkal@tu-dresden.de), TU Dresden, Germany; Robert Khasanov (robert.khasanov@tu-dresden.de), TU Dresden, Germany; Jeronimo Castrillon (jeronimo.castrillon@tu-dresden.de), TU Dresden, Germany; and Hermann Härtig (hermann.haertig@tu-dresden.de), TU Dresden, Germany

Abstract.

Energy efficiency has become a key concern in modern computing. Major processor vendors now offer heterogeneous architectures that combine powerful cores with energy-efficient ones, such as Intel P/E systems, Apple M1 chips, and Samsung’s Exynos CPUs. However, apart from simple cost-based thread allocation strategies, today’s OS schedulers do not fully exploit these systems’ potential for adaptive energy-efficient computing. This is, in part, due to missing application-level interfaces to pass information about task-level energy consumption and application-level elasticity.

This paper presents E-Mapper, a novel resource management approach integrated into Linux for improved execution on heterogeneous processors. In E-Mapper, we base resource allocation decisions on high-level application descriptions that users can attach to programs or that the system can learn automatically at runtime. Our approach supports various programming models including OpenMP, Intel TBB, and TensorFlow. Crucially, E-Mapper leverages this information to extend beyond existing thread-to-core allocation strategies by actively managing application configurations through a novel uniform application-resource manager interface. By doing so, E-Mapper achieves substantial enhancements in both performance and energy efficiency, particularly in multi-application scenarios. On an Intel Raptor Lake and an Arm big.LITTLE system, E-Mapper reduces the application execution time by 20% on average, with an average reduction in energy consumption of 34%. We argue that our solution marks a crucial step toward creating a generic approach for sustainable and efficient computing across different processor architectures.

1. Introduction

With the introduction of Intel Alder Lake processors (Intel, 2022), Intel follows the trend of heterogeneous CPUs, similar to the processor design seen in Arm big.LITTLE (Greenhalgh, 2011) and Apple M1 (Apple, 2020) for the domain of powerful x86 desktop computers and servers. Recently, AMD revealed that upcoming processor releases will also include heterogeneous versions (Liu, 2023; Branover et al., 2021). With this, all major processor vendors now include heterogeneous CPUs in their portfolio.

Heterogeneous CPUs typically combine a small number of high-performance cores with a larger number of energy-efficient ones on one System-on-Chip (SoC). High-performance cores offer higher single-thread performance at the cost of increased power consumption, while energy-efficient cores have lower single-thread performance but substantially reduce power consumption. This difference in execution characteristics requires revisiting strategies for resource management in modern operating systems.

The type of core to which an application thread is scheduled can significantly affect its performance. Beyond differences in single-thread performance among core types, some cores may support different instruction sets. For instance, the energy-efficient E-cores in Intel Alder Lake processors lack the AVX-512 extensions available in the P-cores (Cutress and Frumusanu, 2021). However, to what extent an application can profit from high-performance cores depends on its characteristics. For memory-bound and I/O-bound applications, for instance, the performance gap between core types can be negligible.

User-level requirements on application importance must also be accounted for in resource allocation. For instance, a foreground application like a video player is typically more important than a background update. The OS should allocate high-performance cores to the video player, even if other applications might benefit from these cores. Moreover, application priorities can change dynamically; for instance, an update process becomes more critical if explicitly triggered by the user. Ideally, this priority change should be immediately signaled to the OS, allowing prompt core reassignment and faster response to user requests.

Modern OS schedulers have begun addressing the challenges posed by heterogeneous cores. Windows 11 added support for the Intel Thread Director to manage process placement on different CPU core types (Cutress, 2021; Intel, 2021). Similarly, Linux introduced the Energy-Aware-Scheduler (EAS) extension in version 5.0, which considers the individual core performance when assigning them to processes in Arm’s big.LITTLE systems (Perret, 2019). In addition, the Linux community is working to enhance the default system scheduler with feedback from the Intel Thread Director (Chen, 2023; Neri, 2023). Intel Thread Director allows for better runtime decisions on where to allocate applications, as live characteristics of the application behavior are taken into account. However, these systems lack an API for applications to influence placement decisions beyond explicit core pinning. State-of-the-art strategies used in traditional operating systems rely on simple and fast heuristics, often resulting in unpredictable and suboptimal execution.

Optimizing the mapping of applications onto heterogeneous multi-core systems is a well-known problem in the embedded domain (Singh et al., 2013b). Solutions can be classified based on when the decision is taken, e.g., at compile-time, at runtime, or a combination of both (hybrid). Compile-time methods offer near-optimal mappings but lack adaptability to changing workloads, making them unsuitable for dynamic systems where workload changes are unpredictable. Runtime methods, in turn, consider the current system workload and generate mappings on-the-fly — at application launch or during its execution. Hybrid application mapping (HAM) approaches combine the benefits of both design-time and runtime methods (Pourmohseni et al., 2020). HAM methods offload compute-intensive calculations to compile-time, generating a set of Pareto-optimal mappings for each application, and adapt these intermediate solutions to the current workload at runtime. Modern HAM approaches consider heterogeneous processors to optimize energy efficiency while meeting certain Quality of Service requirements (Khasanov and Castrillon, 2020; Khasanov et al., 2021; Weichslgartner et al., 2018; Wildermann et al., 2015; Spieck et al., 2022).

Our work introduces E-Mapper, a runtime system designed to efficiently manage applications on desktop systems, servers, and mobile devices equipped with heterogeneous processors. Drawing from the embedded systems domain, our approach generalizes HAM-based approaches, expanding them to support the execution of both known and unknown applications – a scenario not commonly addressed in embedded systems. Moreover, E-Mapper accommodates user-defined priorities and supports dynamic changes in core assignments at runtime in response to application triggers. In this paper, we also discuss the modifications that are necessary to effectively support applications in E-Mapper. Specifically, we detail how we adapted application models such as OpenMP, Intel Threading Building Blocks, and TensorFlow for efficient utilization in an OS that integrates E-Mapper. Furthermore, we explain how E-Mapper can automatically learn and predict the application behavior at different configurations – a feature that is crucial to efficiently manage a wide variety of applications.

To our knowledge, E-Mapper is the first system capable of semi-automatically managing applications with dynamic properties on heterogeneous processors while accounting for user requirements and system triggers. E-Mapper aims to fill important gaps in modern operating systems, offering better system utilization, improved system performance, reduced energy consumption, and enhanced user experience. Our evaluation shows that E-Mapper could reduce the energy consumption by 34% and the execution time by 20% over all applications measured on the Arm Odroid XU3-E board and the Intel Raptor Lake Core i9-13900K.

The remainder of the paper is structured as follows: section 2 provides background on heterogeneous processors, their benefits, and associated challenges. Section 3 presents the design of E-Mapper, followed by section 4 outlining the resource allocation approach and section 5 detailing the runtime refinement of application configurations. Section 6 presents the evaluation of E-Mapper. In section 7, we discuss other solutions to the scheduling problem on heterogeneous CPUs, and section 8 concludes our work.

2. A Case for Heterogeneity

Heterogeneous computing has become mainstream in recent years. Various types of heterogeneous systems, including CPU-GPU, CPU-FPGA, CPU-SmartNIC, and CPU-ASIC combinations, have their individual strengths and weaknesses as well as their own scaling and management challenges. In this work, however, we focus on heterogeneous processors where different core types, all running the same Instruction Set Architecture (ISA), are combined on one socket or System-on-Chip (SoC).

2.1. Heterogeneity in one Processor

Heterogeneous processors have long been used in mobile and embedded devices. The Arm big.LITTLE architecture was the first commercial design for such processors, consisting of two tightly connected islands of processors – one island contains one or many high-performance cores, and the other contains one or many energy-efficient cores (Limited, 2013). The high-performance (big) cores feature high single-thread performance and operate at a higher power envelope. The energy-efficient (LITTLE) cores, on the other hand, feature much lower single-thread performance at a lower power envelope. The big.LITTLE architecture was later extended by Arm in the DynamIQ design, allowing more flexible configurations and more core types (Wathan, 2017). For instance, the recent DynamIQ Shared Unit-120 (dsu, 2023) allows combining up to three different core types in a single processor.

In 2020, Apple released their M1 chip, which is based on Arm and uses the big.LITTLE approach with two types of cores on one socket. The Apple M1 chip features one island with P-cores, equivalent to Arm’s big cores, and another island with E-cores, equivalent to Arm’s LITTLE cores. Later versions of the M1 chip, as well as the M2 and M3 chips, continue this design with varying combinations of P and E-cores.

A similar design is found in modern Intel x86 CPUs, such as the Alder Lake CPUs released in 2021 and the latest Raptor Lake CPUs released in 2022. In Core i5 versions and higher, these CPUs also feature a two-island design with P-core and E-core islands, following an approach similar to the Arm big.LITTLE architecture.

2.2. Benefits of Heterogeneous Processors

In the following, we discuss what trade-offs are opened up by heterogeneous processors.

Trading Performance against Energy

Modern processors, especially high-performance CPUs, have high static power consumption due to significant leakage, which makes them less energy-proportional when scaling frequencies and voltages. Consequently, applications that cannot leverage the full single-thread performance of individual cores waste considerable energy on high-performance cores. This is typical for memory-bound and I/O-bound applications. For such applications, energy-efficient cores with less static power are more suitable. By combining powerful and energy-efficient cores in one socket, operating systems can decide at runtime where to run applications.

Trading Performance against Space

High-performance P-cores on modern Intel Raptor Lake processors require significantly more space on the die than the energy-efficient E-cores, even though the number of E-cores is larger than that of P-cores. This creates interesting trade-offs for chip designers aiming to produce processors tailored for certain workloads, such as CPUs with many E-cores for I/O-heavy applications. However, these tailored systems can only be truly leveraged by operating systems that are aware of application characteristics and can balance the allocation of P-cores and E-cores effectively.

2.3. Challenges of Heterogeneous Processors

Figure 4. Overview of the E-Mapper application management approach and system design.

To illustrate the performance-energy trade-off and how it varies from application to application, consider the plot in figure 3. The figure shows different configurations for two applications from the NAS Parallel Benchmark suite on an Intel Raptor Lake Core i9-13900K. For ep.C (figure 3), there is a smooth gradient towards the upper right corner, both in terms of execution time and energy, indicating it runs well on both core types and scales with increasing core counts. Conversely, mg.C (figure 3) shows degraded performance and energy consumption on a heterogeneous core combination. Instead, this application performs much better on homogeneous configurations with a low overall core count.

Traditional system schedulers do not account for these differences. Moreover, there is no interface for applications to inform the operating system about their workload characteristics. E-Mapper addresses this by configuring applications to use only (near-)optimal configurations, highlighted in green in figure 3. For instance, E-Mapper will never execute mg.C in the less efficient red configurations in the upper right corner, even if executing alone. On the other hand, ep.C can use the upper right configuration when controlled by E-Mapper, as this configuration has benefits under certain circumstances. All this information is communicated within the E-Mapper system using a well-defined interface.

3. E-Mapper Design

This section details the E-Mapper system’s architecture and the various application types it supports. Figure 4 provides an overview of our approach, divided into design time and execution time components. At design time, a high-level application description file containing the application operating points is created. Each operating point represents a combination of application configuration, specific hardware resource allocation, and non-functional characteristics such as expected utility (e.g. instructions per second or transactions per second) and average power consumption (figure 4). This characterization of operating points can be done through various methods, from sophisticated design-space exploration using models, traces, and static analysis (Mariani et al., 2012; Hähnel and Smejkal, 2018; Castrillon et al., 2012) to simple measurement-based resource annotations.

The core part of E-Mapper lies in its execution time management approach. During this phase, E-Mapper selects one of the available operating points for each managed application from the provided application description files (figure 4), optimizing a user-selected global optimization target. For each application, E-Mapper allocates corresponding hardware resources of the heterogeneous processor and reconfigures the applications accordingly (figure 4). The resource allocation process considers runtime demands of applications and users, optimizing resource usage, application performance, and overall system energy consumption.

3.1. The E-Mapper System Architecture

E-Mapper includes three main components to interoperate with the operating system and client applications: (1) the E-Mapper resource manager, (2) a shared library (libMapper) for application-level integration, and (3) application description files. The system design is outlined on the right of figure 4.

There is a single instance of the central resource manager and one instance of the libMapper library per managed application in the system. The E-Mapper resource manager maintains an overview of the managed applications and their allocated resources. The E-Mapper resource manager does not replace a traditional OS scheduler but manages resource allocations and application configurations. It searches and selects new application configurations whenever a new application starts in the system, a running application exits, or system-level triggers occur. The selection process, guided by the application (figure 4) and hardware (figure 4) descriptions, optimizes the resource allocation for individual applications. Section 4 discusses the resource allocation optimization approach. The E-Mapper resource manager also monitors performance characteristics and power usage of all applications to train and calibrate the application mapping descriptions at runtime, as detailed in section 5.

The libMapper library establishes communication between the global resource manager and individual applications and applies the configuration selected by the E-Mapper manager. Our system uses protobuf (https://github.com/protocolbuffers/protobuf) messages over Unix sockets for a flexible and extensible communication interface. How the configuration is applied in a specific application depends heavily on the application type and its available level of elasticity, discussed in section 3.2.

Figure 5. Typical control flow between a managed application and the E-Mapper resource manager.

A typical control flow between a managed application and the E-Mapper resource manager is shown in figure 5. At application startup, the libMapper library initializes, registers with the resource manager, and reads the application description file, which contains an application identifier and the list of operating points. Additionally, the library performs checks on the application binary to identify potential configuration options. The collected information is transmitted to the E-Mapper manager, which thereon reevaluates the current system load and resource allocations and selects new configurations for all managed applications. These selected configurations are communicated back to the individual application libraries, which adapt the applications accordingly. Hence, the libMapper library responsible for the newly started application (App₁ in the figure) also receives an activate_config message and performs the initial application configuration. Such messages can occur anytime during an application’s lifetime, potentially reconfiguring the application multiple times over the course of its execution.

3.2. E-Mapper Application Types

E-Mapper distinguishes between different types of applications to optimally manage them within a heterogeneous processor. The main difference between the types lies in the adaptation options available to each type. The application type is either determined by the libMapper library during startup or specified in the application’s description file.

Static Applications

Applications without identified runtime adaptation mechanisms are labeled as static. Static applications are allocated to specific cores at runtime, always according to an operating point selected by the manager. Depending on the application description, the libMapper library might allocate specific application threads to individual cores or restrict the entire application to a subset of available processor cores. If applications run with more threads than the cores allocated by E-Mapper, the possible parallelism will be lower as threads will have to time-multiplex the same processor core.

Scalable Applications

The libMapper library searches for supported runtime libraries for adaptive execution, such as Intel TBB, OpenMP, and TensorFlow. If such libraries are used, the application is marked as scalable. E-Mapper can dynamically scale applications when resource allocations change. For example, if an application is reallocated to two E-cores on an Intel Raptor Lake system, the libMapper library not only adapts the CPU affinity but also instructs the runtime library to use two threads, allowing applications to better adapt to new configurations and reduce the interference due to resource overutilization.

Custom Applications

Application developers can provide their own extensions to support more application configuration options. For instance, we implemented a libMapper library extension that enables application-specific scaling for Kahn Process Networks (KPN) (Khasanov et al., 2018). Instead of generically scaling the number of threads, this extension scales specific parallel regions inside the application, leaving other parts unmodified. This allows fine-grained resource allocations for different application phases.

Future work could investigate further extensions that react to the allocated resources, such as using different algorithms or specialized code paths within the application or handling ISA-extension differences between core types. These configurations need to be application-specific and provided by the developer. Our libMapper provides an interface to extend communication with the E-Mapper manager, easily supporting application-specific adaptations.

3.3. E-Mapper within the System Stack

E-Mapper aims to extend the current system stack rather than replace an existing component. E-Mapper relies on OS components such as the scheduler or performance and power monitoring. Because E-Mapper’s core responsibility requires a holistic view of the system, we envision it playing a central role within user-space management, similar to modern system management daemons such as systemd, OpenRC, and launchd. Hardware description files (figure 4) would be provided by the system manufacturer or distribution, while application description files (figure 4) would be shipped together with the applications. Managing these description files in a central file hierarchy would allow a well-defined and user-extensible configuration database, enabling easy adjustments and powerful resource management by E-Mapper.

Making Applications E-Mapper-ready

To achieve the best results with E-Mapper, applications should be equipped with a description file tailored to the hardware they run on. For static or scalable applications supported by libMapper out-of-the-box, no further adjustments are needed. The library automatically identifies if the application can be dynamically adapted and reconfigures it at runtime accordingly. Custom application support, like for KPNs, requires changes to the application itself for proper E-Mapper-integration. If an application lacks a description file or has an incomplete one, E-Mapper can automatically learn the application’s behavior at runtime. The Online Configuration Refinement algorithm is described in detail in section 5.

4. E-Mapper Resource Allocation

The core challenge addressed by E-Mapper is the optimal allocation of resources to applications. Thus E-Mapper’s algorithm is designed to balance the resource needs of applications with overall system energy efficiency.

4.1. System Model

In E-Mapper, a platform is equipped with a heterogeneous processor $\mathcal{P}$ with $m$ core types, represented as the vector $\mathcal{P}[\vec{\Theta}]=[\Theta_{1},\dots,\Theta_{m}]^{T}$ . Each core type, while sharing the same Instruction Set Architecture (ISA), exhibits distinct performance-energy characteristics and may have varying hardware properties such as different ISA extensions or the number of supported hardware threads. Since the performance and energy characteristics of multi-threaded cores also vary depending on the number of hardware threads in use, our algorithm ensures that applications are isolated at the core level and thus do not share sibling hardware threads.

When an application $\sigma$ connects to E-Mapper, it sends an application description containing a list of operating points $\sigma[\Phi]$ . Each operating point $\sigma[\phi_{i}]\in\sigma[\Phi]$ is defined by the required resources $\vec{\theta}$ , (normalized) utility $\upsilon$ , and average power consumption $\rho$ , i.e., $\sigma[\phi]=\phi\langle\vec{\theta},\upsilon,\rho\rangle$ . Resource requirements are specified at the core level; if an operating point uses multiple hardware threads of a single processor core, this core is counted only once in $\phi[\vec{\theta}]$ . Operating points are assumed to be Pareto-filtered, meaning each point is better than any other in at least one parameter, such as fewer cores of a particular core type $\theta_{k}$ , higher utility $\upsilon$ , or lower power consumption $\rho$ .
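The operating-point tuple and the Pareto-filtering assumption can be illustrated with a short sketch. This is not E-Mapper's implementation: the `OperatingPoint` class, the helper names, and all numeric values are hypothetical and chosen only to mirror the symbols $\vec{\theta}$ , $\upsilon$ , and $\rho$ defined above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class OperatingPoint:
    """Hypothetical container mirroring phi<theta, upsilon, rho>."""
    cores: Tuple[int, ...]  # cores required per core type (theta)
    utility: float          # normalized utility (upsilon), higher is better
    power: float            # average power in watts (rho), lower is better


def dominates(a: OperatingPoint, b: OperatingPoint) -> bool:
    """True if a is no worse than b in every parameter and better in one."""
    no_worse = (all(x <= y for x, y in zip(a.cores, b.cores))
                and a.utility >= b.utility and a.power <= b.power)
    strictly_better = (any(x < y for x, y in zip(a.cores, b.cores))
                       or a.utility > b.utility or a.power < b.power)
    return no_worse and strictly_better


def pareto_filter(points: List[OperatingPoint]) -> List[OperatingPoint]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up points for a platform with two core types, ordered (P-cores, E-cores).
points = [
    OperatingPoint((0, 2), 0.30, 4.0),
    OperatingPoint((0, 4), 0.55, 7.0),
    OperatingPoint((2, 0), 0.60, 18.0),
    OperatingPoint((2, 0), 0.50, 20.0),  # same cores, less utility, more power
    OperatingPoint((4, 4), 1.00, 45.0),
]
front = pareto_filter(points)
```

In this example, only the fourth point is removed, because the third point needs the same cores while offering higher utility at lower power.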

For each operating point, E-Mapper calculates the energy-utility cost $\phi[\zeta]$ , an adaptation of the traditional Energy-Delay Product (EDP) formula designed to balance energy efficiency with the performance impact of applications (Martin, 2001; Pénzes and Martin, 2002). Assuming that utility is inversely proportional to delay, the energy-utility cost is defined as follows:

(1)	$\displaystyle\phi[\zeta]=\left(\frac{\phi[\rho]}{\phi[\upsilon]}\right)\cdot\left(\frac{1}{\phi[\upsilon]}\right)$
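As a small worked example of the energy-utility cost, the following sketch evaluates the formula for two made-up operating points. The function name and the numbers are illustrative assumptions, not values from our measurements.

```python
def energy_utility_cost(power: float, utility: float) -> float:
    """Energy-utility cost: (power / utility) * (1 / utility).

    With utility assumed inversely proportional to delay, power / utility
    plays the role of energy and 1 / utility the role of delay in the EDP.
    """
    if utility <= 0:
        raise ValueError("utility must be positive")
    return (power / utility) * (1.0 / utility)


# Hypothetical points: 18 W at utility 0.6 versus 4 W at utility 0.3.
cost_fast = energy_utility_cost(18.0, 0.6)  # about 50.0
cost_slow = energy_utility_cost(4.0, 0.3)   # about 44.4
```

Because utility enters quadratically, a point that halves utility must reduce power by more than a factor of four to obtain a lower cost.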

4.2. Optimization Problem

The E-Mapper resource allocation algorithm selects one operating point per application in a way that minimizes the overall system energy-utility cost. This optimization problem can be formulated as:


(2a)	minimize	$\displaystyle\sum_{\sigma\in\Sigma}\sigma[\phi^{*}][\zeta],$
(2b)	subject to	$\displaystyle\sum_{\sigma\in\Sigma}\sigma[\phi^{*}][\vec{\theta}]\leq\mathcal{P}[\vec{\Theta}].$

Equation 2a minimizes the sum of the energy-utility costs of each application’s selected operating points. The constraint in equation 2b ensures that the total resource demand from all selected operating points does not exceed the available resources for each processor core type.

This optimization problem is equivalent to the multiple choice multi-dimensional knapsack problem (MMKP) (Martello and Toth, 1990). In MMKP, all items have a scalar value and a multi-dimensional weight, and they are divided into several groups. The goal is to select a single item from each group (multiple-choice) so that the overall value is maximized and the overall weight does not exceed the maximum allowed weight at each dimension.

In our optimization problem, each application’s operating points represent items in a group. The goal is to select one operating point per application (one item per group) such that the overall energy-utility cost is minimized. Here, the weight is the number of used cores of each type, and the value of the item is the negative energy-utility cost, so that maximizing the overall value is equivalent to minimizing the overall cost.

Given that MMKP is NP-hard (Puchinger et al., 2010), finding an optimal solution within a reasonable time, especially for a large number of applications and operating points, is computationally challenging. Since our resource allocation algorithm is used at runtime, E-Mapper employs a state-of-the-art approximate algorithm based on Lagrangian relaxation, which solves the optimization problem with relaxed constraints and then selects the resultant operating points that fulfill the resource constraints. The detailed description of the algorithm is provided in (Wildermann et al., 2014, 2015).
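To give an intuition for the relaxation idea, the sketch below uses a plain subgradient update on one Lagrange multiplier per core type and keeps the best feasible selection it encounters. It is a deliberately simplified illustration and not the algorithm of Wildermann et al.; the function names, the step-size schedule, and the example data are assumptions.

```python
from typing import List, Optional, Sequence, Tuple

# An operating point reduced to (cores per type, energy-utility cost).
Point = Tuple[Tuple[int, ...], float]


def relaxed_cost(point: Point, lam: Sequence[float]) -> float:
    """Cost plus the multiplier-weighted resource demand of one point."""
    cores, cost = point
    return cost + sum(l * c for l, c in zip(lam, cores))


def select_points(apps: List[List[Point]], capacity: Sequence[int],
                  iterations: int = 200, step: float = 1.0
                  ) -> Optional[List[int]]:
    """Return one point index per application, or None if none was feasible."""
    lam = [0.0] * len(capacity)
    best, best_cost = None, float("inf")
    for it in range(1, iterations + 1):
        # With relaxed constraints, each application can be solved on its own.
        choice = [min(range(len(pts)), key=lambda i: relaxed_cost(pts[i], lam))
                  for pts in apps]
        demand = [sum(apps[a][c][0][k] for a, c in enumerate(choice))
                  for k in range(len(capacity))]
        if all(d <= cap for d, cap in zip(demand, capacity)):
            cost = sum(apps[a][c][1] for a, c in enumerate(choice))
            if cost < best_cost:
                best, best_cost = choice, cost
        # Raise the price of overused core types, lower it for unused ones.
        lam = [max(0.0, l + (step / it) * (d - cap))
               for l, d, cap in zip(lam, demand, capacity)]
    return best


# Two hypothetical applications on a platform with 2 P-cores and 4 E-cores.
capacity = (2, 4)
apps = [
    [((2, 0), 1.0), ((1, 0), 2.5), ((0, 2), 3.0)],
    [((2, 0), 1.5), ((0, 4), 2.0), ((0, 2), 4.0)],
]
choice = select_points(apps, capacity)
```

In this example, both applications initially prefer the two P-cores; the multipliers then push the second application to its four-E-core point. A `None` result corresponds to the situation discussed under Limitations, where no selection satisfies the constraint.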

Limitations

It is possible that the resource allocation algorithm does not find any suitable operating point for some of the applications due to prior allocations. This situation may occur if the number of managed applications exceeds the number of available resources. In such cases, E-Mapper temporarily relaxes the constraint in equation 2b, allowing applications to execute in co-allocation. Since co-allocation adversely affects the performance of the co-allocated applications, E-Mapper does not perform performance monitoring while applications are co-allocated, as discussed in the next section.

5. Exploring Operating Points at Runtime

Traditionally, operating points for HAM approaches have been generated at design time, leveraging substantial computational resources to identify Pareto-optimal configurations (Pourmohseni et al., 2020). However, in practical PC scenarios, the operating system may not have predefined operating points, or the available ones might be imprecise due to variations in hardware, despite using similar architectures. Runtime exploration of operating points represents a synergy between Design Space Exploration (DSE) and Runtime Resource Management. This integration poses several challenges, primarily the need to identify effective operating points quickly. Once identified, the algorithm continues refining the Pareto front as new measurements become available.

During exploration, online measurements determine the utility and power consumption of the running applications. If applications do not provide their own utility metrics, we use Instructions Per Second (IPS) as a generic measure of utility. Performance metrics, such as IPS and power, are not constants; they fluctuate due to measurement noise and varying application stages. This variability requires periodic re-evaluation to ensure the robustness and reliability of solutions. Furthermore, the search for near-optimal configurations must be seamlessly integrated within the runtime resource management of the E-Mapper manager. This integration ensures that while exploring new operating points, the selected configurations do not compete with other concurrently running applications on processor cores. It also requires that the resource management algorithm allocates sufficient resources to new applications, allowing them substantial solution space for exploration without significantly undermining the performance of existing applications.

To address these challenges, we have enhanced the resource allocation algorithm in E-Mapper to include the exploration of operating points directly during application execution. This process involves continuous performance monitoring and employing a regression model to dynamically adjust operating points, ensuring robust, Pareto-optimal configurations. Our algorithm effectively balances new operating point exploration with efficient resource allocation, optimizing overall system performance.

5.1. Runtime Performance and Power Monitoring

Accurate monitoring of runtime performance metrics is essential for a good approximation of Pareto fronts. To reliably measure performance values such as IPS, we employ the Linux performance monitoring subsystem perf (most of its functionality is provided by the system call perf_event_open), which allows monitoring various performance-related hardware and software metrics, such as retired instructions and cache misses. With the proper configuration, perf automatically multiplexes measurements for different applications.

For assessing power consumption, we use built-in power sensors available in modern systems, such as the RAPL (Running Average Power Limit) counters on Intel machines. However, these sensors generally measure the total energy consumption of the system rather than per application. To address this challenge, we build atop EnergAt (Hè et al., 2023), which monitors RAPL counters alongside the execution metrics of each thread across the hardware threads.

Since EnergAt does not support different core types in heterogeneous CPUs, we introduce power coefficients between core types ( $P^{P}=\gamma\cdot P^{E}$ , determined statically) to attribute total energy consumption ( $E_{\Delta}^{CPU}$ ) to P-cores ( $E_{\Delta}^{P}$ ) and E-cores ( $E_{\Delta}^{E}$ ):

(3)	$\displaystyle E_{\Delta}^{CPU}=E_{\Delta}^{P}+E_{\Delta}^{E}=T_{total}^{P}\cdot P^{P}+T_{total}^{E}\cdot P^{E}$

where $T_{total}^{P|E}$ is the sum of CPU time for all threads running on P-cores and E-cores, respectively. After approximating $E_{\Delta}^{P}$ and $E_{\Delta}^{E}$ values, we can employ the EnergAt methodology to further attribute energy to the applications running on homogeneous subsets of cores.
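Substituting $P^{P}=\gamma\cdot P^{E}$ into equation 3 gives $P^{E}=E_{\Delta}^{CPU}/(\gamma\cdot T_{total}^{P}+T_{total}^{E})$ , from which both energy shares follow. The sketch below works through this arithmetic; the function name and the numbers, including the value of $\gamma$ , are illustrative assumptions rather than measured coefficients.

```python
from typing import Tuple


def split_energy(e_cpu: float, t_p: float, t_e: float,
                 gamma: float) -> Tuple[float, float]:
    """Split total CPU energy into a P-core share and an E-core share.

    e_cpu: total energy in the interval (joules)
    t_p, t_e: summed CPU time of all threads on P-cores and E-cores (seconds)
    gamma: statically determined power ratio, P^P = gamma * P^E
    """
    weighted_time = gamma * t_p + t_e
    if weighted_time <= 0:
        return 0.0, 0.0  # nothing ran in this interval
    p_e = e_cpu / weighted_time  # implied per-thread E-core power
    return t_p * gamma * p_e, t_e * p_e


# Hypothetical interval: 30 J in total, 2 s on P-cores, 4 s on E-cores,
# and an assumed gamma of 3.
e_p, e_e = split_energy(30.0, 2.0, 4.0, 3.0)  # 18 J and 12 J
```

The two shares always sum to the measured total, so the attribution redistributes energy between core types without creating or losing any.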

Considering the inherent variability in measured IPS and power, we apply an exponential moving average (EMA) to stabilize these metrics while still recognizing changes in application stages and their corresponding power and performance characteristics. The EMA is updated as follows:

(4)

value_{new}=value_{measured}\cdot\alpha+value_{old}\cdot(1-\alpha)

where $\alpha$ is a smoothing factor (E-Mapper uses $\alpha=0.1$). This formula smooths short-term fluctuations while adapting to significant shifts in application behavior, ensuring accurate and responsive performance and power profiling.
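The update rule in Equation 4 can be sketched in a few lines. The helper names `ema_update` and `updates_until_within` are hypothetical. The second helper only illustrates how quickly the average follows a step change in the measured value.

```python
def ema_update(old, measured, alpha=0.1):
    """One EMA step as in Equation 4; alpha = 0.1 is the value used by E-Mapper."""
    return measured * alpha + old * (1 - alpha)


def updates_until_within(start, target, tolerance, alpha=0.1):
    """Count EMA updates until the average is within `tolerance`
    (a fraction of the step size) of a new constant measurement."""
    value, steps = start, 0
    while abs(target - value) > tolerance * abs(target - start):
        value = ema_update(value, target, alpha)
        steps += 1
    return steps
```

With $\alpha=0.1$, the remaining error after $n$ updates is $0.9^{n}$ of the original step, so `updates_until_within(100.0, 200.0, 0.1)` returns 22: the average needs 22 measurements to cover 90% of a sudden change. This is the trade-off between stability and responsiveness that the smoothing factor controls.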

5.2. Selection of the Regression Model

We evaluated several regression models for predicting IPS and power consumption for unexplored operating points. Each operating point's configuration is characterized by a vector that includes the number of cores of each type and their hardware thread usage. For instance, on the Intel Raptor Lake platform, the configuration vector includes the number of E-cores, P-cores using one hardware thread, and P-cores using both hardware threads. The models assessed were Polynomial Regression (degrees 1 to 3), Neural Networks (NN), and Support Vector Machines (SVM). We used pre-measured configurations across 15 applications on the Intel Raptor Lake Core i9-13900K as the data sets. Each model was evaluated using training subsets of different sizes across 10 randomly generated seeds for robustness.

The first two plots in Figure 6 show the Mean Absolute Percentage Error (MAPE) for predicted IPS and power values. Polynomial regression models improved in accuracy for both IPS and power as the training size increased. Higher-degree polynomial models achieved greater accuracy at larger training sizes, albeit requiring more data points to converge. Conversely, NN and SVM models performed better in predicting power values at larger training sizes but were significantly worse in predicting IPS compared to polynomial models.

Furthermore, we compare the predicted Pareto fronts with a reference Pareto front (derived from the measured configurations) using Inverted Generational Distance (IGD) (Coello Coello and Reyes Sierra, 2004) and the ratio of common operating points. IGD measures the average distance from points on the reference Pareto front to the nearest point on the generated front, assessing the coverage of the generated front relative to the reference one. The bottom two plots in Figure 6 show that polynomial models consistently outperformed SVM and NN in aligning with the reference Pareto front. In particular, polynomial regression models of degrees 2 and 3 outperformed the first-degree model in approximating the reference front. The second- and third-degree models produced similar accuracy, but the second-degree model was more efficient, requiring only 20 training points to generate robust Pareto fronts. Based on this efficiency, we selected the second-degree polynomial regression model for our runtime exploration approach.
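The IGD metric described above can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code: operating points are reduced to hypothetical `(power, ips)` pairs, lower power and higher IPS are treated as better, and distances are plain Euclidean distances on values assumed to be already normalized.

```python
import math


def pareto_front(points):
    """Keep the (power, ips) points that no other point dominates.

    A point q dominates p if q needs no more power and delivers no less IPS,
    and q differs from p in at least one of the two values.
    """
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] >= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return front


def igd(reference, generated):
    """Average distance from each reference point to its nearest generated point."""
    total = sum(min(math.dist(r, g) for g in generated) for r in reference)
    return total / len(reference)
```

An IGD of zero means that every point of the reference front also appears on the generated front. Larger values indicate regions of the reference front that the generated front does not cover.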

5.3. Runtime Exploration Algorithm Design

As discussed earlier, it is critical to integrate the runtime exploration of operating points seamlessly into the resource allocation algorithm. The chosen points for measurement should not overlap with processor cores used by concurrently running applications. Simultaneously, the resource allocation algorithm should allocate sufficient resources to ensure effective exploration without adversely affecting the performance of other applications.

We categorize the maturity of an application’s operating points into three stages: (1) Initial stage, where there are insufficient measured operating points making approximations unreliable; (2) Refinement stage, characterized by an intermediate number of measured points but still limited accuracy in approximations; and (3) Stable stage, at which sufficient operating points have been explored, enabling reliable approximations.

Upon invocation, the resource allocation algorithm, as detailed in Section 4, generates the current Pareto front using both measured and approximated operating points. This allocation determines the set of cores assigned to each application. If there are unassigned cores, the exploration algorithm allocates these to applications in the Initial and Refinement stages, allowing them to explore a broader configuration space. Applications in the Stable stage execute on the designated cores provided by the resource allocation algorithm without further configuration adjustments. For applications in the Initial and Refinement stages, specific exploration techniques are employed within the set of remaining cores:

In the Initial stage, the selection heuristic for the next operating point is based on the configuration vector, choosing the configuration furthest from the measured configurations to maximize exploration diversity.

In the Refinement stage, the heuristic is based on approximated IPS and power values and uses an auxiliary regression, which includes a "zero" configuration (zero power and zero IPS) to anchor the model. The heuristic selects configurations based on the largest discrepancies between the main and the auxiliary regression models. These discrepancies are calculated as the geometric mean of the relative differences in IPS and power values. If both models predict negative IPS or power values for some configurations, discrepancies are calculated relative to a zero value to increase the likelihood of measuring configurations that exhibit such anomalous predictions.

After a fixed number of measurements, the exploration process repeats until the application progresses to the Stable stage. When applications are in the Stable stage, the resource allocation algorithm is invoked again after a larger number of measurements to reassess the current allocation and potentially switch to another one.
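The two selection heuristics can be illustrated with a short sketch. Several details are assumptions, since the text does not fix them: "furthest" is interpreted as the largest Euclidean distance to the nearest measured configuration, relative differences are taken with respect to the main model's prediction, and only positive predictions are handled (the special case for negative predictions is omitted). The names `pick_furthest` and `discrepancy` are hypothetical.

```python
import math


def pick_furthest(candidates, measured):
    """Initial stage: choose the candidate configuration vector whose
    nearest measured configuration is furthest away."""
    if not measured:
        return candidates[0]  # nothing measured yet, any candidate will do
    return max(
        candidates,
        key=lambda c: min(math.dist(c, m) for m in measured),
    )


def discrepancy(main_pred, aux_pred):
    """Refinement stage: geometric mean of the relative IPS and power
    differences between the main and the auxiliary model.

    Both arguments are (ips, power) predictions and assumed to be positive.
    """
    rel_ips = abs(main_pred[0] - aux_pred[0]) / abs(main_pred[0])
    rel_power = abs(main_pred[1] - aux_pred[1]) / abs(main_pred[1])
    return math.sqrt(rel_ips * rel_power)
```

As a usage example on Raptor Lake configuration vectors (E-cores, P-cores with one hardware thread, P-cores with both hardware threads): if `(0, 0, 1)` and `(16, 0, 8)` have been measured, `pick_furthest([(0, 1, 0), (8, 4, 4), (16, 0, 8)], [(0, 0, 1), (16, 0, 8)])` selects `(8, 4, 4)`, the candidate in the least explored region.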

6. Evaluation

To demonstrate that E-Mapper enhances the management of energy-aware tasks on heterogeneous processors, we conducted an extensive evaluation on two different systems. We explicitly choose systems with distinct characteristics to underline that our approach is generic and works across various types of heterogeneous hardware.

6.1. Evaluation Setup

We evaluate our approach on an Arm-based Odroid XU3-E board (Hähnel and Härtig, 2014). The board features a Samsung Exynos 5422 processor, which implements an Arm big.LITTLE architecture with two core islands: a four-core A15 (big) island and a four-core A7 (LITTLE) island. The system is equipped with 2 GB of memory and energy sensors for the core islands, the memory, and the integrated graphics card. We run a custom-compiled Linux 6.6 kernel on the board, with full support for the Linux Energy-Aware Scheduler (EAS), Linux's built-in optimized application placement strategy (Perret, 2019; Community, 2024).

The second system is an Intel Raptor Lake Core i9-13900K, one of Intel’s latest heterogeneous processors. It consists of 8 high-performance P-cores, each supporting SMT, and 16 energy-efficient E-cores that do not support SMT. The system is equipped with 128 GB of memory. For energy measurements, we use the integrated RAPL counters, which have been proven accurate in fine-grained energy measurements (Smejkal et al., 2017; Hähnel et al., [n. d.]; Colmant et al., [n. d.]; Schöne et al., [n. d.]; Hackenberg et al., [n. d.]).

We run a custom-compiled Linux 6.4 kernel, based on the default Debian Testing kernel version, extended with a patch-set adding preliminary support for Intel Thread Director (ITD) (Neri, 2023). We further extended the patch-set to make the ITD classification of threads and the reference IPC per class determined by the hardware available to user-space. Inspired by Saez et al. (Saez and Prieto-Matias, 2022) we implemented a version of E-Mapper that uses ITD classifications to allocate processor cores to application threads.

For both platforms, we use the performance frequency governor and limit the maximum frequencies to prevent thermal throttling, allowing the cores to run at maximum frequencies for the entire benchmark execution. On the Raptor Lake system, we selected 4.6 GHz for the P-cores and 3.8 GHz for the E-cores, while on the Odroid system, we chose 1.2 GHz for the LITTLE and 1.8 GHz for the big cores.

6.2. Benchmarks

We use different sets of applications to evaluate the features of E-Mapper. To test dynamic adaptability, we use the OpenMP implementations of the NAS Parallel Benchmarks (Bailey et al., 1991), version 3.4.2. Since the Intel platform is more powerful than the Odroid board, we use different classes of the benchmarks for each system: class A for Odroid and class C for Intel.

On the Intel Raptor Lake platform, we also evaluate a selection of Intel Threading Building Blocks (Intel TBB) (Pheatt, 2008) benchmarks and two TensorFlow (Developers, 2022) applications. We chose the benchmarks binpack, fractal, parallel-preorder, pi, primes, and seismic from the official Intel TBB repository as they cover a wide spectrum of the building blocks of the framework. TensorFlow, an open-source framework for machine-learning algorithms, is also included in our evaluation. We implemented an E-Mapper-enabled version of TensorFlow Lite (tfl, 2023) that can dynamically scale its parallelism at runtime. We evaluate two models, VGG (Simonyan and Zisserman, 2014) and AlexNet (Krizhevsky et al., 2017), used for image recognition.

Additionally, we use two embedded KPN applications to test the custom extensions of E-Mapper: mandelbrot for calculating the Mandelbrot set (Khasanov et al., 2018), and lms implementing Leighton-Micali Signatures (McGrew et al., 2019). Both applications are used in two versions: one with a static application topology (annotated with static) and another with implicit data-parallelism in KPNs (Khasanov et al., 2018), used to demonstrate dynamic adaptation. Since KPN applications are targeted to embedded platforms, we chose to evaluate them only on the Odroid system.

6.3. Intel Raptor Lake

For our first experiment, we measure the execution of our selected benchmark collection on the Raptor Lake system. We evaluate both single benchmarks and parallel executions of multiple benchmarks. Every scenario is executed with resource management by E-Mapper and with ITD-based resource allocation as described earlier. With E-Mapper, we use application description files generated from a design space exploration prior to the actual measurement. The ITD-based allocation does not require this step and instead uses hardware-provided classification information for resource allocation. We also measure scenarios using the Linux built-in work-distribution techniques as a baseline.

For each scenario, we record the overall execution time (makespan) and total energy consumption. Each scenario is measured ten times, and the average results are reported. Figure 7 shows the results for all benchmarks, including geometric means for single application scenarios, multiple application scenarios, and all scenarios combined. The results are presented as improvement factors over the baseline; an improvement factor above 1 indicates faster execution or lower energy consumption, while below 1 indicates the opposite.
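The reported metrics can be reproduced from raw measurements with a few lines of code. The sketch below assumes the natural reading of the text, namely that an improvement factor is the baseline value divided by the managed value, so that values above 1 mean faster execution or lower energy consumption. The function names and the numbers in the usage note are hypothetical and not measured data.

```python
import math


def improvement_factor(baseline, managed):
    """Baseline makespan (or energy) divided by the managed one.

    A value above 1 means the managed run was faster (or used less energy).
    """
    return baseline / managed


def geometric_mean(factors):
    """Geometric mean of improvement factors, computed via logarithms."""
    return math.exp(sum(math.log(f) for f in factors) / len(factors))
```

For instance, a scenario that takes 10 s under the baseline and 8 s under the resource manager has an improvement factor of 1.25. The geometric mean is the appropriate average for such ratios: a factor of 2.0 in one scenario and 0.5 in another yields an overall factor of 1.0, whereas the arithmetic mean would suggest a net improvement.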

Single Application

For single application scenarios, benchmarks generally benefit from being managed by E-Mapper. The geometric mean improvement factor for E-Mapper is 1.11 for execution time and 1.57 for energy consumption. There are two notable cases to highlight. First, the binpack benchmark from Intel TBB shows a 10× higher IPS at low thread counts. E-Mapper leverages this behavior by scaling the application accordingly, while the baseline and the ITD-based manager cannot, resulting in an extreme improvement. Second, the benchmarks is, sp, and parallel-preorder run longer with E-Mapper but consume less energy due to E-Mapper’s optimization algorithm, which balances power consumption and utility.

The ITD-based resource allocation shows only minor differences from the baseline, with overall improvement factors of 1.01 for execution time and 1.04 for energy consumption. ITD improves the ua and TensorFlow vgg benchmarks by allocating P-cores to more active threads. However, it misclassifies threads for alexnet, leading to suboptimal performance.

Multiple Applications

For scenarios with multiple applications running in parallel, E-Mapper provides significant benefits. The average improvement factor is 1.62 for execution time and 1.81 for energy consumption. Most applications show improvements, with few exceptions matching the baseline.

Managing resources based on ITD information did not yield the expected improvements. Most scenarios only achieved performance similar to the baseline, resulting in an overall improvement factor of 0.95 for execution time and 0.98 for energy consumption when multiple applications ran in parallel. Some minor improvements were observed in scenarios like is and lu running in parallel, and when five applications ran simultaneously. However, combinations like vgg and ft and primes showed worse results, highlighting classification issues.

6.4. Odroid XU3-E

To demonstrate E-Mapper’s effectiveness on different hardware, we conducted a second experiment on the Odroid XU3-E platform, measuring single and multiple application scenarios. The baseline for this platform uses the Linux Energy-Aware Scheduler (EAS), which leverages the power model of the heterogeneous processor. Figure 8 shows the results of this experiment.

Single Application

The results for single OpenMP benchmarks are similar to the Intel Raptor Lake experiment. The applications either perform similarly to the baseline or show improved energy consumption with longer execution times (e.g., ua). The is benchmark shows a significant increase in both execution time and energy consumption. E-Mapper picks a sub-optimal configuration for this application, leading to these negative effects. In contrast, lu, a long-running benchmark, benefits significantly from E-Mapper in both energy consumption and execution time.

KPN applications show significant improvements with E-Mapper. E-Mapper can successfully use the application knobs to tune the application to the available hardware resources, resulting in lower energy consumption for lms and improvements in both metrics for mandelbrot. The static version of mandelbrot behaves the same as the baseline. The static version of lms behaves similarly to ua and shows a lower energy consumption at the cost of a longer makespan. On average, E-Mapper improved execution time by a factor of 1.07 and energy consumption by a factor of 1.27 for single application scenarios.

Multiple Applications

When running multiple applications in parallel, E-Mapper significantly enhances overall system performance. The geometric mean improvement factor is 1.20 for execution time and 1.38 for energy consumption. Most scenarios show improvements, with only the combination of ep and ft suffering in execution time and energy consumption compared to the baseline.

6.5. Runtime Exploration of Operating Points

This section demonstrates E-Mapper’s behavior with runtime exploration of operating points on the Intel Raptor Lake Core i9-13900K across 15 single-application and 15 multi-application scenarios.

Each scenario runs 15 times, starting with no predefined operating points. Subsequent runs use data from prior runs to continue exploration. Applications remain in the initial stage until 6 operating points are measured, and then stay in the refinement stage until 25 operating points with at least 20 measurements each are gathered. The measurement period is 50 ms, and the first three measurements after each reconfiguration are discarded to allow applications to adjust. Figure 9 shows the results. The "Training" stage signifies ongoing exploration in at least one application, while the "Stable" stage indicates that all applications start from a stable stage.
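The stage thresholds used in this experiment can be summarized in a small classifier. This is a sketch under assumptions: the text does not say whether the 6 initial operating points need a minimum number of measurements, so any point with at least one measurement is counted here. The function name `maturity_stage` and the dictionary layout are hypothetical.

```python
def maturity_stage(measurement_counts, initial_points=6,
                   stable_points=25, min_samples=20):
    """Classify an application by how well its operating points are explored.

    measurement_counts maps an operating point (any hashable key) to the
    number of valid measurements collected for it so far.
    """
    counts = measurement_counts.values()
    measured = sum(1 for n in counts if n > 0)
    if measured < initial_points:
        return "initial"
    well_sampled = sum(1 for n in counts if n >= min_samples)
    if well_sampled < stable_points:
        return "refinement"
    return "stable"
```

As a rough lower bound, and assuming the three discarded measurements apply once per operating point, reaching the stable stage takes at least 25 × (20 + 3) × 50 ms = 28.75 s of exploration per application. This may help explain why short-running applications can need several executions to get there.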

Not all scenarios reach the stable stage within 15 executions. While all single-application scenarios reach the stable stage, three multi-application scenarios do not. These three scenarios involve short-running applications alongside longer-running ones. We suspect that the resource distribution between applications in the exploration stages and those in the stable stage is the issue: applications in the exploration stages do not get enough resources to progress to the stable stage. We believe that further refinement of the runtime exploration algorithm could reduce the likelihood of such situations.

During the training stages, performance and energy consumption fluctuate, and upon reaching the stable stage, the results stabilize, though E-Mapper continues to measure and refine operating points. Results are usually worse than those with offline-generated operating points, but for most single-application scenarios, online training gets close to the original results, indicating that online exploration effectively identifies optimal or near-optimal configurations. In multi-application scenarios, results are generally better than the CFS baseline. A notable exception is the binpack + fractal + parallel-preorder scenario, where runtime measurement and prediction result in poor execution time due to a bad estimation for the binpack benchmark. These negative results suggest that the runtime exploration heuristics could be further improved to find better configuration points.

7. Related Work

Optimizing the mapping of applications onto heterogeneous multi-core systems is a well-known problem in embedded systems (Singh et al., 2013b). The mapping approaches can be classified by the decision time into static mapping, runtime mapping, and hybrid application mapping techniques.

Static Mapping Approaches

Static mapping approaches are among the most widely studied approaches. In static mapping, resources are allocated prior to application execution, typically during compile or design time. This approach leverages static code analysis, often supplemented by execution traces, profiling data, and models of the target hardware, to determine the optimal resource allocation.

Various strategies are employed to find optimal static mappings. These include meta-heuristic techniques like evolutionary algorithms for design-space exploration (DSE) (Alexandrescu et al., 2011; Quan and Pimentel, 2014; Kang et al., 2012), as well as formulations of well-known problems, like ILP or SMT, which are solved using dedicated solvers (Malik et al., 2018). Additionally, simpler heuristics based on domain-specific knowledge are also used (Castrillon et al., 2012; Singh et al., 2013a; Brunet et al., 2013; Khasanov et al., 2021).

Since static mappings are specified at design time, they allow for more extensive calculations to find an optimal allocation than any runtime approach. This enables complex analysis methods and can produce binaries that are easy to deploy since they do not necessarily need any runtime support. However, static mapping approaches cannot adapt to dynamic scenarios, such as varying system loads or the presence of other concurrently executing applications.

Dynamic Mapping Approaches

On the other end of the spectrum lie dynamic mapping approaches. In these methods, the decision regarding resource allocation is deferred until runtime, where a simple heuristic typically determines the assignment. In most cases, the mapping decision is also combined with task scheduling within single processing elements (PEs) and embedded within the operating system (Brandenburg et al., 2008; Zhuravlev et al., 2012). As mentioned in Section 1, modern OS schedulers are evolving to better manage heterogeneous multi-core processors, aiming to improve energy efficiency. These schedulers employ techniques such as using CPU energy models (Perret, 2019) or leveraging hardware-assisted tools such as Intel’s Thread Director (Neri, 2023; Cutress, 2021; Chen, 2023; Saez and Prieto-Matias, 2022; Bilbao et al., 2023). However, dynamic approaches are inherently limited in their optimization space. While they are more adaptive, the applied heuristics must remain lightweight to ensure efficient execution. This requirement often restricts them to operating with limited information.

Hybrid Mapping Approaches

To address the limitations of static and dynamic mapping methods, a trend for hybrid application mapping (HAM) approaches has emerged (Pourmohseni et al., 2020). HAM strategies leverage extensive analyses from static methods at compile time while deferring final decision-making to runtime. This class of approaches ensures efficiency while retaining flexibility to adapt to dynamic system loads. During the design stage, hybrid approaches employ sophisticated DSE methodologies to characterize the impact of different resource allocations on the non-functional properties of the mapping. The outcome of DSE is a set of (possibly incomplete) mapping options, referred to as operating points, each characterized by the required amount and type of resources and the obtainable execution properties. Approaches such as (Massari et al., 2014; Singh et al., 2013a; Onnebrink et al., 2019) use heuristics for efficient exploration of Pareto-optimal configurations, while others (Ascia et al., 2004; Mariani et al., 2012; Weichslgartner et al., 2014; Quan and Pimentel, 2015) apply evolutionary algorithms. Other approaches also distinguish different application scenarios (Quan and Pimentel, 2015; Schranzhofer et al., 2010; Schor et al., 2012; Spieck et al., 2022).

At runtime, the resource managers utilize the generated operating points and partition resources among multiple, concurrently executed applications. Algorithms may map applications on the platform one at a time (Singh et al., 2013a; Weichslgartner et al., 2014), or in a joint manner by formulating the problem via a Multiple-choice Multidimensional Knapsack Problem (MMKP) (Martello and Toth, 1990), and using ILP solvers (Bini et al., 2011), or fast knapsack heuristics based on Pareto algebra principles (Shojaei et al., 2013), Lagrangian relaxation (Wildermann et al., 2014, 2015; Spieck et al., 2022), or greedy heuristics (Ykman-Couvreur et al., 2011). Similar approaches are also discussed for cluster scheduling in cloud environments, where the dynamic distribution of jobs in a heterogeneous cloud environment is solved using ILP solvers (Tumanov et al., 2016).

The aforementioned approaches generate spatial multi-application mappings, i.e., they only generate a mapping for the currently running set of applications and neither consider changes in the workload nor further optimize the execution. To address this, some works (Khasanov and Castrillon, 2020; Khasanov et al., 2021, 2024) proposed generating spatio-temporal mappings at runtime instead. However, it is important to note that while these approaches offer significant improvements in terms of energy efficiency and meeting real-time constraints, they often postpone the execution of an application in order to select a more efficient operating point later. Such a strategy, while suitable for real-time systems, is less ideal for regular desktop computers where users expect immediate application responsiveness.

Application Adaptivity

A common limitation in many of the mapping approaches discussed is their focus primarily on thread-to-core pinning, without fully exploiting the adaptability potential of applications. Beyond mere assignment of threads to processor cores, certain applications possess additional knobs that allow them to adjust their internal configurations dynamically at runtime. For instance, some applications have the capability to modify their topology, alter the parallelization degree of data-parallel regions, or switch between different internal algorithms based on runtime conditions (Khasanov et al., 2018; Schor et al., 2014). Another scaling approach involves distributing work within OpenMP applications in a heterogeneous-hardware-aware fashion (Saez et al., 2020). Despite these possibilities, modern operating systems often fail to leverage these adaptive features, missing opportunities to optimize application performance and resource utilization.

8. Conclusion

This paper introduces E-Mapper, a resource management system designed for energy-aware applications on heterogeneous processors. E-Mapper’s key innovations include a uniform interface to a global resource manager for passing high-level application descriptions and an allocation algorithm that balances performance and energy consumption. E-Mapper supports a broad spectrum of application models, from static to scalable types, and even custom-specific ones with unique adaptivity features like reconfiguration options and algorithmic variations. Crucially, E-Mapper does not require detailed knowledge of each application’s adaptivity aspects. Instead, it receives only the essential information needed for efficient resource allocation (resource requirements and non-functional characteristics), with the rest handled on the client side. This flexible design allows for various management scenarios.

Evaluations on two different heterogeneous systems showed that E-Mapper can significantly improve application execution, both when given exclusive access to resources and when competing with concurrent applications. We reported improvements in terms of execution time and energy consumption of 25% and 40% for the Intel Raptor Lake Core i9-13900K and 12% and 25% for the Odroid XU3-E. For embedded KPN applications with custom adaptivity knobs, we demonstrated E-Mapper’s extensibility in providing fine-grained resource adaptations. Finally, we showed that even in the absence of application descriptions, E-Mapper can efficiently manage unknown applications through its runtime exploration approach, which predicts application behavior based on few samples.

In summary, this work highlights the value of a simple yet effective interface to express application characteristics to the OS. By communicating operating points annotated with resource requirements and non-functional characteristics, E-Mapper considerably improves the system’s overall energy-awareness. We advocate for the development of a uniform application-resource-manager interface to enhance energy efficiency across diverse applications and systems.

Acknowledgement

We would like to thank Andrés Goens and Marcus Hähnel for early discussions, as well as Dylan Gageot, Fabius Mayer-Uhma, and Marc Dietrich for their contribution in supporting Kahn Process Networks and TensorFlow applications. In addition, we thank the anonymous reviewers from ASPLOS 24 for their valuable input and feedback, which led to the introduction of the runtime exploration component. The authors acknowledge the financial support by the Federal Ministry of Education and Research of Germany in the programme of “Souverän. Digital. Vernetzt.” (joint project 6G-life, project number 16KISK001K) and the E4C project (16ME0426K). This work also received funding from the EU Horizon Europe Programme under grant agreement No 101135183 (MYRTUS). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

References

dsu (2023) 2023. Arm DynamIQ Shared Unit 120 Technical Reference Manual. https://developer.arm.com/documentation/102547/0100. [Online; accessed 22-May-2024].
tfl (2023) 2023. TensorFlow Lite. https://www.tensorflow.org/lite. [Online; accessed 22-May-2024].
Alexandrescu et al. (2011) Adrian Alexandrescu, Ioan Agavriloaei, and Mitică Craus. 2011. A Genetic Algorithm for mapping tasks in heterogeneous computing systems. In 15th International Conference on System Theory, Control and Computing. 1–6.
Apple (2020) Apple. 2020. Apple unleashes M1. https://www.apple.com/newsroom/2020/11/apple-unleashes-m1/. [Online; accessed 22-May-2024].
Ascia et al. (2004) Giuseppe Ascia, Vincenzo Catania, and Maurizio Palesi. 2004. Multi-Objective Mapping for Mesh-Based NoC Architectures. In Proceedings of the 2nd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (Stockholm, Sweden) (CODES+ISSS ’04). Association for Computing Machinery, New York, NY, USA, 182–187. https://doi.org/10.1145/1016720.1016765
Bailey et al. (1991) D.H. Bailey, E. Barszcz, J.T. Barton, D.S. Browning, R.L. Carter, L. Dagum, R.A. Fatoohi, P.O. Frederickson, T.A. Lasinski, R.S. Schreiber, H.D. Simon, V. Venkatakrishnan, and S.K. Weeratunga. 1991. The Nas Parallel Benchmarks. Int. J. High Perform. Comput. Appl. 5, 3 (sep 1991), 63–73. https://doi.org/10.1177/109434209100500306
Bilbao et al. (2023) Carlos Bilbao, Juan Carlos Saez, and Manuel Prieto-Matias. 2023. Flexible system software scheduling for asymmetric multicore systems with PMCSched: A case for Intel Alder Lake. Concurrency and Computation: Practice and Experience 35, 25 (2023), e7814.
Bini et al. (2011) Enrico Bini, Giorgio Buttazzo, Johan Eker, Stefan Schorr, Raphael Guerra, Gerhard Fohler, Karl-Erik Arzen, Vanessa Romero, and Claudio Scordino. 2011. Resource Management on Multicore Systems: The ACTORS Approach. IEEE Micro 31, 3 (2011), 72–81. https://doi.org/10.1109/MM.2011.1
Brandenburg et al. (2008) Björn B. Brandenburg, John M. Calandrino, and James H. Anderson. 2008. On the Scalability of Real-Time Scheduling Algorithms on Multicore Platforms: A Case Study. In 2008 Real-Time Systems Symposium. 157–169. https://doi.org/10.1109/RTSS.2008.23
Branover et al. (2021) Alexander J. Branover, Benjamin Tsien, and Elliot H. Mednick. 2021. Method of Task Transition between Heterogenous Processor. https://www.freepatentsonline.com/y2021/0173715.html
Brunet et al. (2013) Simone Casale Brunet, Marco Mattavelli, and Jorn W. Janneck. 2013. Buffer optimization based on critical path analysis of a dataflow program design. In 2013 IEEE International Symposium on Circuits and Systems (ISCAS). 1384–1387. https://doi.org/10.1109/ISCAS.2013.6572113
Castrillon et al. (2012) Jeronimo Castrillon, Andreas Tretter, Rainer Leupers, and Gerd Ascheid. 2012. Communication-aware mapping of KPN applications onto heterogeneous MPSoCs. In DAC Design Automation Conference 2012. 1262–1267. https://doi.org/10.1145/2228360.2228597
Chen (2023) Tim Chen. 2023. Enable Cluster Scheduling for x86 Hybrid CPUs. https://lore.kernel.org/lkml/cover.1688770494.git.tim.c.chen@linux.intel.com/. [Online; accessed 22-May-2024].
Coello Coello and Reyes Sierra (2004) Carlos A. Coello Coello and Margarita Reyes Sierra. 2004. A Study of the Parallelization of a Coevolutionary Multi-objective Evolutionary Algorithm. In MICAI 2004: Advances in Artificial Intelligence, Raúl Monroy, Gustavo Arroyo-Figueroa, Luis Enrique Sucar, and Humberto Sossa (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 688–697.
Colmant et al. ([n. d.]) Maxime Colmant, Mascha Kurpicz, Pascal Felber, Loïc Huertas, Romain Rouvoy, and Anita Sobe. [n. d.]. Process-Level Power Estimation in VM-based Systems. In Proceedings of the Tenth European Conference on Computer Systems (Bordeaux, France, 2015-04) (EuroSys ’15). ACM. https://doi.org/10.1145/2741948.2741971
Community (2024) Linux Kernel Community. 2024. Energy Aware Scheduling. https://docs.kernel.org/scheduler/sched-energy.html. [Online; accessed 22-May-2024].
Cutress (2021) Ian Cutress. 2021. Thread Director: Windows 11 Does It Best. https://www.anandtech.com/show/16959/intel-innovation-alder-lake-november-4th/3. [Online; accessed 22-May-2024].
Cutress and Frumusanu (2021) Ian Cutress and Andrei Frumusanu. 2021. The Intel 12th Gen Core i9-12900K Review: Hybrid Performance Brings Hybrid Complexity. https://www.anandtech.com/show/17047/the-intel-12th-gen-core-i912900k-review-hybrid-performance-brings-hybrid-complexity/2. [Online; accessed 22-May-2024].
Developers (2022) TensorFlow Developers. 2022. TensorFlow. Zenodo (2022).
Greenhalgh (2011) Peter Greenhalgh. 2011. big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7. ARM White paper 17 (2011). https://www.eetimes.com/big-little-processing-with-arm-cortex-a15-cortex-a7/. [Online; accessed 22-May-2024].
Hackenberg et al. ([n. d.]) Daniel Hackenberg, Robert Schöne, Thomas Ilsche, Daniel Molka, Joseph Schuchart, and Robin Geyer. [n. d.]. An Energy Efficiency Feature Survey of the Intel Haswell Processor (IPDPSW ’15). IEEE, 896–904. https://doi.org/10.1109/IPDPSW.2015.70
Hähnel and Härtig (2014) Marcus Hähnel and Hermann Härtig. 2014. Heterogeneity by the Numbers: A Study of the ODROID XU+E big.LITTLE Platform. In 6th Workshop on Power-Aware Computing and Systems (HotPower 14). USENIX Association, Broomfield, CO. https://www.usenix.org/conference/hotpower14/workshop-program/presentation/hahnel
Hähnel and Smejkal (2018) Marcus Hähnel and Till Smejkal. 2018. Modular Energy Modeling Using Energy/Utility. In Companion of the 2018 ACM/SPEC International Conference on Performance Engineering (Berlin, Germany) (ICPE ’18). Association for Computing Machinery, New York, NY, USA, 73–78. https://doi.org/10.1145/3185768.3186311
Hè et al. (2023) Hongyu Hè, Michal Friedman, and Theodoros Rekatsinas. 2023. EnergAt: Fine-Grained Energy Attribution for Multi-Tenancy. In Proceedings of the 2nd Workshop on Sustainable Computer Systems (Boston, MA, USA) (HotCarbon ’23). Association for Computing Machinery, New York, NY, USA, Article 4, 8 pages. https://doi.org/10.1145/3604930.3605716
Hähnel et al. ([n. d.]) Marcus Hähnel, Björn Döbel, Marcus Völp, and Hermann Härtig. [n. d.]. Measuring Energy Consumption for Short Code Paths Using RAPL. 40, 3 ([n. d.]), 13–17. https://doi.org/10.1145/2425248.2425252
Intel (2021) Intel. 2021. Optimizing Software for x86 Hybrid Architecture. Intel White Paper (2021).
Intel (2022) Intel. 2022. Intel Unveils 12th Gen Intel Core, Launches World’s Best Gaming Processor, i9-12900K. https://www.intel.com/content/www/us/en/newsroom/news/12th-gen-core-processors.html. [Online; accessed 22-May-2024].
Kang et al. (2012) Shin-Haeng Kang, Hoeseok Yang, Lars Schor, Iuliana Bacivarov, Soonhoi Ha, and Lothar Thiele. 2012. Multi-objective mapping optimization via problem decomposition for many-core systems. In 2012 IEEE 10th Symposium on Embedded Systems for Real-time Multimedia. 28–37. https://doi.org/10.1109/ESTIMedia.2012.6507026
Khasanov and Castrillon (2020) Robert Khasanov and Jeronimo Castrillon. 2020. Energy-efficient Runtime Resource Management for Adaptable Multi-application Mapping. In Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE) (Grenoble, France) (DATE ’20). IEEE, 909–914. https://doi.org/10.23919/DATE48585.2020.9116381
Khasanov et al. (2024) Robert Khasanov, Marc Dietrich, and Jeronimo Castrillon. 2024. Flexible Spatio-Temporal Energy-Efficient Runtime Management. In 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC). 777–784. https://doi.org/10.1109/ASP-DAC58780.2024.10473885
Khasanov et al. (2018) Robert Khasanov, Andrés Goens, and Jeronimo Castrillon. 2018. Implicit Data-Parallelism in Kahn Process Networks: Bridging the MacQueen Gap. In Proceedings of the 9th Workshop and 7th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (Manchester, United Kingdom) (PARMA-DITAM ’18). Association for Computing Machinery, New York, NY, USA, 20–25. https://doi.org/10.1145/3183767.3183790
Khasanov et al. (2021) Robert Khasanov, Julian Robledo, Christian Menard, Andrés Goens, and Jeronimo Castrillon. 2021. Domain-specific hybrid mapping for energy-efficient baseband processing in wireless networks. ACM Transactions on Embedded Computing Systems (TECS). Special issue of the International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES) 20, 5s, Article 60 (2021), 26 pages. https://doi.org/10.1145/3476991
Krizhevsky et al. (2017) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2017. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 6 (2017), 84–90.
Limited (2013) Arm Limited. 2013. big.LITTLE Technology: The Future of Mobile. https://armkeil.blob.core.windows.net/developer/Files/pdf/white-paper/big-little-technology-the-future-of-mobile.pdf. Arm Limited, White Paper (2013), 12. [Online; accessed 22-May-2024].
Liu (2023) Zhiye Liu. 2023. AMD Phoenix 2 Review Evaluates Zen 4, Zen 4c Performance. https://www.tomshardware.com/news/amd-phoenix-2-review-evaluates-zen-4-zen-4c-performance. [Online; accessed 22-May-2024].
Malik et al. (2018) Avinash Malik, Cameron Walker, Michael O’Sullivan, and Oliver Sinnen. 2018. Satisfiability modulo theory (SMT) formulation for optimal scheduling of task graphs with communication delay. Computers & Operations Research 89 (2018), 113–126. https://doi.org/10.1016/j.cor.2017.08.012
Mariani et al. (2012) Giovanni Mariani, Vlad-Mihai Sima, Gianluca Palermo, Vittorio Zaccaria, Cristina Silvano, and Koen Bertels. 2012. Using multi-objective design space exploration to enable run-time resource management for reconfigurable architectures. In 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE). 1379–1384. https://doi.org/10.1109/DATE.2012.6176578
Martello and Toth (1990) Silvano Martello and Paolo Toth. 1990. Knapsack Problems: Algorithms and Computer Implementations. John Wiley & Sons, Inc., New York, NY, USA.
Martin (2001) Alain J Martin. 2001. Towards an energy complexity of computation. Inform. Process. Lett. 77, 2-4 (2001), 181–187.
Massari et al. (2014) Giuseppe Massari, Edoardo Paone, Patrick Bellasi, Gianluca Palermo, Vittorio Zaccaria, William Fornaciari, and Cristina Silvano. 2014. Combining application adaptivity and system-wide Resource Management on multi-core platforms. In 2014 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV). 26–33. https://doi.org/10.1109/SAMOS.2014.6893191
McGrew et al. (2019) David McGrew, Michael Curcio, and Scott Fluhrer. 2019. Leighton-Micali Hash-Based Signatures. RFC 8554. https://doi.org/10.17487/RFC8554
Neri (2023) Ricardo Neri. 2023. Introduce classes of tasks for load balance. https://lore.kernel.org/lkml/20230613042422.5344-1-ricardo.neri-calderon@linux.intel.com/. [Online; accessed 22-May-2024].
Onnebrink et al. (2019) Gereon Onnebrink, Ahmed Hallawa, Rainer Leupers, Gerd Ascheid, and Awaid-Ud-Din Shaheen. 2019. A Heuristic for Multi Objective Software Application Mappings on Heterogeneous MPSoCs. In Proceedings of the 24th Asia and South Pacific Design Automation Conference (Tokyo, Japan) (ASPDAC ’19). Association for Computing Machinery, New York, NY, USA, 609–614. https://doi.org/10.1145/3287624.3287651
Pénzes and Martin (2002) Paul I Pénzes and Alain J. Martin. 2002. Energy-delay efficiency of VLSI computations. In Proceedings of the 12th ACM Great Lakes Symposium on VLSI (New York, New York, USA) (GLSVLSI ’02). Association for Computing Machinery, New York, NY, USA, 104–111. https://doi.org/10.1145/505306.505330
Perret (2019) Quentin Perret. 2019. Energy Aware Scheduling (EAS) in Linux 5.0. https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/energy-aware-scheduling-in-linux. [Online; accessed 22-May-2024].
Pheatt (2008) Chuck Pheatt. 2008. Intel® threading building blocks. Journal of Computing Sciences in Colleges 23, 4 (2008), 298–298.
Pourmohseni et al. (2020) Behnaz Pourmohseni, Michael Glaß, Jörg Henkel, Heba Khdr, Martin Rapp, Valentina Richthammer, Tobias Schwarzer, Fedor Smirnov, Jan Spieck, Jürgen Teich, Andreas Weichslgartner, and Stefan Wildermann. 2020. Hybrid Application Mapping for Composable Many-Core Systems: Overview and Future Perspective. Journal of Low Power Electronics and Applications 10, 4 (2020). https://doi.org/10.3390/jlpea10040038
Puchinger et al. (2010) Jakob Puchinger, Günther R. Raidl, and Ulrich Pferschy. 2010. The Multidimensional Knapsack Problem: Structure and Algorithms. INFORMS J. on Computing 22, 2 (apr 2010), 250–265. https://doi.org/10.1287/ijoc.1090.0344
Quan and Pimentel (2014) Wei Quan and Andy D. Pimentel. 2014. Towards Exploring Vast MPSoC Mapping Design Spaces Using a Bias-Elitist Evolutionary Approach. In 2014 17th Euromicro Conference on Digital System Design. 655–658. https://doi.org/10.1109/DSD.2014.46
Quan and Pimentel (2015) Wei Quan and Andy D. Pimentel. 2015. A Hybrid Task Mapping Algorithm for Heterogeneous MPSoCs. ACM Trans. Embed. Comput. Syst. 14, 1, Article 14 (jan 2015), 25 pages. https://doi.org/10.1145/2680542
Saez et al. (2020) Juan Carlos Saez, Fernando Castro, and Manuel Prieto-Matias. 2020. Enabling performance portability of data-parallel OpenMP applications on asymmetric multicore processors. In Proceedings of the 49th International Conference on Parallel Processing (Edmonton, AB, Canada) (ICPP ’20). Association for Computing Machinery, New York, NY, USA, Article 51, 11 pages. https://doi.org/10.1145/3404397.3404441
Saez and Prieto-Matias (2022) Juan Carlos Saez and Manuel Prieto-Matias. 2022. Evaluation of the Intel Thread Director Technology on an Alder Lake Processor. In Proceedings of the 13th ACM SIGOPS Asia-Pacific Workshop on Systems. 61–67.
Schor et al. (2012) Lars Schor, Iuliana Bacivarov, Devendra Rai, Hoeseok Yang, Shin-Haeng Kang, and Lothar Thiele. 2012. Scenario-Based Design Flow for Mapping Streaming Applications onto on-Chip Many-Core Systems. In Proceedings of the 2012 International Conference on Compilers, Architectures and Synthesis for Embedded Systems (Tampere, Finland) (CASES ’12). Association for Computing Machinery, New York, NY, USA, 71–80. https://doi.org/10.1145/2380403.2380422
Schor et al. (2014) Lars Schor, Iuliana Bacivarov, Hoeseok Yang, and Lothar Thiele. 2014. AdaPNet: Adapting process networks in response to resource variations. In 2014 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES). 1–10. https://doi.org/10.1145/2656106.2656112
Schranzhofer et al. (2010) Andreas Schranzhofer, Jian-Jian Chen, and Lothar Thiele. 2010. Dynamic Power-Aware Mapping of Applications onto Heterogeneous MPSoC Platforms. IEEE Transactions on Industrial Informatics 6, 4 (2010), 692–707. https://doi.org/10.1109/TII.2010.2062192
Schöne et al. ([n. d.]) Robert Schöne, Thomas Ilsche, Mario Bielert, Andreas Gocht, and Daniel Hackenberg. [n. d.]. Energy Efficiency Features of the Intel Skylake-SP Processor and Their Impact on Performance. In International Conference on High Performance Computing & Simulation (Dublin, Ireland, 2019-07) (HPCS ’19). IEEE, 399–406. https://doi.org/10.1109/HPCS48598.2019.9188239
Shojaei et al. (2013) Hamid Shojaei, Twan Basten, Marc Geilen, and Azadeh Davoodi. 2013. A Fast and Scalable Multidimensional Multiple-Choice Knapsack Heuristic. ACM Trans. Des. Autom. Electron. Syst. 18, 4, Article 51 (oct 2013), 32 pages. https://doi.org/10.1145/2541012.2541014
Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
Singh et al. (2013a) Amit Kumar Singh, Akash Kumar, and Thambipillai Srikanthan. 2013a. Accelerating Throughput-Aware Runtime Mapping for Heterogeneous MPSoCs. ACM Trans. Des. Autom. Electron. Syst. 18, 1, Article 9 (jan 2013), 29 pages. https://doi.org/10.1145/2390191.2390200
Singh et al. (2013b) Amit Kumar Singh, Muhammad Shafique, Akash Kumar, and Jörg Henkel. 2013b. Mapping on Multi/Many-Core Systems: Survey of Current and Emerging Trends. In Proceedings of the 50th Annual Design Automation Conference (Austin, Texas) (DAC ’13). Association for Computing Machinery, New York, NY, USA, Article 1, 10 pages. https://doi.org/10.1145/2463209.2488734
Smejkal et al. (2017) Till Smejkal, Marcus Hähnel, Thomas Ilsche, Michael Roitzsch, Wolfgang E. Nagel, and Hermann Härtig. 2017. E-Team: Practical Energy Accounting for Multi-Core Systems. In 2017 USENIX Annual Technical Conference (USENIX ATC 17). USENIX Association, Santa Clara, CA, 589–601. https://www.usenix.org/conference/atc17/technical-sessions/presentation/smejkal
Spieck et al. (2022) Jan Spieck, Stefan Wildermann, and Jürgen Teich. 2022. A Learning-Based Methodology for Scenario-Aware Mapping of Soft Real-Time Applications onto Heterogeneous MPSoCs. ACM Trans. Des. Autom. Electron. Syst. 28, 1, Article 4 (dec 2022), 40 pages. https://doi.org/10.1145/3529230
Tumanov et al. (2016) Alexey Tumanov, Timothy Zhu, Jun Woo Park, Michael A. Kozuch, Mor Harchol-Balter, and Gregory R. Ganger. 2016. TetriSched: Global Rescheduling with Adaptive Plan-Ahead in Dynamic Heterogeneous Clusters. In Proceedings of the Eleventh European Conference on Computer Systems (London, United Kingdom) (EuroSys ’16). Association for Computing Machinery, New York, NY, USA, Article 35, 16 pages. https://doi.org/10.1145/2901318.2901355
Wathan (2017) Govind Wathan. 2017. Arm DynamIQ: Technology for the next era of compute. https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-dynamiq-technology-for-the-next-era-of-compute. [Online; accessed 22-May-2024].
Weichslgartner et al. (2014) Andreas Weichslgartner, Deepak Gangadharan, Stefan Wildermann, Michael Glaß, and Jürgen Teich. 2014. DAARM: Design-time application analysis and run-time mapping for predictable execution in many-core systems. In 2014 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS). 1–10. https://doi.org/10.1145/2656075.2656083
Weichslgartner et al. (2018) Andreas Weichslgartner, Stefan Wildermann, Deepak Gangadharan, Michael Glaß, and Jürgen Teich. 2018. A Design-Time/Run-Time Application Mapping Methodology for Predictable Execution Time in MPSoCs. ACM Trans. Embed. Comput. Syst. 17, 5, Article 89 (2018), 25 pages. https://doi.org/10.1145/3274665
Wildermann et al. (2014) Stefan Wildermann, Michael Glaß, and Jürgen Teich. 2014. Multi-objective Distributed Run-time Resource Management for Many-cores. In Proceedings of DATE (Dresden, Germany). Article 221, 6 pages.
Wildermann et al. (2015) Stefan Wildermann, Andreas Weichslgartner, and Jürgen Teich. 2015. Design Methodology and Run-Time Management for Predictable Many-Core Systems. In 2015 IEEE International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing Workshops. 103–110. https://doi.org/10.1109/ISORCW.2015.48
Ykman-Couvreur et al. (2011) Chantal Ykman-Couvreur, Vincent Nollet, Francky Catthoor, and Henk Corporaal. 2011. Fast multidimension multichoice knapsack heuristic for MP-SoC runtime management. ACM Transactions on Embedded Computing Systems (TECS) 10, 3, Article 35 (may 2011), 16 pages. https://doi.org/10.1145/1952522.1952528
Zhuravlev et al. (2012) Sergey Zhuravlev, Juan Carlos Saez, Sergey Blagodurov, Alexandra Fedorova, and Manuel Prieto. 2012. Survey of Scheduling Techniques for Addressing Shared Resources in Multicore Processors. ACM Comput. Surv. 45, 1, Article 4 (dec 2012), 28 pages. https://doi.org/10.1145/2379776.2379780