In this section, we demonstrate the proposed evaluation framework. First, we discuss the experimental setup. Subsequently, we demonstrate the performance of the benchmark generator, the temperature emulator and the DVFS emulator, followed by the performance of the complete evaluation approach. Finally, we present the suitability of the approach in a case study.
7.1 Experimental Setup
For the following demonstrations, we integrate the evaluation approach into a tiled many-core processor, illustrated in Figure
4. The processor is implemented on a proFPGA platform [
13] consisting of four Virtex-7 FPGAs. The processor consists of 16 tiles with five cores per tile, yielding an 80-core processor. Each core has a dedicated 8 kB L1 instruction cache and an 8 kB L1 data cache. Furthermore, all cores on a tile share a 512 kB L2 cache for remote tile memory accesses and an 8 MB tile-local memory. While the caches are implemented as
block RAM (bRAM) on the FPGA, all tile-local memories are physically located on an SRAM extension board, where each tile is mapped to a different bank. The FPGA platform runs at 50 MHz and emulates an ASIC target frequency of 4 GHz. For the ASIC emulation, we generate the temperature model using the default thermal chip characteristics provided by the state-of-the-art thermal simulator HotSpot. The power emulator is based on the concepts of Listl et al. [
25]. For this, we synthesize a LEON3 processor and run gate-level simulations to characterize the switching activity of the processor for each instruction individually. This information is then used by the Synopsys tool PrimePower to simulate the power consumption of the processor based on its system status. In the power emulator, we store the simulation results in LUTs. Based on the current status of the cores, the respective power consumption is then loaded and scaled according to the selected
\( V/f \) level.
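The LUT-based lookup-and-scale step described above can be sketched as follows. This is a minimal illustration, not our implementation: the instruction classes, reference corner, and power values are hypothetical, and we assume the common first-order dynamic-power scaling with \( V^2 f \).

```python
# Sketch of LUT-based power emulation: per-instruction-class power is
# pre-characterized at a reference V/f corner and scaled to the active
# V/f level (dynamic power scales roughly with V^2 * f).
# All numeric values below are hypothetical.

REF_V, REF_F = 1.0, 4.0e9  # assumed reference corner used during characterization

# Hypothetical characterization results (watts per instruction class).
POWER_LUT = {
    "int_alu": 0.80,
    "fp_mul": 1.35,
    "load_store": 1.10,
    "idle": 0.15,
}

def emulated_power(instr_class: str, v: float, f: float) -> float:
    """Look up the characterized power and scale it to the selected V/f level."""
    base = POWER_LUT[instr_class]
    return base * (v / REF_V) ** 2 * (f / REF_F)

# A core running integer code at a throttled 0.8 V / 2 GHz corner:
p = emulated_power("int_alu", 0.8, 2.0e9)  # 0.80 * 0.64 * 0.5 = 0.256 W
```

In hardware, the division by the reference corner is folded into the stored LUT entries, so the emulator only performs a lookup and one multiplication per update.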
The monitoring aspect of the evaluation approach extends the monitoring architecture of Mettler et al. [
30]. The architecture is composed of three types of components: a set of probes, a set of tile monitors, and a tracing interconnect. A
probe is assigned to each core in the system. It extracts events from the trace data provided by the core, the power monitor, and the temperature monitors to reduce the data volume. For instance, it is possible to detect events based on the program counter address of the processor. Thus, a probe can inform a monitor about the start or the end of an executed task. Furthermore, the probes support defining events on power and temperature ranges such that the violation of a power corridor or a temperature threshold can be detected. The detected events are then sorted and distributed by the
tracing interconnect to all tile monitors. As a result, all tile monitors have a consistent view of a globally sorted trace of events. The
tile monitors support temporal and logical supervision. Using the temporal supervision, it is not only possible to non-intrusively evaluate the execution time of an application but also to measure the time a core violates its predefined power corridor or thermal threshold. Furthermore, the logical supervision can be used to verify the control flow of the application or the behavior of the resource manager. Thus, the monitoring architecture can not only be used to collect run-time data of the system but also to verify the implementation of the management strategy simultaneously.
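The event extraction performed by a probe can be summarized in a few lines. The sketch below is purely illustrative: the task addresses, the power corridor, the thermal threshold, and the event names are all hypothetical placeholders for the configuration a probe would hold.

```python
# Illustrative probe logic: reduce raw trace data (PC, power, temperature)
# to a small set of events for the tile monitors.
# All addresses, thresholds, and event names are hypothetical.

TASK_START_PC = 0x40001000   # assumed entry address of a monitored task
TASK_END_PC   = 0x40001FFC   # assumed exit address

POWER_CORRIDOR = (0.2, 1.5)  # watts (assumed corridor bounds)
TEMP_THRESHOLD = 80.0        # degrees Celsius (safe operating temperature)

def probe_events(pc: int, power: float, temp: float) -> list[str]:
    """Detect events on the current trace sample; most samples yield none."""
    events = []
    if pc == TASK_START_PC:
        events.append("TASK_START")
    if pc == TASK_END_PC:
        events.append("TASK_END")
    lo, hi = POWER_CORRIDOR
    if not (lo <= power <= hi):
        events.append("POWER_CORRIDOR_VIOLATION")
    if temp > TEMP_THRESHOLD:
        events.append("THERMAL_VIOLATION")
    return events
```

Because only these rare events, not the raw per-cycle trace, are forwarded, the tracing interconnect can sort and broadcast them to all tile monitors without becoming a bottleneck.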
7.2 Benchmark Generation
For an application-independent analysis of a resource or thermal management strategy, it is important to evaluate its performance on benchmarks with various execution behaviors. Therefore, we evaluate the run-time characteristics of the synthetic benchmarks generated by the REX process using a depth parameter \( D \in \lbrace 1,2\rbrace \), a breadth parameter \( B \in \lbrace 2,3\rbrace \), a floating-point probability parameter \( P_{fp} \in \lbrace 0,0.25,\ldots ,1\rbrace \), and a memory size parameter \( N_M \in \lbrace 256\;B, 2048\;B, 32768\;B\rbrace \). We generate 50 benchmarks for each of the 60 combinations of the input parameters, such that we evaluate the run-time characteristics of \( 3{,}000 \) benchmarks.
The instruction cache hold rate and the data cache hold rate of the generated benchmarks are illustrated in Figures
5(a) and
5(b) over the execution time. The point cloud of the instruction cache hold rate decreases up to a run-time of \( 10\;\mu\text{s} \). This behavior can be explained by the low number of instructions of such short benchmarks. In contrast, benchmarks with a higher run-time show varying instruction cache hold rates between 100 and 600
\( \frac{1}{\mu s} \). In exceptional cases, the instruction cache hold rate reaches a value of up to 1,600
\( \frac{1}{\mu s} \). In contrast, the data cache hold rate of the benchmarks ranges between 300 and 1,500
\( \frac{1}{\mu s} \) for most execution times. However, it is noticeable that short benchmarks tend to have a higher data cache hold rate than long benchmarks. This behavior can be attributed to the fact that the data cache first needs to be filled before the locality of the data can be exploited. Additionally, benchmarks with a long execution time contain many loop statements that are more likely to operate on local data. The integer and the floating-point instruction rate are illustrated in Figures
5(c) and
5(d), respectively. The integer instruction rate varies between 300 and 1,400
\( \frac{1}{\mu s} \). Furthermore, an increase over the execution time is visible. This behavior matches the decreasing cache hold rates, as a lower cache hold rate enables a higher instruction count. In contrast to that, the floating-point rate varies uniformly over all execution times between 0 and 200
\( \frac{1}{\mu s} \). Overall, the generated benchmarks show a great diversity in memory access and compute intensity.
Finally, we demonstrate that the generated benchmarks cover a wide range of run-time characteristics by mimicking the behavior of real workloads. Therefore, we choose an object detection algorithm, whose actor graph is illustrated in Figure
6(a). All actors run in parallel on different tiles of the many-core processor and forward the data of the input images through the object detection pipeline. The application is implemented using the ActorX10 library [
41] of the X10 programming language [
46]. As the language implements the
asynchronous partitioned global address space (APGAS) programming model, it is especially suited for many-core processors. To mimic the behavior of the application, we first characterize each actor independently using the statistics unit of the LEON cores. The run-time characteristics can then be used to identify the best-suited benchmark for each actor. As X10 is a managed programming language, the task characteristics may change when running several tasks together on a single tile, which would lead to a mismatch compared to the benchmark that tries to mimic this application. This can be remedied by post-tuning the benchmark. Thus, it is possible to iteratively match the run-time characteristics of the benchmarks with the characteristics of the real applications, even in the presence of contention. In Table
6(b), we compare the execution time
\( t_{exe} \) , the performance vector
\( C_t \) and the maximal temperature
\( T_{max} \) of each actor of the real application with the corresponding actor of the emulated application. The maximal temperature of the emulated actors differs by less than 2 °C on average. Furthermore, the table shows that the accuracy of the thermal behavior depends on the accuracy of the performance vectors. For example, the performance vector of the emulated
corner detection (CD) actor matches well with the actor of the real application and thus, also the maximal temperatures match well with each other. However, the performance vector of the emulated
SIFT Orientation (SO) actor especially differs in the integer instruction rate
\( c_{int} \) from the real actor and thus, the maximal temperatures also differ noticeably. As a result, it is expected that the accuracy of the thermal behavior further improves with the number of generated benchmarks, which increases the pool of potential candidates to match the behavior of an actor. In summary, the results show that the generated benchmarks cover the run-time behavior of the object detection chain and additionally allow one to generate a large variety of run-time behaviors for the evaluation of run-time management strategies.
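The selection of the best-suited benchmark for an actor can be viewed as a nearest-neighbor search over performance vectors. The sketch below illustrates this with a Euclidean distance; the vector components (cache hold rates and instruction rates, all in \( \frac{1}{\mu s} \)) and the concrete numbers are hypothetical, and the actual matching metric may differ.

```python
import math

# Hypothetical performance vectors: (i-cache hold rate, d-cache hold rate,
# integer rate, floating-point rate), all in 1/us. Given an actor's
# characterization, pick the generated benchmark with the closest vector.

def distance(a, b):
    """Euclidean distance between two performance vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_match(actor_vec, benchmark_vecs):
    """Index of the benchmark whose characteristics best mimic the actor."""
    return min(range(len(benchmark_vecs)),
               key=lambda i: distance(actor_vec, benchmark_vecs[i]))

actor = (450.0, 900.0, 800.0, 120.0)   # assumed actor characterization
candidates = [
    (300.0, 1200.0, 600.0, 10.0),
    (460.0, 880.0, 790.0, 110.0),      # closest to the actor above
    (1500.0, 400.0, 1300.0, 0.0),
]
idx = best_match(actor, candidates)
```

A larger pool of generated benchmarks shrinks the expected distance to the nearest candidate, which is why the matching accuracy is expected to improve with the number of benchmarks.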
7.3 ASIC Temperature Emulation
The key performance indicators of an ASIC temperature emulation approach are its scalability, its accuracy and its hardware overhead. First, we compare the scalability of our
distributed temperature emulation (DTE) model with the scalability of the numerical solutions of the RC-thermal network by the
Runge-Kutta method (RK4) and the
time-invariant linear thermal system (TILTS). Both numerical methods have been used by Alam et al. [
2] to implement a temperature emulator on an FPGA prototype. The scalability comparison between the different methods is illustrated in Figure
7(a) for the number of model parameters, and in Figure
7(b) for the number of multiplications per iteration. Since only our approach scales linearly with the number of processors in terms of model parameters and multiplications, we outperform the other approaches by more than an order of magnitude on many-core processors. This comparison demonstrates the strengths of the decentralized emulation approach.
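The asymptotic difference can be made concrete with a small counting sketch. We assume, for illustration only, that a centralized solution of the RC thermal network (e.g., TILTS) multiplies a dense \( n \times n \) state matrix per iteration, while each node of the distributed model is coupled to a fixed number \( k \) of neighbors; the value \( k = 5 \) is a placeholder.

```python
# Rough scaling comparison: centralized dense matrix-vector update vs.
# distributed per-node update with a fixed neighbor count k (assumed k = 5).

def centralized_mults(n_nodes: int) -> int:
    """O(n^2): every node's temperature depends on every other node."""
    return n_nodes * n_nodes

def distributed_mults(n_nodes: int, k: int = 5) -> int:
    """O(k*n): each node only accumulates k neighbor contributions."""
    return k * n_nodes

# For an 80-node model the gap is already a factor of 16 under these
# assumptions, and it grows linearly with the node count.
ratio = centralized_mults(80) / distributed_mults(80)
```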
In addition to the scalability, the accuracy of the emulation approach is also important. Therefore, we compare the emulated temperature of our decentralized approach with the numerical solutions of the temperature model on an 80-core processor over
\( 1{,}000{,}000 \) iterations, corresponding to an execution time of
\( 64\;\text{ms} \) . The maximal emulation error, measured within intervals of
\( 2\;\mu \text{s} \), is illustrated in Figure
8(a). With a maximal emulation error below 0.04 °C, the accuracy of the emulation approach is more than sufficient. Furthermore, the histogram shows a clear maximum at an emulation error of 0 °C, which is desirable as well. Additionally, we verify the accuracy of the 32-bit fixed-point implementation of the temperature model in Figure
8(b). Here, the histogram shows a comparable behavior such that we can conclude that the emulation accuracy of the FPGA implementation is sufficient as well. Additionally, we evaluate the emulation accuracy of four architectures with various numbers of thermal nodes against HotSpot in Table
1. For each of the architectures, the average emulation error is below 0.03 °C. Most architectures also achieve a maximal emulation error of less than 0.05 °C. Even the maximal emulation error of the heterogeneous many-core processor, consisting of various core types and accelerators, is still acceptable at 0.55 °C. Finally, we evaluate the hardware overhead of our temperature emulator in Table
2. Each thermal node requires a 32-bit MAC unit to compute the local temperatures. This unit has been synthesized into four
digital signal processing blocks (DSPs) on the FPGA. A comparison with the hardware overhead of a 22-bit fixed-point implementation of
TILTS Thermal Emulation (TTE) IP for 16 thermal nodes shows that TTE requires 10 times fewer DSPs. However, this comparison does not reflect the emulation latency. While the centralized approach requires 3,496 cycles to compute the temperature per iteration, our approach requires only 84 cycles. Especially for many-core processors, like our evaluation platform, the 2-MAC design of TTE would already require 178,584 cycles per iteration, while our decentralized approach still requires only 84 cycles. Furthermore, the additional effort to route the power signals from the respective thermal node to the TTE IP and the effort to route the temperature signals back is impracticably large. Large-scale FPGA prototypes span multiple FPGAs. Thus, power and temperature signals not only need to be pipelined to meet the timing requirements, but many IO resources are also needed to send the signals to the IP, which can only be located on one FPGA. As a result, especially on large-scale prototypes, a decentralized temperature emulation approach is inevitable to minimize the design effort and to save hardware resources.
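The per-node work of the distributed emulation is a single multiply-accumulate chain, which the following sketch illustrates for one discretized update step. All coefficients, the neighbor count, and the ambient temperature are invented for illustration; the real coefficients are derived from the RC thermal network.

```python
# One iteration of a hypothetical discretized distributed thermal update:
# the node's next temperature is a weighted sum of its own temperature, its
# neighbors' temperatures, its local power input, and the ambient node --
# exactly the kind of work a per-node 32-bit MAC unit performs.
# All coefficients below are made up for illustration.

def node_update(t_self, t_neighbors, power,
                a_self=0.90, a_neighbor=0.02, b_power=0.05, t_amb=45.0):
    acc = a_self * t_self                       # self-coupling term
    for t_n in t_neighbors:
        acc += a_neighbor * t_n                 # lateral heat flow from neighbors
    acc += b_power * power                      # local power injection
    # Remaining weight flows to the ambient node so that, with zero power
    # and a uniform temperature field, the temperature stays constant.
    acc += (1.0 - a_self - a_neighbor * len(t_neighbors)) * t_amb
    return acc
```

With zero power and all nodes at ambient temperature, the update is a fixed point; injecting power raises the node temperature, which then diffuses to the neighbors in subsequent iterations.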
7.4 Dynamic Voltage Frequency Scaling Emulation
In this section, we first evaluate the proposed DVFS emulation approach in Table
3. Therefore, we calibrate the FPGA model for three exemplary processors and compare its accuracy with a macromodel, proposed by Park et al. [
37]. In this process, we compute the accuracy of both methods based on SPICE simulations conducted by Park et al. It can be seen that the FPGA model outperforms the macromodel in terms of
mean absolute error (MAE) and
mean relative error (MRE) for the voltage transition time
\( \tau _{uc} \) and the energy consumption during down-scaling
\( E_{uc,down} \). Even though the macromodel achieves better results for the energy consumption of the processors during up-scaling
\( E_{uc,up} \) , it should be noted that the absolute error of the FPGA model is sufficient. Furthermore, we evaluate the impact of the DVFS overhead on the design of a state-of-the-practice hardware DTM [
21]. The DTM monitors the temperature of a core and reacts to a thermal violation by throttling down the
\( V/f \) level to a minimum value. Once the core temperature decreases below a lower thermal threshold, the
\( V/f \) level of the core is reset to its peak value. The challenge in the design of such a DTM is to define a suitable lower thermal threshold, since the upper threshold is already defined by the safe operating temperature of the processor (here 80 °C). Therefore, we evaluate the execution time of an application for different lower thermal thresholds with and without considering DVFS overheads in Table
4. It can be seen that for a given lower thermal threshold, the execution time is always higher when the DVFS overheads are considered. This behavior is mostly introduced by the voltage transition timing overhead. Even though the transition time is in the order of microseconds, the overhead accumulates as the DTM continually switches between the highest and the lowest frequency.
The greater the difference between the lower and the upper thermal threshold, the longer the throttling period after each thermal violation. Hence, the implementations with and without DVFS overheads show very similar execution times for small lower thresholds. Here, the number of \( V/f \) transitions is limited. In contrast, a larger lower thermal threshold implies a large number of \( V/f \) transitions. Hence, the DVFS overheads impact the execution time of the application significantly. As a result, an evaluation of the execution time without DVFS overheads suggests the choice of a high lower thermal threshold. However, the evaluations with consideration of the DVFS overheads show that a smaller lower thermal threshold is actually better suited to maximize the performance of the application. Thus, the DVFS overheads must be emulated in the evaluation approach.
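The accumulation effect can be illustrated with a toy calculation. The transition time and violation counts below are invented; the point is only that each throttle/restore pair costs two voltage transitions, and a high lower threshold leads to many such pairs.

```python
# Toy illustration of accumulated DVFS transition overhead in the DTM:
# every thermal violation triggers one down-scaling and one up-scaling
# transition. Transition time and violation counts are hypothetical.

TRANSITION_TIME_US = 5.0  # assumed voltage transition time per V/f switch

def dvfs_overhead_us(n_thermal_violations: int) -> float:
    """Total transition time: two V/f switches per violation."""
    return 2 * n_thermal_violations * TRANSITION_TIME_US

# High lower threshold -> core re-heats quickly -> many violations:
overhead_high = dvfs_overhead_us(10_000)  # 100,000 us of pure transition time
# Low lower threshold -> long throttling phases -> few violations:
overhead_low = dvfs_overhead_us(200)      # 2,000 us
```

Even microsecond-scale transitions thus add up to a significant share of the execution time when the DTM oscillates, which is why an evaluation that omits them misjudges the best threshold.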
7.5 Evaluation Platform
In this section, we evaluate the scalability and the performance of the evaluation approach. Therefore, we illustrate the hardware utilization of the proFPGA system in Table
5. The 80-core processor uses 1,604,265 slice registers, 2,567,082 slice LUTs, and 1,184 DSPs on the four FPGAs. Although these absolute numbers are large, the system uses only
\( 16.4\% \) of the available slice registers,
\( 52.5\% \) of the available slice LUTs, and
\( 13.7\% \) of the available DSPs on the FPGAs. Thus, the number of available LUTs limits the scalability of the approach. In a first-order approximation, one can estimate that up to ~150 cores can be integrated on four Virtex-7 FPGAs. On four Xilinx UltraScale FPGAs, the system could even be scaled up to a maximum of ~340 cores. As a result, FPGA prototypes are also suitable to demonstrate many-core systems.
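The first-order estimate follows directly from the reported utilization of the limiting resource, which can be checked as follows (the estimate ignores per-core variation and routing pressure):

```python
# First-order scalability estimate: slice LUTs are the limiting resource,
# with 80 cores occupying 52.5% of the available LUTs on four Virtex-7 FPGAs.
cores = 80
lut_utilization = 0.525

# 80 / 0.525 ~= 152 cores, matching the ~150-core estimate in the text.
max_cores_virtex7 = int(cores / lut_utilization)
```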
Besides the scalability, the evaluation performance is also important. Therefore, we compare the simulation speed of the different approaches in MIPS. For this comparison, we assume that a single LEON3 core executes 0.7
instructions per cycle (IPC). Thus, the target ASIC design, consisting of 80 cores running at a maximal frequency of 4 GHz, could provide a peak performance of 224,000 MIPS. On the FPGA prototype, the cores run at 50 MHz. Thus, it provides a peak performance of 2,800 MIPS. As a result, the FPGA prototype is only two orders of magnitude slower than the target system. Hence, it is still possible to perform rapid evaluations and run the target application on top of an operating system. This is a performance that common simulation-based approaches cannot achieve. Although Sniper does not perform cycle-accurate simulations, it only achieves a performance of ~2 MIPS [
8,
19], which is five orders of magnitude slower than the target design and three orders of magnitude slower than the FPGA prototype. A cycle-accurate simulator, such as gem5, achieves only a performance of ~0.3 MIPS [
18], making it six orders of magnitude slower than the target design. In summary, simulation-based approaches do not provide the performance needed to evaluate thermal and resource management strategies of many-core processors.
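The MIPS figures above follow from the simple peak-performance relation \( \text{MIPS} = \text{cores} \times f_{\text{MHz}} \times \text{IPC} \), reproduced here as a check:

```python
# Peak-performance back-of-the-envelope from the text:
# MIPS = number of cores * frequency in MHz * IPC (assumed 0.7 for LEON3).
IPC = 0.7

def peak_mips(n_cores: int, f_mhz: float) -> float:
    return n_cores * f_mhz * IPC

asic_mips = peak_mips(80, 4000.0)  # 80 cores at 4 GHz  -> 224,000 MIPS
fpga_mips = peak_mips(80, 50.0)    # 80 cores at 50 MHz ->   2,800 MIPS
slowdown = asic_mips / fpga_mips   # factor 80, i.e., two orders of magnitude
```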
7.6 Case Study
In this case study, we employ our proposed platform to evaluate a state-of-the-art system-level thermal management technique based on power budgeting [
40] and compare it with the state-of-the-practice DTM evaluated in Section
7.4 as a baseline. We assume a safe operating temperature of 80 °C for the emulated processor. The DTM technique is reactive and not predictable. Hence, it is not possible to give timing guarantees for real-time tasks at design time, since DTM can be triggered at any point at run-time. To provide predictability, the concept of
Thermal Safe Power (TSP) [
36] can be employed. TSP is a per-core power budget that guarantees the avoidance of thermal violations. There are uniform and non-uniform variants of TSP. In this case study, we employ the non-uniform variant, which calculates a different power budget for each task. Moreover, the worst-case schedule of parallel running tasks w.r.t. temperature is taken into account to provide design-time guarantees that thermal violations will not occur at run-time. The resulting power budget of each task is mapped to a thermally safe frequency
\( f_{safe} \) based on the power profile of the task.
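The mapping from a power budget to \( f_{safe} \) can be sketched as a lookup over the task's power profile: pick the highest \( V/f \) level whose characterized power stays within the budget. The profile values and frequencies below are hypothetical placeholders.

```python
# Sketch of mapping a per-task power budget to a thermally safe frequency.
# The (frequency in GHz, task power in W) pairs are a hypothetical power
# profile, sorted by ascending frequency.
POWER_PROFILE = [(2.0, 1.1), (3.0, 1.9), (3.3, 2.3), (4.0, 3.2)]

def f_safe(power_budget_w: float) -> float:
    """Highest frequency whose task power does not exceed the budget.

    Falls back to the lowest available level if even that exceeds the
    budget (a design-time infeasibility in a real flow).
    """
    feasible = [f for f, p in POWER_PROFILE if p <= power_budget_w]
    return max(feasible) if feasible else min(f for f, _ in POWER_PROFILE)

f = f_safe(2.4)  # under this assumed profile, a 2.4 W budget maps to 3.3 GHz
```

A generous budget (e.g., for a task that never runs in parallel with others) maps to the peak frequency, while the tighter worst-case budgets of parallel tasks map to reduced levels such as 3.0 or 3.3 GHz.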
To evaluate this technique, we conduct two use-cases with applications generated using the benchmark generation infrastructure earlier introduced. In the first experiment, one application is employed (illustrated in Figure
9(a)) and it follows the scatter-gather pattern, which is often seen in parallel benchmarks. The thermally safe power budgets are calculated for all tasks. Since
\( t_{0} \) and
\( t_{13} \) can never run in parallel to another task, they get a high power budget, which allows them to run at the peak frequency. Contrarily, for
\( t_{1} \) to
\( t_{12} \), it is possible that all of these tasks run in parallel (depending on the actual execution time of each task). As mentioned, calculating the thermally safe power budgets accounts for the worst-case schedule, and therefore the power budgets of these tasks are lower, restricting them to lower frequencies, i.e., 3.0 and 3.3 GHz. The difference in the frequencies is caused by the resulting non-uniform power budgets. Particularly,
\( t_{9} \) and
\( t_{11} \) are mapped next to several idle cores, which allows a higher power budget for them, and thereby higher
\( f_{safe} \) . Executing the tasks at their selected
\( f_{safe} \) avoids thermal violations (i.e., the maximal emulated temperature
\( T_{max} \) does not reach the thermal threshold of 80 °C), as our platform demonstrates (Figure
9(b)). However, when this power budgeting technique is not employed, all tasks are executed at the peak frequency, but then thermal violations occur. Consequently, DTM is triggered frequently and throttles down the frequencies to return the system to a thermally safe state. Hence, the frequencies oscillate between the peak and minimum levels, and the average frequency is less than
\( f_{safe} \). Therefore, the execution times of most parallel-running tasks are higher than with thermally safe power budgeting, as shown in Figure
9(b).
The second use-case is shown in Figure
10, in which two applications are running in parallel. The first application has an early parallel phase, while the second application has a late parallel phase. Since the actual execution times of the tasks are not known at design-time, thermally safe power budgeting considers the worst-case schedule. For example, when calculating the power budget of task
\( t_{01} \) , the worst-case schedule is that the tasks
\( t_{02} \) to
\( t_{04} \) and
\( t_{12} \) to
\( t_{15} \) all are running in parallel to it, as illustrated in the figure. However, at run-time, for certain input data, these tasks do not overlap. As a result, the maximum emulated core temperatures
\( T_{max} \) are far from the thermal constraint, as shown in the table in Figure
10(b). This demonstrates that a price in terms of performance must be paid to provide the thermal-safety guarantees at design time. If the power budgeting technique is not employed, thermal violations occur and DTM is triggered, throttling down the
\( V/f \) levels. However, in this particular experiment, the negative impacts of triggering DTM on the execution times of the tasks are lower than the price paid for the thermally safe power budgets, as illustrated in Figure
10(b).
In summary, our approach enables a comprehensive analysis of a state-of-the-art thermally safe power budgeting technique, highlighting both its potential to guarantee a thermally safe execution and its implied pessimism. Moreover, the approach demonstrates the negative impact of the state-of-the-practice DTM on performance and how it hinders predictability. These findings could only be established by the combination of benchmark generation, ASIC emulation, and run-time monitoring. The proposed benchmark generation approach enables the generation of applications that reveal the advantages and disadvantages of both thermal management techniques. The proposed ASIC emulation approach provides the ASIC power, temperature, and DVFS emulation that makes the analysis possible in the first place, while the run-time monitoring architecture provides performance indicators, such as the maximal emulated core temperature and the execution time of each task, to compare and analyze both techniques.