An Energy-Efficient Adaptive Hybrid Cache
Jason Cong, Karthik Gururaj, Hui Huang, Chunyue Liu, Glenn Reinman, Yi Zou
Computer Science Department, University of California, Los Angeles
Los Angeles, CA 90095, USA
{cong, karthikg, huihuang, liucy, reinman, zouyi}@cs.ucla.edu

Abstract—By reconfiguring part of the cache as software-managed scratchpad memory (SPM), hybrid caches can handle both unknown and predictable memory access patterns. However, existing hybrid caches provide a flexible partitioning of cache and SPM without adapting to the run-time cache behavior, and previous cache set balancing techniques are either energy-inefficient or require serial tag and data array access. In this paper an adaptive hybrid cache is proposed that dynamically remaps SPM blocks from high-demand cache sets to low-demand cache sets. It achieves 19%, 25%, 18% and 18% energy-runtime product reductions over four representative previous techniques on a wide range of benchmarks.

Keywords—Energy Reduction; Hybrid Cache; Scratchpad Memory

I. INTRODUCTION

Caches are widely used in modern processors to effectively hide data access latency, since the memory reference patterns of most applications have good spatial/temporal locality. For applications with predictable data access patterns, it is possible to let the software directly manage the on-chip storage. This alternative is called scratchpad memory (SPM). Because the SPM does not need to perform tag comparisons or drive associative ways, it is much more energy-efficient than a cache [1]. Embedded architectures use SPM in conjunction with caches to reduce power consumption. However, certain applications may prefer SPM (e.g., those with predictable array access patterns) while other applications may prefer a cache (e.g., those with dynamic and random accesses). Even among applications that prefer SPM, the required SPM size may vary [1]. Under these circumstances, designing the cache and SPM separately at the physical level, with a fixed size for each, is likely to be suboptimal for particular applications. As a result, reconfigurable caches have been proposed to flexibly size the cache and SPM according to application requirements in a hybrid cache design. Column caching [2] and FlexCache [3] expose part of the cache as software-controlled memory. The reconfigurable cache [4] and the virtual local store [5] enable the cache to be dynamically partitioned at granularities ranging from cache ways down to cache blocks. Besides the reconfigurable caches, an adjustable-granularity cache-locking function—available on multiple embedded architectures such as the Freescale e300 [6]—can also be used to achieve a flexible partitioning of cache and SPM. Way stealing [7] uses special cache preload and locking instructions to provide local memory for instruction set extensions.

However, all of the above hybrid cache designs partition the cache and SPM without adapting to the run-time cache behavior; i.e., when allocating cache blocks to the SPM, they select blocks from cache sets uniformly. Since cache sets are not uniformly utilized [8], this uniform mapping of SPM blocks onto cache blocks may create hot cache sets at run time, which increases the conflict miss rate and degrades performance. Figure 1 shows the cache set utilization statistics for a hybrid cache design (the system configuration is given in Section IV.A). Each column represents a set in the cache, and each row represents 1 million cycles. A darker point means a hotter cache set. As can be seen, the cache set utilization varies across cache sets and over time. The problem becomes more serious for low-power processors, whose cache associativity is low due to a tight power budget.

Figure 1. Non-uniform cache set utilization in a hybrid cache: (a) astar (SPEC), (b) jpeg (MiBench), (c) h264ref (SPEC).

Balancing cache set utilization has been intensively investigated. Most of these techniques, such as the V-way cache [8], the indirect index cache [9] and the set balancing cache [10], require that the cache tag array and data array be accessed serially. However, since the SPM is designed for fast local access, a hybrid cache typically sits at the primary (L1) cache level, which generally requires parallel access to the tag and data arrays. Therefore, these pseudo-associative cache techniques are not suitable for hybrid caches. There are also techniques that do not require serial tag/data array access. The victim cache [11] uses a small fully associative cache to store victim blocks and mitigate conflict misses; it increases the per-cache-access energy because the victim cache is searched in parallel with the regular cache. Serializing the victim cache access saves energy but incurs additional cycles on a victim cache hit. The balanced cache [12] uses a content-addressable memory (CAM) inside the cache decoder and increases the decoder length to associate cache sets. Although the CAM access latency fits in the decoder slack and introduces only a 10% per-cache-access energy overhead at 0.18um technology, at a nano-scale technology such as 32nm the CAM access latency exceeds the decoder slack by 20%~40% and incurs a large per-cache-access energy overhead [13]. Therefore, it is important to find an energy-efficient approach to the hot-cache-set problem in hybrid caches that does not require serial tag/data array access.

Fortunately, the nature of the hybrid cache provides another way to balance cache set utilization. Instead of pseudo-associating the cache sets (as done in the previous approaches) and maintaining a fixed SPM mapping, we can dynamically remap SPM blocks from high-demand cache sets to low-demand cache sets. Intuitively, this is similar to previous cache-energy reduction techniques that dynamically activate and deactivate cache lines based on cache set utilization [14]. However, switching cache lines on or off in one cache set does not influence the other cache sets, whereas migrating an SPM block from a high-demand set to a low-demand set increases the pressure on the destination set. Therefore, directly applying the previous approaches may leave several cache sets simply passing SPM blocks among themselves repeatedly (referred to as the circular bouncing effect). Another challenge caused by dynamically remapping SPM blocks is quickly locating an SPM block in the cache, since the SPM block locations may change. Using software to manage the remapping would be costly, both because it is inefficient and because it hurts code portability. Therefore, hardware support is desired so that software can focus on the use of a logically continuous SPM.

This research is partially supported by the NSF Expeditions in Computing Award CCF-0926127, NSF grant CCF-0903541, and GSRC.
To the best of our knowledge, this is the first work that considers run-time adaptation in hybrid cache designs. The main contributions of the proposed adaptive hybrid cache (AH-Cache) are as follows:
• The look-up of the SPM location is hidden in the execution (EX) pipeline stage of the processor, and a clean software interface is provided, as in a non-adaptive hybrid cache.
• A victim tag buffer that shares the tag array is used to assess cache set utilization, resulting in no storage overhead.
• An adaptive mapping scheme based on a floating-block-holder queue is proposed for fast adaptation to the cache behavior without the circular bouncing effect.
The remainder of this paper is organized as follows: Section II describes the software interface of AH-Cache. The AH-Cache architecture design and its overhead are detailed in Section III. Section IV presents experimental results, and Section V concludes the paper.

II. SOFTWARE INTERFACE

We first briefly describe the software interface of AH-Cache, where we want to emphasize that the software only needs to be aware of a logically continuous SPM; it does not care where the SPM blocks are physically mapped. By providing such a clean software interface, 1) all of the previous compilation techniques that target SPM utilization, such as dynamic data placement [1] and stack and heap support in SPM [15], can be used directly on AH-Cache, since the compiler only views a logically continuous SPM; and 2) in a multi-threaded architecture, the previous context switching schemes for SPM (e.g., [16]) can be used directly on AH-Cache, since the operating system only views a logically continuous SPM.

A simple example is shown in Figure 2. To manage the SPM in AH-Cache, the software is provided with two system APIs that specify the SPM base address and size. As shown in Figure 2(b), spm_pos sets the SPM base address register to the address of the first element of array amplitude, and spm_size sets the SPM size register to the size of the arrays amplitude and state. Note that these system APIs do not impact the ISA, since they use regular instructions for register value assignment. The base address and size of the SPM can be set multiple times across the software. If the software requests an SPM size larger than the maximum SPM size (discussed in Section III.A), it can still run on AH-Cache, but AH-Cache will only provide its maximum SPM size; SPM references beyond this size are treated as regular memory references and are served by the cache. This scheme allows portability of the software across different AH-Cache sizes. We have developed a compilation pass [17] inside the LLVM [18] compilation infrastructure to automatically transform and optimize the original application code for better SPM utilization on AH-Cache.

Figure 2. (a) Original code. (b) Transformed code for AH-Cache. (c) Memory space view of SPM in AH-Cache. (d) SPM blocks. (e) SPM mapping in AH-Cache. (f) SPM mapping look-up table (SMLT).

III. AH-CACHE ARCHITECTURE

A. SPM Mapping Look-Up

As shown in Figure 2(d)(e), the partition between cache and SPM in AH-Cache is at cache-block granularity. If the requested SPM size is not a multiple of a cache block, it is rounded up to the next block-sized multiple. The mapping of SPM blocks onto cache blocks is stored in an SPM mapping look-up table (SMLT). The number of SMLT entries is the maximum number of cache blocks that can be configured as SPM. Since AH-Cache must keep at least one cache block in each cache set to maintain the cache functionality, the maximum SPM size on an M-way N-set set-associative cache is (M-1)*N blocks. Each SMLT entry contains 1) a valid bit indicating whether this SPM block falls into the real SPM space, since the requested SPM size may be smaller than the maximum SPM size; and 2) a set index and a way index that locate the cache block onto which the SPM block is mapped.

In a recent non-adaptive hybrid cache design [5], the high-order bits of the virtual address of a memory reference are checked early in the pipeline (after the ALU computes the virtual address) to determine whether the reference targets the SPM or the regular cache. The check compares these high-order bits with the SPM base address. This enables fast checking, but requires that the SPM base address be aligned so that all of its low-order bits are 0. AH-Cache needs an additional step that uses the low-order bits of the virtual address to look up the SMLT, which further lengthens the pipeline critical path. To solve this problem, inspired by the zero-cycle load idea [19], we perform the address checking and the SMLT look-up in parallel with the virtual address calculation of the memory operation, as shown in Figure 3. Assuming a base + displacement address calculation mode, a memory reference instruction computes its virtual address from a base address and an offset, which are obtained either from the register file or from the immediate value of the instruction in the Instruction Decode (ID) stage. In the Execution (EX) stage, the ALU calculates the virtual address from these values. Simultaneously, the base address is compared with the SPM base address, and the offset is sent to the SMLT to obtain the mapping information (the cache-set-index part of the offset bits is used to index the SMLT). When the output of the comparator is true and the valid bit of the indexed SMLT entry is set, this memory reference is treated as an SPM access. In this way, both the address checking and the SMLT look-up are done in the EX stage, and the SPM access time of AH-Cache is not increased.

Figure 3. SPM mapping look-up and access in AH-Cache.
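To make the look-up step concrete, the following C sketch is a purely behavioral model of the SMLT and the EX-stage decision described above (the type and function names, such as smlt_entry_t and spm_lookup, are ours and not part of the design); in hardware the comparison and the table read occur in parallel with the ALU addition rather than sequentially as in software.

#include <stdint.h>
#include <stdbool.h>

/* One SMLT entry: a valid bit plus the (set, way) where the SPM block currently
   resides. Section III.A: at most (M-1)*N entries of 1+logM+logN bits each. */
typedef struct {
    bool     valid;
    uint16_t set_index;   /* cache set that holds this SPM block */
    uint8_t  way_index;   /* way inside that set */
} smlt_entry_t;

#define SPM_BLOCK_BITS  6            /* 64-byte cache blocks */
#define MAX_SPM_BLOCKS  (1 * 128)    /* (M-1)*N for a 2-way, 128-set cache */

static smlt_entry_t smlt[MAX_SPM_BLOCKS];
static uint64_t     spm_base;        /* SPM base address register */

/* Returns true and the physical (set, way) if the reference is an SPM access;
   false means it proceeds as a regular cache access. The SMLT is indexed here
   by the SPM block number derived from the offset, a simplification of the
   set-index-based indexing described in the text. */
bool spm_lookup(uint64_t base, uint64_t offset, uint16_t *set, uint8_t *way)
{
    uint32_t block = (uint32_t)(offset >> SPM_BLOCK_BITS);
    if (base == spm_base && block < MAX_SPM_BLOCKS && smlt[block].valid) {
        *set = smlt[block].set_index;
        *way = smlt[block].way_index;
        return true;
    }
    return false;
}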
This architecture imposes a constraint on the compiler: memory reference instructions to the SPM must be generated so that the base address of the instruction is the SPM base address and the offset is the offset relative to the SPM base address. This constraint does not impact the optimization ability of the compiler, since the transformation can be performed in the last stage of code optimization. However, extra care is needed when a pointer to some element of the SPM is passed as a parameter to a function and all memory references inside the callee are based on that input pointer parameter. The compiler should first split the callee's input pointer parameter into two parts: a base pointer base, and the offset of the original input pointer relative to base. Then, inside the callee, all memory references related to the original input pointer are generated with base as the new base pointer. On the caller side, the SPM base address is passed to the callee's base, and the offset of the original input pointer relative to the SPM base address is passed to the callee's offset.

One concern is whether the virtual address calculation at the ALU can hide the look-up time of the SMLT (the comparator is clearly not on the critical path, since it is much simpler than the ALU). For an M-way N-set set-associative cache, the SMLT has (M-1)*N entries, each containing (1+logM+logN) bits. TABLE I shows the access latencies for various L1 cache configurations obtained with Cacti [20] at 32nm technology (the cache block size is 64B). As shown in the table, all SMLT accesses finish within 0.2 ns, which fits into the cycle time of a 4GHz core. Given that the previous non-adaptive hybrid cache [5] can add a comparator after the ALU, the one-level MUX that AH-Cache adds after the ALU introduces an even smaller delay. It should be noted that the way index encoder is also used in the non-adaptive hybrid cache [5] to avoid TLB look-ups and tag comparisons on SPM accesses; it is not an overhead of AH-Cache.

TABLE I. SMLT LATENCY OF VARIOUS CACHE CONFIGURATIONS
Cache size             8KB         16KB        32KB        64KB
Associativity          2     4     2     4     2     4     2     4
SMLT entries           64    96    128   192   256   384   512   768
SMLT width (bits)      8     8     9     9     10    10    11    11
Access latency (ns)    0.14  0.15  0.16  0.17  0.17  0.18  0.18  0.19
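To make the base-plus-offset constraint concrete, the hypothetical before/after sketch below shows how a callee that receives a pointer into the SPM could be rewritten; the function and variable names (filter, buf, spm_base) are ours for illustration only, and amplitude refers to the array used in the Figure 2 example.

/* Before: the callee dereferences an arbitrary pointer into the SPM, so its
   loads and stores are not expressed relative to the SPM base address. */
void filter(float *buf, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] *= 0.5f;
}
/* call site: filter(&amplitude[16], 64); */

/* After: the compiler splits the pointer parameter into the SPM base pointer
   and an element offset, so every reference uses the SPM base address as its
   base register, as AH-Cache requires. */
void filter_spm(float *base, int offset, int n)
{
    for (int i = 0; i < n; i++)
        base[offset + i] *= 0.5f;
}
/* call site: filter_spm(spm_base, 16, 64);
   where spm_base is the address that was passed to spm_pos in Figure 2(b). */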
B. Cache Set Demand Assessment

As in [8], we refer to cache sets that highly utilize most or all of their cache blocks as high-demand sets, and cache sets that underutilize their available blocks as low-demand sets. We want the low-demand sets to accommodate proportionally more SPM blocks than the high-demand sets, as shown in Figure 2(e). The miss rate cannot be used to recognize a high-demand cache set: for streaming applications with little locality, or for applications that hopelessly thrash the cache, the miss rate is high yet there is little benefit in adding cache blocks. Therefore, we use a victim tag buffer (VTB) to capture the demand of each set; this is similar to the miss tag introduced in [14], but with no memory overhead (as explained below).

Logically, the VTB has the same number of sets as the tag array and one less way in each set (since at least one cache block in each set is retained). When a cache block is configured as an SPM block, its tag is disabled and its corresponding VTB tag is enabled. Once it is recovered from an SPM block and becomes a regular cache block again, its tag is enabled and its corresponding VTB tag is disabled. In this way we can naturally combine the original tag array and the VTB: for each entry in the original tag array, one bit is added to indicate whether the tag is a regular tag or a victim tag. Figure 4(a) shows the VTB inside the tag array for the mapping of Figure 2(e). When a replacement happens in the cache part of AH-Cache, the tag of the victim block is written into the corresponding set of the VTB with a pseudo-LRU policy. There is one VTB counter per set (not per cache block), as shown in Figure 4(b). The VTB is only accessed on a miss in the cache part of AH-Cache. If the miss hits in the VTB, the set's VTB counter is increased by 1, since this indicates that the access would have been a hit if the block had remained enabled in the cache part of AH-Cache. Cache misses due to streaming or thrashing do not lead to VTB hits, since there is no reuse.

Figure 4. (a) A VTB in the tag array. (b) VTB counters and insertion flags.

C. Adaptive Mapping

If the application only requires P SPM blocks while AH-Cache can provide at most Q SPM blocks, then there are S = Q - P cache blocks (referred to as floating blocks) that are used to adaptively satisfy the high-demand cache sets. When we say that cache set A gets a floating block from cache set B, we mean that A sends one of its SPM blocks to B and enables the vacated cache way as a regular cache block, while B evicts one of its cache blocks to accommodate the SPM block from A. We cannot simply let a cache set with a high VTB counter take a floating block from a cache set with a low VTB counter: a low VTB counter only means that the set does not need more floating blocks, not that it can afford to lose one. Doing so could leave several cache sets simply passing SPM blocks among themselves repeatedly. To solve this problem, we propose a mapping scheme based on a floating block holder (FBH) queue. The queue records the cache sets that currently hold the floating blocks. Each queue node consists of the index of a floating-block-holder set and a re-insertion bit, which indicates whether this cache set has been re-inserted into the queue in the current adaptation interval (a fixed number of cycles). Each cache set also holds a 1-bit insertion flag indicating whether it has been inserted into the queue in the current interval, as shown in Figure 4(b).

At the beginning of each adaptation interval, all the re-insertion bits in the queue and the insertion flags in the cache sets are reset to 0. When a cache set A's VTB counter reaches a threshold T, the FBH queue is searched, starting from the head, until a node with a re-insertion bit of 0 is found. Let the set index in this node be B. Set B then accommodates one SPM block from set A. The node is removed from the queue, and a new node with set index A is inserted at the tail of the queue, with its re-insertion bit set to the current insertion flag of set A; set A's insertion flag is then updated to 1. Because its re-insertion bit is 1, a high-demand set will not give up its floating block once it has been re-inserted into the queue in the current interval. Once all the re-insertion bits in the queue are 1, the remapping for this interval stops. This effectively removes the potential circular bouncing effect. With a small number of SPM block migrations, the proposed mapping scheme forms an SPM mapping that adapts to the cache set demands of the current interval, as shown in TABLE IV. The threshold T determines the size of the VTB counter, and the selection of T and the interval length I should be considered together to make the adaptation lazier or more aggressive. In this work we set T to 16 and I to 1 million cycles; the VTB counter of each cache set is then 4 bits long.

Since a node removal is always accompanied by a node insertion, the FBH queue can be implemented simply as an SRAM controlled by a pointer. As shown in Figure 5(a), the number of active entries of this SRAM equals the number of floating blocks, and the total number of entries equals the maximum number of SPM blocks. When a VTB counter reaches the threshold and requests an SPM migration, the pointer moves from its current position until it finds an entry with a re-insertion bit of 0. The new node then overwrites this entry, and the pointer moves to the next entry, wrapping back to the head of the SRAM when it reaches the end of the active region.
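The following C sketch summarizes the per-miss bookkeeping and the FBH-based remapping decision described above. It is a behavioral model under our own naming (vtb_counter, on_vtb_hit, migrate_spm_block, and so on); the real mechanism is implemented in hardware next to the tag array, and details such as when the VTB counter is cleared are our assumptions rather than statements from the paper.

#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS   128
#define THRESHOLD  16            /* T: matches the 4-bit VTB counter */

typedef struct { uint16_t set; bool reinserted; } fbh_node_t;

static uint8_t    vtb_counter[NUM_SETS];     /* one counter per cache set */
static bool       insertion_flag[NUM_SETS];  /* set once the set enters the queue */
static fbh_node_t fbh[NUM_SETS];             /* active region = number of floating blocks */
static int        fbh_len;                   /* S = Q - P floating blocks */
static int        ptr;                       /* current search position */

/* Hardware block migration through the 64B migration buffer (Section III.D). */
void migrate_spm_block(uint16_t from_set, uint16_t to_set);

/* Called at the start of every adaptation interval (1 million cycles). */
void new_interval(void)
{
    for (int s = 0; s < NUM_SETS; s++) insertion_flag[s] = false;
    for (int i = 0; i < fbh_len; i++)  fbh[i].reinserted = false;
}

/* Called when a miss in set a hits in the VTB (the evicted block would have hit). */
void on_vtb_hit(uint16_t a)
{
    if (++vtb_counter[a] < THRESHOLD) return;
    vtb_counter[a] = 0;                       /* assumption: counter restarts after a request */

    /* serial search for a holder whose re-insertion bit is still 0 */
    for (int n = 0; n < fbh_len; n++) {
        int i = (ptr + n) % fbh_len;
        if (!fbh[i].reinserted) {
            uint16_t b = fbh[i].set;
            migrate_spm_block(a, b);          /* a's SPM block moves to b; a regains a cache way */
            fbh[i].set        = a;            /* a now holds the floating block */
            fbh[i].reinserted = insertion_flag[a];
            insertion_flag[a] = true;
            ptr = (i + 1) % fbh_len;
            return;
        }
    }
    /* all re-insertion bits are 1: remapping is suspended for the rest of the interval */
}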
Figure 5. (a) FBH queue. (b) Parallel FBH search using RIBT.

Serially searching the FBH queue may incur a worst-case delay of S cycles, where S is the maximum SPM size. To reduce the search time, a parallel search scheme is developed as follows. We store the re-insertion bits in another SRAM called the re-insertion bit table (RIBT), as shown in Figure 5(b). Each RIBT entry contains 16 re-insertion bits, so an S-entry FBH queue has an S/16-entry, 16-bit-wide RIBT. Every 16 re-insertion bits are searched in parallel using a priority encoder that outputs the index of the first zero bit of its input vector; the longest search time thus drops to S/16 cycles. The FBH queue is searched when a cache set misses and its VTB counter reaches the threshold for requesting an SPM migration; therefore, if the search finishes before the missed cache block is fetched from the L2 cache, it does not delay the acquisition of new data from the L2 cache. In our evaluation architecture, the L2 cache access latency is 20 cycles, while the maximum FBH queue length is 256 and the worst-case search time is 16 cycles, which is smaller than the L2 cache access latency. The FBH search latency can be reduced further by searching more re-insertion bits in parallel, at the cost of a wider priority encoder. We use the priority encoder designed in [21]. According to the Synopsys Design Compiler, the search logic amounts to around 500 gates, which is less than 1% of the cache design.

D. Storage and Energy Overhead

To quantify the storage overhead of AH-Cache, we use as an example a 16KB, 2-way set-associative hybrid cache with 128 sets, a 64B data block size, and a 4B tag entry (including the tag, coherence state bits, dirty bit, etc.). It can provide at most 128 64B SPM blocks, so the SMLT contains 128 9-bit entries (1 valid bit + 6-bit set index + 2-bit way index). The VTB physically shares the tag array and thus only incurs one additional bit per tag entry. Each cache set also has one additional insertion flag and a 4-bit VTB counter. The FBH queue contains 128 7-bit entries, the RIBT contains 8 16-bit entries, and the migration buffer holds one 64B cache block. Therefore, the total storage overhead introduced by AH-Cache is around 0.4KB, which is 3% of the baseline hybrid cache size.

For the energy overhead, AH-Cache needs to access the SMLT on each cache access, access the VTB on each cache miss, and trigger the adaptive mapping unit (the FBH queue, the RIBT, and the 16-bit priority encoder) each time a VTB counter reaches the threshold. According to Cacti [20], at 32nm technology the access energy of the SMLT is up to 0.8pJ. The access to the VTB, which physically shares the tag array, adds 0.8pJ for a 16KB cache and 0.9pJ for a 32KB cache per cache miss. The worst-case energy of the adaptive mapping unit, when the entire RIBT is searched, is 2pJ; this is obtained from the Synopsys Design Compiler and Cacti [20]. When an SPM block migration happens, the block is first read out of the cache and written into the migration buffer, and then read out of the migration buffer and written back into the cache in the next cycle; the energy overhead is thus 66pJ for a 16KB cache and 75pJ for a 32KB cache. As can be seen in TABLE IV, there are on average only 4.4 SPM migrations in every 1 million cycles. Therefore, the total per-cache-access energy overhead of AH-Cache is less than 6% of that of a non-adaptive hybrid cache, whose per-cache-access energy is 16.6pJ for 16KB and 18.9pJ for 32KB. Moreover, AH-Cache can save energy by reducing the lower-level (L2) cache energy (through a lower miss rate) and the leakage energy (through a shorter run time), as shown in the next section.
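As a quick sanity check of the 0.4KB figure, the short program below tallies the structures listed above for the 16KB, 2-way, 128-set example; the byte breakdown is our own arithmetic from the entry counts and widths given in the text.

#include <stdio.h>

int main(void)
{
    /* bits of each AH-Cache structure for a 16KB, 2-way, 128-set hybrid cache */
    int smlt      = 128 * 9;        /* SMLT: 128 entries x 9 bits              */
    int vtb_bits  = 2 * 128 * 1;    /* VTB: 1 extra bit per tag entry (2 ways) */
    int per_set   = 128 * (1 + 4);  /* insertion flag + 4-bit VTB counter      */
    int fbh       = 128 * 7;        /* FBH queue: 128 x 7-bit entries          */
    int ribt      = 8 * 16;         /* RIBT: 8 x 16-bit entries                */
    int mig_buf   = 64 * 8;         /* migration buffer: one 64B block         */

    int total = smlt + vtb_bits + per_set + fbh + ribt + mig_buf;
    printf("total = %d bits = %d bytes (~0.4KB, about 3%% of 16KB)\n",
           total, total / 8);       /* 3584 bits = 448 bytes */
    return 0;
}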
IV. EXPERIMENTAL RESULTS

A. Evaluation Methodology

To cover a diverse set of applications, our benchmarks are chosen from multiple benchmark suites. We select benchmarks with intensive memory accesses that SPM can help to improve, i.e., benchmarks that can benefit from a hybrid cache design (since our goal is to improve hybrid cache designs). They include five benchmarks from the MiBench suite [24] (jpeg, gsm, dijkstra, patricia and susan); five memory-reference-intensive applications from the SPEC2006 suite [25] (h264ref, hmmer, astar, soplex and gobmk); and four medical imaging benchmarks [26]: 1) biHarmonic performs 2D image registration with a bi-harmonic regularization term, 2) mutualInfo computes the mutual information of two 2D images, 3) ricianDenoise performs iterative local denoising based on the Rician noise model, and 4) regionGrowing evaluates whether a region is part of an object in image segmentation. The number of memory references of each benchmark is shown in TABLE II to indicate its scale.

TABLE II. #MEMORY REFERENCES OF THE EVALUATED BENCHMARKS
jpeg 19.8M        gsm 65.1M        susan 76.1M         hmmer 75.7M
soplex 22.1M      h264ref 196.6M   dijkstra 47.7M      patricia 16.8M
astar 93M         gobmk 256.7M     biHarmonic 48.6M    mutualInfo 98.2M
ricianDenoise 7.9M                 regionGrowing 128.3M

To demonstrate the advantage of the adaptation in AH-Cache, we implemented the following designs for comparison.

Non-adaptive hybrid cache (N): This is the baseline design, which uses a 2-way set-associative hybrid cache as the L1 data cache. The SPM mapping onto cache blocks is fixed. We evaluated two cache sizes, 16KB and 32KB, which are typical L1 data cache sizes in low-power processors. According to Cacti [20], the energy per access is 16.6pJ for 16KB and 18.9pJ for 32KB at 32nm technology.

Non-adaptive hybrid cache + balanced cache (B): This design enhances the baseline with the balanced cache (B-Cache) [12], which uses a CAM and an increased decoder length to raise the cache associativity. Due to the high energy overhead of the CAM (90% more per-cache-access energy when BAS (B-Cache associativity) = 8), we use BAS = 4 and MF (mapping factor) = 8 (1/8 of the memory address space has a mapping to cache sets) to achieve a good performance/energy trade-off; this incurs an additional per-cache-access energy of 6.4pJ for the 16KB cache and 7.8pJ for the 32KB cache. The energy data are obtained from [13] at 32nm technology, which extracts its technology parameters from Cacti [20]. It should be noted that the CAM access latency exceeds the original decoder slack, but we optimistically assume that it does not increase the cache critical path.

Non-adaptive hybrid cache + victim cache (Vp, Vs): Design Vp enhances the baseline with a fully associative victim cache [11] accessed in parallel. We use a 4-entry victim cache for the 16KB cache and an 8-entry victim cache for the 32KB cache, with per-cache-access energy overheads of 8.9pJ and 16.3pJ, respectively. Experimental results show that further increasing the victim cache size only marginally improves performance while using much more energy. We also implement a serially accessed victim cache, Vs, in which the victim cache is accessed only on an L1 cache miss to increase energy efficiency; this incurs additional cycles when blocks hit in the victim cache, and an additional pipeline is needed inside the hybrid cache to control the serial victim cache access.

Phase-reconfigurable hybrid cache (R): This design adapts the idea in [14] to the hybrid cache by reconfiguring the SPM mapping at each fixed interval based on the VTB counter statistics. At reconfiguration time, cache sets with a VTB counter above a high threshold can migrate their SPM blocks to cache sets with a VTB counter below a low threshold. The interval length and the two thresholds are tuned for the best performance. The architecture of this design is almost the same as our proposed design (it hides the SMLT access in the EX pipeline and shares the VTB in the tag array), but it lacks the adaptive mapping unit. It is used to evaluate the effectiveness of our adaptive mapping scheme.

Adaptive hybrid cache (AH): This is our proposed design. Its energy overhead is discussed in Section III.D. The VTB counter threshold is 16 and the adaptation interval length is 1 million cycles.

Static optimized hybrid cache (S): This design uses offline analysis of the cache set demand statistics to optimize the remapping at each interval. This design point is impractical, but it serves as a reference point to check the optimality of AH-Cache.

Since all of the above designs provide a clean software interface, they are identical from the software point of view; thus the SPM configurations and utilizations are the same for all designs. The benchmark binaries are generated by our compiler [17] to obtain the optimal SPM configurations. To accurately capture system performance, we leverage the full-system simulator Simics [22] and the GEMS toolset [23] as the timing model of the memory subsystem. All of the above designs are implemented in GEMS. The system configuration of Simics/GEMS is shown in TABLE III.

TABLE III. SIMICS/GEMS SIMULATOR CONFIGURATION
Core:                  Sun UltraSPARC-III Cu processor core
L1 Instruction Cache:  16KB/32KB, 2-way set-associative, 64-byte block, 2-cycle access latency, pseudo-LRU
L1 Data Cache:         16KB/32KB, 2-way set-associative, 64-byte block, 2-cycle access latency, pseudo-LRU
L2 Cache:              512KB, 8-way set-associative, 64-byte block, 20-cycle access latency, pseudo-LRU
Main Memory:           4GB, 320-cycle access latency

TABLE IV. AVERAGE #SPM BLOCK MIGRATIONS IN EACH 1-MILLION-CYCLE INTERVAL
        jpeg    gsm     susan   hmmer   soplex   h264ref   dijkstra   patricia
16KB    5.68    0.04    1.14    8.66    15.9     20.2      6.39       0.79
32KB    0.28    0.01    1.20    0.45    5.26     4.26      0.62       0.15
        astar   gobmk   biHarmonic   mutualInfo   ricianDenoise   regionGrowing
16KB    10.5    4.87    0.03         1.95         0.03            0.04
32KB    10.3    2.65    0.03         0.03         0.02            0.01
B. Performance Comparisons

Figure 6 shows the comparison of L1 data cache (hybrid cache) misses, normalized to the baseline (design N). By functionally increasing the cache associativity through a longer decoder, design B reduces the cache misses by 44%. By accommodating victim blocks, designs Vp and Vs reduce the cache misses by 42%. By reconfiguring the SPM mapping at each interval, design R reduces the cache misses by 34%. AH-Cache reduces the cache misses by 52% compared to the baseline, and outperforms designs B, Vp, Vs and R by 19%, 22%, 22% and 33%, respectively.

Figure 6. Comparison results of L1 data cache (hybrid cache) misses (left: 16KB, right: 32KB).

AH-Cache outperforms design B because B-Cache associates cache sets in a uniform way without considering the cache set demands; it is therefore possible that the associated cache sets are all high-demand sets. Victim cache performance is constrained by the victim cache size (and, for Vs, by the additional victim cache access cycles); larger miss rate reductions would come at a much higher energy overhead. Note that Vp and Vs perform very well for ricianDenoise, which has only a few extremely high-demand cache sets. The fact that AH-Cache outperforms design R indicates that simply applying the previous phase-based reconfiguration approach to the hybrid cache suffers from the circular bouncing effect. AH-Cache nearly matches the optimality of design S in most cases (~1% difference), and even outperforms it on h264ref and susan, since design S is based on interval-level analysis and cannot adapt to the dynamic variations inside an interval. This shows the positive effect of the run-time optimization of AH-Cache.

Figure 7 shows the performance comparison in terms of run time (cycles), normalized to the baseline (design N). The results of AH-Cache and design R include the remapping penalty (the core-to-L1-cache queue is suspended for two cycles for each SPM block migration). As shown in TABLE IV, the average number of SPM block migrations of AH-Cache in each interval is 4.4, which results in a run-time overhead of less than 0.1%. Some applications, such as susan and gsm, which enjoy dramatic cache miss reductions, do not see a corresponding run-time reduction because most of their memory references access the SPM. Nevertheless, AH-Cache still reduces the run time by 18% compared to the baseline, and outperforms designs B, Vp, Vs and R by 3%, 4%, 8% and 12%, respectively.

Figure 7. Comparison results of run time (left: 16KB, right: 32KB).

C. Energy Comparisons

In addition to the L1 data cache energy discussed in Section IV.A, we obtain the dynamic and leakage energy data of the other memory subsystem components, including the L1 instruction cache, the L2 cache and the main memory, through Cacti [20] and McPAT [27]. Given these energy data, we record the number of accesses to the logic and storage structures in our simulations and back-annotate them into our energy estimation models to generate the energy results for each design. The energy comparison results are shown in Figure 8, normalized to the baseline (design N). The energy is broken down into the dynamic energy of the L1 caches (dominated by the L1 data cache), the L2 cache and the main memory, and the leakage energy. Designs B and Vp reduce the L1 data cache miss rates, but at a higher per-cache-access energy. They can still reduce the total energy in some cases by reducing the L2 cache energy (fewer L2 accesses) and the leakage energy (shorter run time); on average, however, designs B and Vp incur total energy overheads of 4% and 13%, respectively, compared to the baseline. By serializing the accesses to the regular L1 cache and the victim cache, design Vs achieves an average total energy reduction of 3% compared to the baseline. Design R achieves an average total energy reduction of 7%, mainly by moderately reducing the L1 miss rate and the run time. Even with the additional energy of the SMLT, the VTB and the adaptive mapping unit, AH-Cache achieves energy reductions of 16%, 22%, 10% and 7% compared to designs B, Vp, Vs and R, respectively. It consumes less energy than designs B, Vp and Vs because its per-cache-access energy overhead is much smaller than that of the CAM in B-Cache or of the victim cache, and less energy than design R because its adaptive mapping reduces L1 misses more effectively and thereby consumes less L2 cache energy and leakage energy (shorter run time). In summary, AH-Cache achieves energy-runtime product reductions of 19%, 25%, 18% and 18% over designs B, Vp, Vs and R, respectively. This verifies the energy efficiency of AH-Cache.

Figure 8. Comparison results of memory subsystem energy, normalized to design N and broken down into L1 cache, L2 cache, main memory and leakage energy (left: 16KB, right: 32KB; bars, left to right: N, B, Vp, Vs, R, AH, S).
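The energy-runtime product figures quoted above follow from the separate energy and run-time reductions; the short check below multiplies the normalized values, assuming the metric is simply normalized energy times normalized run time (our reading of "energy-runtime product").

#include <stdio.h>

int main(void)
{
    /* AH-Cache's average energy and run-time reductions relative to B, Vp, Vs, R */
    const char  *design[]  = { "B", "Vp", "Vs", "R" };
    const double energy[]  = { 0.16, 0.22, 0.10, 0.07 };
    const double runtime[] = { 0.03, 0.04, 0.08, 0.12 };

    for (int i = 0; i < 4; i++) {
        /* product reduction = 1 - (1 - dE) * (1 - dT); this approximately
           reproduces the reported 19%, 25%, 18% and 18% (small differences
           come from rounding the component percentages) */
        double prod = 1.0 - (1.0 - energy[i]) * (1.0 - runtime[i]);
        printf("vs %-2s: %.1f%%\n", design[i], 100.0 * prod);
    }
    return 0;
}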
V. CONCLUSIONS

In this paper an adaptive hybrid cache called AH-Cache is proposed. By remapping SPM blocks onto cache blocks dynamically in hardware, based on the run-time cache behavior, AH-Cache lets the software focus on the utilization of a logically continuous SPM. Experimental results show that AH-Cache achieves energy-runtime product reductions of 19%, 25%, 18% and 18% over representative previous techniques. AH-Cache can therefore serve as an energy-efficient hybrid cache for low-power processors that require flexible SPM sizes to satisfy various application requirements, but have low cache associativity due to a tight power budget.

VI. ACKNOWLEDGMENTS

The authors thank Mishali Naik and Jiajun Zhang for helpful discussions and assistance with the experiments, and the anonymous reviewers for their useful comments on this work.

REFERENCES
[1] I. Issenin, E. Brockmeyer, M. Miranda, and N. Dutt, "DRDU: A data reuse analysis technique for efficient scratch-pad memory management," ACM Trans. Des. Autom. Electron. Syst., vol. 12, no. 2, p. 15, 2007.
[2] D. Chiou, P. Jain, L. Rudolph, and S. Devadas, "Application-specific memory management for embedded systems using software-controlled caches," in Proc. DAC, 2000, pp. 416-419.
[3] C. Moritz, M. Frank, and S. Amarasinghe, "FlexCache: A framework for flexible compiler generated data caching," Lecture Notes in Computer Science, vol. 2107, pp. 135-146, 2001.
[4] P. Ranganathan, S. Adve, and N. Jouppi, "Reconfigurable caches and their application to media processing," in Proc. ISCA, 2000, pp. 214-224.
[5] H. Cook, K. Asanović, and D. Patterson, "Virtual local stores: Enabling software-managed memory hierarchies in mainstream computing environments," Technical Report UCB/EECS-2009-131, 2009.
[6] J. Robertson and K. Gala, "Instruction and data cache locking on the e300 processor core," Freescale Application Note, 2006.
[7] T. Kluter, P. Brisk, P. Ienne, and E. Charbon, "Way stealing: Cache-assisted automatic instruction set extensions," in Proc. DAC, 2009, pp. 31-36.
[8] M. Qureshi, D. Thompson, and Y. Patt, "The V-way cache: Demand based associativity via global replacement," in Proc. ISCA, 2004, pp. 544-555.
[9] E. G. Hallnor and S. K. Reinhardt, "A fully associative software-managed cache design," in Proc. ISCA, 2000, pp. 107-116.
[10] D. Rolán, B. Fraguela, and R. Doallo, "Adaptive line placement with the set balancing cache," in Proc. MICRO, 2009, pp. 529-540.
[11] N. Jouppi, "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers," in Proc. ISCA, 1990, pp. 364-373.
[12] C. Zhang, "Balanced cache: Reducing conflict misses of direct-mapped caches," in Proc. ISCA, 2006, pp. 155-166.
[13] B. Agrawal and T. Sherwood, "Modeling TCAM power for next generation network devices," in Proc. ISPASS, 2006, pp. 120-129.
[14] M. Zhang and K. Asanovic, "Fine-grain CAM-tag cache resizing using miss tags," in Proc. ISLPED, 2002, pp. 130-135.
[15] A. Dominguez, S. Udayakumaran, and R. Barua, "Heap data allocation to scratch-pad memory in embedded systems," J. Embedded Comput., vol. 1, no. 4, pp. 521-540, 2005.
[16] M. Verma, K. Petzold, L. Wehmeyer, H. Falk, and P. Marwedel, "Scratchpad sharing strategies for multiprocess embedded systems: A first approach," in Proc. ESTIMedia, 2005, pp. 115-120.
[17] J. Cong, H. Huang, C. Liu, and Y. Zou, "A reuse-aware prefetching algorithm for scratchpad memory," to appear in Proc. DAC, 2011.
[18] LLVM compiler infrastructure: http://llvm.org/
[19] T. Austin and G. Sohi, "Zero-cycle loads: Microarchitecture support for reducing load latency," in Proc. MICRO, 1995, pp. 82-92.
[20] HP Cacti: http://quid.hpl.hp.com:9081/cacti/
[21] C. Kun, S. Quan, and A. Mason, "A power-optimized 64-bit priority encoder utilizing parallel priority look-ahead," in Proc. ISCAS, 2004, pp. II-753-756.
[22] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: A full system simulation platform," IEEE Computer, vol. 35, pp. 50-58, 2002.
[23] M. Martin, D. Sorin, B. Beckmann, M. Marty, M. Xu, A. Alameldeen, K. Moore, M. Hill, and D. Wood, "Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset," Computer Architecture News, pp. 92-99, 2005.
[24] M. Guthaus, J. Ringenberg, D. Ernst, T. Austin, T. Mudge, and R. Brown, "MiBench: A free, commercially representative embedded benchmark suite," in Workshop on Workload Characterization, 2001, pp. 3-14.
[25] S. Bird, A. Phansalkar, L. K. John, A. Mericas, and R. Indukuru, "Performance characterization of SPEC CPU benchmarks on Intel's core microarchitecture based processor," in SPEC Benchmark Workshop, 2007.
[26] www.cdsc.ucla.edu
[27] S. Li, J. H. Ahn, R. Strong, J. Brockman, D. Tullsen, and N. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in Proc. MICRO, 2009, pp. 469-480.