An energy-efficient data cache organization for embedded processors with virtual memory is proposed. Application knowledge regarding memory references is used to eliminate most tag translations. A novel tagging scheme is introduced, where both virtual and physical tags coexist. Physical tags and special handling of superset index bits are only used for references to shared regions in order to avoid cache inconsistency. By eliminating the need for most address translations on cache access, a significant power reduction is achieved. We outline an efficient hardware architecture, where the application information is captured in a reprogrammable way and the cache is minimally modified.
Zhou and Petrov propose a cache architecture that aims to provide fast data access and low power consumption.
The memory hierarchy of modern computer systems includes a memory that is faster but smaller than main memory, called the cache. Caches reduce the average latency of memory accesses and can be organized in multiple levels, where size increases and speed decreases with each level.
The processor accesses the cache in two steps: cache indexing and tag comparison. In cache indexing, the least significant bits of the memory address select a cache set, where a set consists of one (for direct-mapped caches) or several (for set-associative caches) cache lines. Each cache line holds data, a tag, and state bits, with the tag storing the upper bits of the memory address of the cached data. During tag comparison, the tags of all lines in the selected set are compared against the corresponding bits of the memory address. If a match is found, a cache hit occurs and the data from the cache is used.
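To make the two steps concrete, the following C sketch shows how the index and tag are derived from an address and how the tag comparison decides a hit. The geometry (32-byte lines, 256 sets, 4 ways) and all identifiers are hypothetical, chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cache geometry: 32-byte lines, 256 sets, 4 ways. */
#define LINE_BITS 5          /* log2(32): byte offset within a line */
#define SET_BITS  8          /* log2(256): index bits               */
#define WAYS      4

typedef struct {
    bool     valid;
    uint32_t tag;            /* upper address bits of the cached line */
    uint8_t  data[1 << LINE_BITS];
} cache_line_t;

static cache_line_t cache[1 << SET_BITS][WAYS];

/* Step 1 (indexing): the low-order bits above the line offset select a set.
 * Step 2 (tag comparison): the remaining upper bits are compared against
 * the tags of every line in that set. */
bool cache_lookup(uint32_t addr, cache_line_t **hit_line)
{
    uint32_t index = (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);
    uint32_t tag   = addr >> (LINE_BITS + SET_BITS);

    for (int way = 0; way < WAYS; way++) {
        cache_line_t *line = &cache[index][way];
        if (line->valid && line->tag == tag) {
            *hit_line = line;   /* cache hit: use the cached data */
            return true;
        }
    }
    return false;               /* cache miss */
}
```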
In systems with virtual memory, the processor issues virtual addresses that are translated into physical addresses using a combination of software and hardware. The virtual address space is divided into virtual pages, and the physical address space is divided into page frames. A special structure in memory, called the page table, maps virtual page numbers to physical page numbers for each process, and a dedicated cache, the translation lookaside buffer (TLB), caches page table entries: "TLB is usually implemented as a highly associative cache structure which consumes a significant amount of power."
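As a minimal illustration of this translation path, the C sketch below assumes 4 KB pages and a small fully associative TLB; the page_table_walk helper and all sizes are hypothetical stand-ins for the OS/hardware page-table walker.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical parameters: 4 KB pages, 32-entry fully associative TLB. */
#define PAGE_BITS   12
#define TLB_ENTRIES 32

typedef struct {
    bool     valid;
    uint32_t vpn;   /* virtual page number  */
    uint32_t ppn;   /* physical page number */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Hypothetical walker that reads the in-memory page table on a TLB miss. */
extern uint32_t page_table_walk(uint32_t vpn);

uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);

    /* The TLB caches recently used page table entries.  In hardware every
     * entry is searched in parallel, which is why a highly associative TLB
     * is expensive in power. */
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].ppn << PAGE_BITS) | offset;   /* TLB hit */
    }

    /* TLB miss: walk the page table (TLB refill and replacement omitted). */
    return (page_table_walk(vpn) << PAGE_BITS) | offset;
}
```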
The memory address used for indexing and for tagging the cache can be either the virtual or the physical address. If both are physical, the architecture is called a physical cache; otherwise, it is a virtual cache. The most common virtual caches are those indexed and tagged with virtual address bits (V/V caches) and those indexed with virtual bits but tagged with physical bits (V/P caches).
Physical caches require address translation to complete before cache indexing on every memory access. The TLB is accessed for this purpose, which incurs both a performance penalty (the TLB sits on the memory access path) and a power overhead (the TLB itself consumes power on every access).
In contrast, V/V caches have the advantage that a cache access requires no address translation (and thus no TLB access), which results in fast access and low power consumption. However, V/V caches suffer from potential cache consistency problems. These can occur when the operating system changes a virtual-to-physical page mapping, or when multiple processes share physical memory (that is, parts of the virtual address spaces of two processes are mapped to the same physical memory). The possible consistency problems are synonyms, aliases, homonyms, and cache coherence (Cekleov and Dubois define them in [1]). In uniprocessor systems, cache coherence problems can occur when synonyms for shared writable data exist. Since the contents of instruction caches are not modified by processes, V/V caches can safely be used as instruction caches. The homonym problem is solved by extending the virtual tags with the ID of the process that issues the virtual address.
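To see concretely why synonyms are dangerous in a V/V data cache, the short program below (reusing the hypothetical geometry from the earlier lookup sketch, with made-up addresses) computes the purely virtual index and tag for two virtual addresses that the operating system could map to the same physical page frame: the two accesses get different virtual tags, so the same physical line can be cached as two independent copies, and a write through one copy is not seen through the other.

```c
#include <stdint.h>
#include <stdio.h>

/* Same hypothetical geometry as before: 32-byte lines, 256 sets. */
#define LINE_BITS 5
#define SET_BITS  8

/* Index and tag exactly as a V/V cache computes them: virtual bits only. */
static uint32_t v_index(uint32_t va) { return (va >> LINE_BITS) & ((1u << SET_BITS) - 1); }
static uint32_t v_tag(uint32_t va)   { return va >> (LINE_BITS + SET_BITS); }

int main(void)
{
    /* Hypothetical synonym pair: two virtual addresses, in two processes,
     * that the OS maps to the same physical page frame.  They share the
     * page offset (0x040) but have different virtual page numbers. */
    uint32_t va1 = 0x10002040;   /* mapping in process A */
    uint32_t va2 = 0x7f3f2040;   /* mapping in process B */

    /* The virtual tags differ, so one physical line can live in the cache
     * twice; a store through va1 leaves the copy reached through va2 stale. */
    printf("va1: set %u, tag 0x%x\n", v_index(va1), v_tag(va1));
    printf("va2: set %u, tag 0x%x\n", v_index(va2), v_tag(va2));
    return 0;
}
```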
In V/P caches, cache indexing proceeds in parallel with address translation, hiding part of the translation latency; tag comparison takes place once both have completed. V/P caches consume more power and are slower than V/V caches, but are faster than physical caches. Their advantage over V/V caches is that cache consistency problems are easily avoided, so V/P caches can safely be used as data caches.
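A rough sketch of this access sequence is shown below; in hardware the virtual indexing and the TLB translation run concurrently rather than one after the other, and the translate and tag_match helpers are hypothetical (the former standing in for the TLB sketch above).

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BITS 5
#define SET_BITS  8

/* Hypothetical helpers: TLB-based translation (see the earlier sketch) and
 * a comparator over the tags stored in the selected set. */
extern uint32_t translate(uint32_t vaddr);
extern bool     tag_match(uint32_t set, uint32_t phys_tag);

/* V/P access: the set is selected with virtual bits while the TLB translates
 * the page number; only the final comparison needs the physical address.
 * In hardware the first two steps proceed in parallel, hiding part of the
 * translation latency. */
bool vp_cache_lookup(uint32_t vaddr)
{
    uint32_t set      = (vaddr >> LINE_BITS) & ((1u << SET_BITS) - 1); /* virtual index    */
    uint32_t paddr    = translate(vaddr);                              /* TLB, in parallel */
    uint32_t phys_tag = paddr >> (LINE_BITS + SET_BITS);

    return tag_match(set, phys_tag);
}
```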
The cache architecture proposed by the authors aims to combine the low power consumption and fast access of V/V caches with the freedom from consistency problems of V/P caches. The authors introduce a hybrid tagging scheme that uses virtual tags for private data and physical tags for shared data, relying on application-specific information to decide which kind of tag to use for a given virtual page.
Shared pages are identified through a combination of source-code annotation, compiler support, and additional hardware. The application source code declares shared data using #pragma directives. A portion of the virtual address space is reserved for shared data, and the compiler maps data declared as shared into that reserved region. Finally, combinational logic (for example, a three-input AND gate if the reserved region is identified by ones in the three most significant address bits) detects accesses to shared pages, as the sketch below illustrates.
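The sketch below mirrors that detection logic and the resulting hybrid tag choice in C. The reserved-region boundary (three most significant address bits all ones), the function names, and the fallback to translate() are illustrative assumptions, not the paper's exact layout.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BITS 5
#define SET_BITS  8

/* Hypothetical helper from the translation sketch above. */
extern uint32_t translate(uint32_t vaddr);

/* Assumed layout: shared data is linked into the region whose three most
 * significant virtual address bits are all ones (0xE0000000 and above).
 * In hardware this test is just a three-input AND gate on those bits. */
static bool is_shared(uint32_t vaddr)
{
    return (vaddr >> 29) == 0x7;
}

/* Hybrid tagging: private pages use the virtual tag, so no TLB access is
 * needed; shared pages fall back to the physical tag so that consistency
 * across processes is preserved. */
uint32_t select_tag(uint32_t vaddr)
{
    uint32_t addr = is_shared(vaddr) ? translate(vaddr) : vaddr;
    return addr >> (LINE_BITS + SET_BITS);
}
```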
The merits of the proposed technique are questionable. First, the annotation requirement means that existing applications must be modified. Second, applying the technique requires compiler support and changes to the processor logic.
The authors' presentation is not very clear: some ideas are stated repeatedly with slightly different phrasing, and the paper contains technical mistakes. For example, the authors incorrectly define cache aliasing as "a situation where the same virtual address from different tasks is mapped to different physical addresses." In fact, this defines homonyms [1].
Bardizbanyan, A., Gavin, P., Whalley, D., Sjalander, M., Larsson-Edefors, P., McKee, S., and Stenstrom, P. 2013. Improving data access efficiency by using a tagless access buffer (TAB). In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 1--11. DOI: 10.1109/CGO.2013.6495003.
Basu, A., Hill, M., Swift, M., Lu, S., and Torrellas, J. 2012. Reducing memory reference energy with opportunistic virtual caching. In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA), 297--308. DOI: 10.5555/2337159.2337194.
Yu, B., Dong, S., Ma, Y., Lin, T., Wang, Y., Chen, S., and Goto, S. 2011. Network flow-based simultaneous retiming and slack budgeting for low power design. In Proceedings of the 16th Asia and South Pacific Design Automation Conference, 473--478. DOI: 10.5555/1950815.1950913.
Zheng, L., Dong, M., Ota, K., Jin, H., Guo, S., and Ma, J. 2011. Energy efficiency of a multi-core processor by tag reduction. Journal of Computer Science and Technology 26, 3, 491--503. DOI: 10.1007/s11390-011-1149-0.
Zheng, L., Dong, M., Jin, H., Guo, M., Guo, S., and Tu, X. 2010. The core degree based tag reduction on chip multiprocessor to balance energy saving and performance overhead. In Proceedings of the 2010 IFIP International Conference on Network and Parallel Computing, 358--372. DOI: 10.5555/1882011.1882047.
Zheng, L., Dong, M., Ota, K., Li, H., Guo, S., and Guo, M. 2010. Exploring the limits of tag reduction for energy saving on a multi-core processor. In Proceedings of the 39th International Conference on Parallel Processing Workshops, 104--112. DOI: 10.1109/ICPPW.2010.26.
Zheng, L., Dong, M., Jin, H., Guo, M., Guo, S., and Tu, X. 2010. The core degree based tag reduction on chip multiprocessor to balance energy saving and performance overhead. In Network and Parallel Computing, 358--372. DOI: 10.1007/978-3-642-15672-4_30.