Software-based instruction caching for embedded processors

Published: 20 October 2006

Abstract

While hardware instruction caches are present in virtually all general-purpose and high-performance microprocessors today, many embedded processors use SRAM or scratchpad memories instead. These are simple array memory structures that are directly addressed and explicitly managed by software. Compared to hardware caches of the same data capacity, they are smaller, have shorter access times and consume less energy per access. Access times are also easier to predict with simple memories since there is no possibility of a "miss." On the other hand, they are more difficult for the programmer to use since they are not automatically managed.

In this paper, we present a software system that allows all or part of an SRAM or scratchpad memory to be automatically managed as a cache. This system provides the programming convenience of a cache for processors that lack dedicated caching hardware. It has been implemented for an actual processor and runs on real hardware. Our results show that a software-based instruction cache can be built that provides performance within 10% of a traditional hardware cache on many benchmarks while using a cheaper, simpler SRAM memory. On these same benchmarks, energy consumption is up to 3% lower than it would be using a hardware cache.
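As a rough illustration of what "managing a scratchpad as a cache" can mean, the following toy model keeps a tag table in software and copies a block of instructions from slow memory into a fixed scratchpad slot on a miss. It is only a sketch under stated assumptions: the class name `SoftwareICache`, the direct-mapped placement, and the block and slot sizes are all hypothetical choices made for this example, and it does not reproduce the system described in the abstract.

```python
class SoftwareICache:
    """Toy model of a scratchpad managed in software as a direct-mapped cache.

    All names and parameters here are illustrative assumptions.
    """

    def __init__(self, backing, num_slots=4, block_size=4):
        self.backing = backing            # slow main memory holding the program
        self.num_slots = num_slots        # number of block-sized scratchpad slots
        self.block_size = block_size      # instructions per block
        self.scratchpad = [None] * (num_slots * block_size)
        self.tags = [None] * num_slots    # tag table maintained by software
        self.hits = 0
        self.misses = 0

    def fetch(self, addr):
        """Return the instruction at addr, loading its block on a miss."""
        block = addr // self.block_size
        slot = block % self.num_slots     # direct-mapped placement (assumption)
        base = slot * self.block_size
        if self.tags[slot] != block:
            # Software miss handler: copy the block into its slot.
            self.misses += 1
            start = block * self.block_size
            chunk = self.backing[start:start + self.block_size]
            self.scratchpad[base:base + len(chunk)] = chunk
            self.tags[slot] = block
        else:
            self.hits += 1
        return self.scratchpad[base + addr % self.block_size]


if __name__ == "__main__":
    program = [f"insn{i}" for i in range(32)]
    cache = SoftwareICache(program)
    # Run an 8-instruction loop body twice; only the first pass misses.
    for _ in range(2):
        for pc in range(8):
            cache.fetch(pc)
    print(f"hits={cache.hits} misses={cache.misses}")  # prints "hits=14 misses=2"
```

In this model every fetch pays for a software tag check, which is the kind of overhead a real software cache has to keep small in order to stay close to hardware-cache performance.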

Published In

ACM SIGPLAN Notices, Volume 41, Issue 11
Proceedings of the 2006 ASPLOS Conference
November 2006
425 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1168918

ASPLOS XII: Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
October 2006
440 pages
ISBN: 1595934510
DOI: 10.1145/1168857
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. chaining
  2. instruction cache
  3. software caching

Qualifiers

  • Article
