Software-based instruction caching for embedded processors

Authors:

Jason E. Miller,

Anant Agarwal

ACM SIGPLAN Notices, Volume 41, Issue 11

Pages 293 - 302

https://doi.org/10.1145/1168918.1168894

Published: 20 October 2006

Abstract

While hardware instruction caches are present in virtually all general-purpose and high-performance microprocessors today, many embedded processors use SRAM or scratchpad memories instead. These are simple array memory structures that are directly addressed and explicitly managed by software. Compared to hardware caches of the same data capacity, they are smaller, have shorter access times and consume less energy per access. Access times are also easier to predict with simple memories since there is no possibility of a "miss." On the other hand, they are more difficult for the programmer to use since they are not automatically managed.

In this paper, we present a software system that allows all or part of an SRAM or scratchpad memory to be automatically managed as a cache. This system provides the programming convenience of a cache for processors that lack dedicated caching hardware. It has been implemented for an actual processor and runs on real hardware. Our results show that a software-based instruction cache can be built that provides performance within 10% of a traditional hardware cache on many benchmarks while using a cheaper, simpler SRAM memory. On these same benchmarks, energy consumption is up to 3% lower than it would be using a hardware cache.
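To make the idea of managing a plain memory array "as a cache" concrete, the following is a minimal illustrative sketch, not the authors' system, whose design is not described on this page. It assumes a direct-mapped organization with fixed-size blocks; the names `SoftwareICache`, `BLOCK_WORDS`, `NUM_SLOTS` and `run_loop` are hypothetical, and a Python list stands in for both the off-chip instruction memory and the scratchpad.

```python
from dataclasses import dataclass, field

BLOCK_WORDS = 4   # hypothetical block size, in instruction words
NUM_SLOTS = 8     # hypothetical number of block slots in the scratchpad


@dataclass
class SoftwareICache:
    """Toy direct-mapped instruction cache kept in a software-managed array."""
    backing: list  # stands in for off-chip instruction memory
    slots: list = field(default_factory=lambda: [None] * NUM_SLOTS)
    tags: list = field(default_factory=lambda: [None] * NUM_SLOTS)
    hits: int = 0
    misses: int = 0

    def fetch(self, addr: int):
        """Return the word at addr, copying its block in on a miss."""
        block = addr // BLOCK_WORDS
        slot = block % NUM_SLOTS
        if self.tags[slot] != block:
            # Miss: software copies the whole block into the scratchpad slot,
            # evicting whatever block previously occupied it.
            start = block * BLOCK_WORDS
            self.slots[slot] = self.backing[start:start + BLOCK_WORDS]
            self.tags[slot] = block
            self.misses += 1
        else:
            self.hits += 1
        return self.slots[slot][addr % BLOCK_WORDS]


def run_loop(cache: SoftwareICache, start: int, end: int, iterations: int) -> None:
    """Fetch the words in [start, end) repeatedly, like a loop body executing."""
    for _ in range(iterations):
        for pc in range(start, end):
            cache.fetch(pc)


if __name__ == "__main__":
    program = list(range(100, 164))      # 64 fake instruction words
    cache = SoftwareICache(backing=program)
    run_loop(cache, 0, 16, 10)           # a 16-word loop spans 4 blocks
    print(cache.misses, cache.hits)      # 4 misses (one per block), 156 hits
```

In this toy model a loop that fits in the scratchpad misses only on its first iteration, which is the behaviour a cache is meant to provide without the programmer placing code by hand. A real implementation would also have to handle control transfers between blocks, which this sketch ignores.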

Published In

ACM SIGPLAN Notices Volume 41, Issue 11

Proceedings of the 2006 ASPLOS Conference

November 2006

425 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/1168918

ASPLOS XII: Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
October 2006
440 pages
ISBN:1595934510
DOI:10.1145/1168857
General Chair: John Paul Shen (Intel Corp.)
Program Chair: Margaret R. Martonosi (Princeton University)

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 October 2006

Published in SIGPLAN Volume 41, Issue 11

Qualifiers

Article

Article Metrics

Total Citations: 31
Total Downloads: 1,132
Downloads (Last 12 months): 10
Downloads (Last 6 weeks): 0

Reflects downloads up to 15 Jan 2025

Cited By

Yan C, Joseph R (2016). Enabling Deep Voltage Scaling in Delay Sensitive L1 Caches. 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 192-202. Online publication date: Jun-2016.
https://doi.org/10.1109/DSN.2016.26
Kornaros G (2020). RSMCC: Enabling Ring-based Software Managed Cache-Coherent Embedded SoCs. 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pages 131-135. Online publication date: Mar-2020.
https://doi.org/10.1109/PDP50117.2020.00026
Huangfu Y, Zhang W (2015). Hardware-Based Performance Enhancement Guaranteed Caches. Proceedings of the 2015 IEEE 18th International Symposium on Real-Time Distributed Computing, pages 166-173. Online publication date: 13-Apr-2015.
https://dl.acm.org/doi/10.1109/ISORC.2015.11
Shameedha Begum B, Ramasubramanian N (2015). A comparative study of cache performance for embedded applications. 2015 International Conference on Computing and Network Communications (CoCoNet), pages 872-876. Online publication date: Dec-2015.
https://doi.org/10.1109/CoCoNet.2015.7411292
Huangfu Y, Zhang W (2014). PEG-C: Performance Enhancement Guaranteed Cache for Hard Real-Time Systems. IEEE Embedded Systems Letters 6:2, pages 17-20. Online publication date: Jun-2014.
https://doi.org/10.1109/LES.2013.2296779
Huangfu Y, Zhang W (2014). A Real-Time Instruction Cache with High Average-Case Performance. Proceedings of the 2014 IEEE 17th International Symposium on Object/Component-Oriented Real-Time Distributed Computing, pages 109-116. Online publication date: 10-Jun-2014.
https://dl.acm.org/doi/10.1109/ISORC.2014.59
Ferreira R, Rolt J, Nazar G, Moreira Á, Carro L (2014). Adaptive Low-Power Architecture for High-Performance and Reliable Embedded Computing. Proceedings of the 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pages 538-549. Online publication date: 23-Jun-2014.
https://dl.acm.org/doi/10.1109/DSN.2014.56
Baiocchi J, Childers B, Davidson J, Hiser J (2013). Enabling dynamic binary translation in embedded systems with scratchpad memory. ACM Transactions on Embedded Computing Systems 11:4, pages 1-33. Online publication date: 1-Jan-2013.
https://dl.acm.org/doi/10.1145/2362336.2399178
Guha A, Hazelwood K, Soffa M (2012). Memory optimization of dynamic binary translators for embedded systems. ACM Transactions on Architecture and Code Optimization 9:3, pages 1-29. Online publication date: 5-Oct-2012.
https://dl.acm.org/doi/10.1145/2355585.2355595
Deng N, Ji W, Li J, Zuo Q, Shi F (2011). Core Working Set Based Scratchpad Memory Management. IEICE Transactions on Information and Systems E94-D:2, pages 274-285. Online publication date: 2011.
https://doi.org/10.1587/transinf.E94.D.274
