LMStr: exploring shared hardware controlled scratchpad memory for multicores

Authors:

Nafiul Alam Siddique,

Abdel-Hameed A. Badawy,

David Resnick

MEMSYS '17: Proceedings of the International Symposium on Memory Systems

Pages 152 - 165

https://doi.org/10.1145/3132402.3132440

Published: 02 October 2017

Abstract

In this paper, we present an on-chip memory store called the "Local Memory Store (LMStr)", which can be used alongside a regular cache hierarchy or on its own as a redesigned scratchpad memory (SPM). The LMStr is a special kind of SPM shared among the cores of a multicore processor. Management of the store itself is hardware-controlled, yet compiler support is instrumental in deciding which data items and types should live in the store. Critical data should be placed in the LMStr according to its type (i.e., local, global, static, or temporary). The programmer can also provide hints to the compiler, at will, to place certain data items in the LMStr. We evaluate our design using a matrix multiplication micro-application and multiple Mantevo mini-applications. Our results show that LMStr improves data movement by up to 21% compared to cache alone, with a mere 3% area overhead. LMStr also improves the cycles per memory access by up to 40%, and it is projected to consume up to 85% less dynamic energy than a traditional cache.

Cited By

Siddique N, Grubel P, Badawy A, Cook J (2018) A performance study of the time-varying cache behavior. The Journal of Supercomputing 74:2 (665-695). DOI: 10.1007/s11227-017-2144-1. Online publication date: 1-Feb-2018
https://dl.acm.org/doi/10.1007/s11227-017-2144-1

Index Terms
1. Hardware
  1. Integrated circuits
    1. Semiconductor memory
2. Software and its engineering
  1. Software notations and tools

Published In

MEMSYS '17: Proceedings of the International Symposium on Memory Systems

October 2017

409 pages

ISBN:9781450353359

DOI:10.1145/3132402

General Chair:
Bruce Jacob
University of Maryland

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

U.S. Army Research Laboratory (ARL)

Conference

MEMSYS 2017

MEMSYS 2017: The International Symposium on Memory Systems, 2017

October 2 - 5, 2017

Alexandria, Virginia
