DOI: 10.1145/3132402.3132440

LMStr: exploring shared hardware controlled scratchpad memory for multicores

Published: 02 October 2017

Abstract

In this paper, we present an on-chip memory store called the "Local Memory Store (LMStr)", which can be used alongside a regular cache hierarchy or on its own as a redesigned scratchpad memory (SPM). The LMStr is a special kind of SPM shared among the cores of a multicore processor. The store itself is managed by hardware, yet compiler support is instrumental in deciding which data items and types should live in it. Critical data should be placed in the LMStr according to its type (i.e., local, global, static, or temporary). The programmer can optionally provide hints to the compiler to place certain data items in the LMStr. We evaluate our design using a matrix multiplication micro-application and multiple Mantevo mini-applications. Our results show that LMStr improves data movement by up to 21% compared to cache alone, with a mere 3% area overhead. LMStr also improves the cycles per memory access by up to 40%, and it projects up to 85% less dynamic energy consumption compared to a traditional cache.
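
The abstract says that data is steered into the LMStr by type, optionally guided by programmer hints, but it does not spell out the policy. The sketch below is only an illustration of that idea under stated assumptions: the names `DataKind`, `DataItem`, `CRITICAL_KINDS`, and `place`, the choice of which kinds count as critical, and the greedy hint-first ordering are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class DataKind(Enum):
    """The four data types named in the abstract."""
    LOCAL = "local"
    GLOBAL = "global"
    STATIC = "static"
    TEMPORARY = "temporary"


@dataclass(frozen=True)
class DataItem:
    name: str
    kind: DataKind
    size_bytes: int
    hinted: bool = False  # hypothetical programmer hint to the compiler


# Assumption for illustration only: which kinds are treated as critical.
CRITICAL_KINDS = {DataKind.LOCAL, DataKind.TEMPORARY}


def place(items, lmstr_capacity_bytes):
    """Map each item name to "lmstr" or "cache" with a greedy, hint-first policy."""
    placement = {}
    free = lmstr_capacity_bytes
    # Hinted items are considered first, then critical kinds, then everything else.
    ordered = sorted(
        items,
        key=lambda item: (not item.hinted, item.kind not in CRITICAL_KINDS),
    )
    for item in ordered:
        wants_lmstr = item.hinted or item.kind in CRITICAL_KINDS
        if wants_lmstr and item.size_bytes <= free:
            placement[item.name] = "lmstr"
            free -= item.size_bytes
        else:
            # Anything that is not selected, or does not fit, stays in the cache hierarchy.
            placement[item.name] = "cache"
    return placement
```

In this toy model a hinted global item can claim space ahead of an unhinted local one, which is one plausible reading of "hints" but not a claim about how the actual compiler support behaves.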

Cited By

  • (2018) A performance study of the time-varying cache behavior. The Journal of Supercomputing 74:2 (665-695). DOI: 10.1007/s11227-017-2144-1. Online publication date: 1-Feb-2018

      Published In

      MEMSYS '17: Proceedings of the International Symposium on Memory Systems
      October 2017
      409 pages
      ISBN:9781450353359
      DOI:10.1145/3132402

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Research-article

      Funding Sources

      • U.S. Army Research Laboratory (ARL)

      Conference

      MEMSYS 2017
