research-article

Scratchpad-Memory Management for Multi-Threaded Applications on Many-Core Architectures

Published: 05 February 2019

    Abstract

    Contemporary many-core architectures, such as Adapteva Epiphany and Sunway TaihuLight, employ per-core software-controlled Scratchpad Memory (SPM) rather than caches for better performance-per-watt and predictability. In these architectures, a core is allowed to access its own SPM as well as remote SPMs through the Network-On-Chip (NoC). However, the compiler/programmer is required to explicitly manage the movement of data between SPMs and off-chip memory. Utilizing SPMs for multi-threaded applications is even more challenging, as the shared variables across the threads need to be placed appropriately. Accessing variables from remote SPMs with higher access latency further complicates this problem as certain links in the NoC may be heavily contended by multiple threads. Therefore, certain variables may need to be replicated in multiple SPMs to reduce the contention delay and/or the overall access time. We present Coordinated Data Management (CDM), a compile-time framework that automatically identifies shared/private variables and places them with replication (if necessary) to suitable on-chip or off-chip memory, taking NoC contention into consideration. We develop both an exact Integer Linear Programming (ILP) formulation as well as an iterative, scalable algorithm for placing the data variables in multi-threaded applications on many-core SPMs. Experimental evaluation on the Parallella hardware platform confirms that our allocation strategy reduces the overall execution time and energy consumption by 1.84× and 1.83×, respectively, when compared to the existing approaches.
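    The placement trade-off described in the abstract can be illustrated with a toy cost model. The sketch below is not the paper's ILP formulation or its iterative algorithm; it is a simple greedy heuristic written for illustration only. The latency constants (`local`, `per_hop`, `offchip`), the function names, and the assumption that every variable is read-only (so replicas need no coherence) are all hypothetical, and NoC link contention is not modeled.

```python
def hops(a, b):
    """Manhattan distance between two (row, col) cores on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def access_cost(copies, readers, local=1, per_hop=2, offchip=100):
    """Total access cost of one variable.

    copies  : set of cores whose SPM holds a copy (empty set = off-chip only)
    readers : dict mapping core -> number of accesses from that core
    Each reader uses its nearest copy; latencies are made-up constants.
    """
    total = 0
    for core, count in readers.items():
        if copies:
            nearest = min(hops(core, c) for c in copies)
            total += count * (local + per_hop * nearest)
        else:
            total += count * offchip
    return total


def place(variables, cores, capacity, **lat):
    """Greedy placement with optional replication.

    variables : dict mapping name -> (size, readers)
    Repeatedly adds the single (variable, core) copy that gives the largest
    cost reduction per unit of SPM space, until no copy helps or fits.
    """
    free = {c: capacity for c in cores}
    copies = {v: set() for v in variables}
    while True:
        best = None
        for v, (size, readers) in variables.items():
            base = access_cost(copies[v], readers, **lat)
            for c in cores:
                if c in copies[v] or free[c] < size:
                    continue
                gain = base - access_cost(copies[v] | {c}, readers, **lat)
                if gain > 0 and (best is None or gain / size > best[0]):
                    best = (gain / size, v, c)
        if best is None:
            return copies
        _, v, c = best
        copies[v].add(c)
        free[c] -= size


# Example: a 2x2 mesh where "table" is shared by two opposite corners.
cores = [(0, 0), (0, 1), (1, 0), (1, 1)]
variables = {
    "table": (4, {(0, 0): 100, (1, 1): 100}),  # shared, read by two cores
    "priv": (4, {(0, 1): 50}),                 # private to one core
}
placement = place(variables, cores, capacity=8)
```

    In this example the shared variable ends up replicated next to both of its readers, while the private variable gets a single copy in its owner's SPM. An exact formulation would instead encode the same capacity and cost terms as constraints and an objective for an ILP solver.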

      Reviews

      Joseph M. Arul

      This paper focuses on improving many-core architectures via software-programmable, or scratchpad, memory (SPM): An SPM contains an array of [static random-access memory, SRAM] cells. A portion of the memory address space is dedicated to the SPM. Any address that falls within this dedicated address space can directly index into the SPM to access the corresponding data. Thus, by maintaining a dedicated area, the "coherency among multiple SPMs" at the software level can be eliminated. This use of software-level access to the data "thereby eliminat[es] the hardware area/power required for cache coherence," as well as cache access. In a many-core environment, data accesses across many cores can drastically reduce performance because of coherency issues and the long delays incurred when accessing data held by other cores. In a many-core, multi-threaded architecture, on-chip and off-chip data accesses can be nonuniform, long-latency, and irregular. To overcome these difficulties, the paper proposes "a compile-time, coordinated data management framework called CDM, for many-core SPMs."

      For this paper, "the 16-core Epiphany SoC consists of an array of simple RISC processors (eCores) programmable in C connected together in a 2D-mesh NOC and supporting a single shared address space." Because a Xilinx Zynq system on chip (SoC) supports these eCores on the same development board, it is more energy efficient, unlike traditional cache memory. The eCores can access not only local memory but also remote memory. Several kernel applications from embedded, multithreaded benchmarks are used in the evaluation, including two benchmarks related to the decryption and encryption of data (AESD and AESE) and three long-term evolution (LTE) benchmarks (PHY_ACI, PHY_DEMAP, and PHY_MICF).

      The authors use a GREEDY approach as their baseline; SNAP-S allows only one copy of data, and SNAP-M uses a replication mechanism. As a result, "the SNAP-M approach provides an average speed-up of 1.84x and an energy reduction of 1.83x when compared to the GREEDY strategy." The SNAP-S approach "provides an average speed-up and energy reduction of 1.09x." Thus, both approaches speed up execution and reduce energy usage because they avoid cache-like memory, which consumes more power when data is accessed. The authors take advantage of bringing off-chip data into on-chip memory and not using cache-like memory; the use of an SoC reduces energy consumption.

      Currently, a new type of memory is on the rise that can drastically reduce power consumption and is faster than DRAM and cache. When such memory comes into use, this paper will be obsolete. The overhead of bringing off-chip data into on-chip memory must also be considered. Besides, the SNAP-S speed-up compared to the GREEDY strategy is not significant; significant improvement is observed only when the data is replicated. One would expect a significant reduction with the SNAP-S strategy, because it too converts remote memory accesses into local ones; however, that is not seen in the experimental results.
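      The review's description of a dedicated address range that indexes directly into an SPM, within a single shared address space, can be made concrete with a small address-decoding sketch. The bit layout used here (6-bit mesh row, 6-bit mesh column, 20-bit local offset) is an assumption chosen for illustration and should be checked against the Epiphany Architecture Reference Manual; the function names are hypothetical.

```python
OFFSET_BITS = 20  # assumed width of the per-core local offset
COORD_BITS = 6    # assumed width of each mesh coordinate (row, column)


def global_address(row, col, offset):
    """Build a global address that selects one core's SPM and an offset in it."""
    if not (0 <= row < (1 << COORD_BITS) and 0 <= col < (1 << COORD_BITS)):
        raise ValueError("mesh coordinate out of range")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset out of range")
    core_id = (row << COORD_BITS) | col
    return (core_id << OFFSET_BITS) | offset


def decode(addr):
    """Split a global address back into (row, col, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    core_id = addr >> OFFSET_BITS
    row = core_id >> COORD_BITS
    col = core_id & ((1 << COORD_BITS) - 1)
    return row, col, offset


def is_local(addr, row, col):
    """True if addr falls in the SPM of the core at (row, col)."""
    r, c, _ = decode(addr)
    return (r, c) == (row, col)


addr = global_address(32, 8, 0x100)
```

      Under this layout, whether an access is local or remote is decided purely by the upper address bits, so no tag lookup is involved; a remote access would additionally pay the NoC latency between the two cores.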

      Published In

      ACM Transactions on Embedded Computing Systems  Volume 18, Issue 1
      Special Issue on MEMOCODE 2017 and Regular Papers
      January 2019
      259 pages
      ISSN:1539-9087
      EISSN:1558-3465
      DOI:10.1145/3305158
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 05 February 2019
      Accepted: 01 December 2018
      Revised: 01 July 2018
      Received: 01 December 2017
      Published in TECS Volume 18, Issue 1

      Author Tags

      1. Scratchpad memory management
      2. many-core architectures

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • National Research Foundation, Prime Minister’s Office, Singapore under its Industry-IHL Partnership Grant and Huawei International Pte. Ltd.

      Bibliometrics & Citations

      Article Metrics

      • Downloads (Last 12 months): 57
      • Downloads (Last 6 weeks): 2
      Reflects downloads up to 26 Jul 2024

      Cited By

      • (2024) A Survey of MPSoC Management toward Self-Awareness. Micromachines 15:5 (577). DOI: 10.3390/mi15050577. Online publication date: 26-Apr-2024
      • (2024) Optimizing code allocation for hybrid on-chip memory in IoT systems. Integration 97 (102195). DOI: 10.1016/j.vlsi.2024.102195. Online publication date: Jul-2024
      • (2022) A Classy Memory Management System (CyM2S) using an Isolated Dynamic Two-Level Memory Allocation (ID2LMA) Algorithm for the Real Time Embedded Systems. International Journal of Electrical and Electronics Research 10:2 (387-393). DOI: 10.37391/ijeer.100254. Online publication date: 30-Jun-2022
      • (2022) MASTER: Reclamation of Hybrid Scratchpad Memory to Maximize Energy Saving in Multi-Core Edge Systems. IEEE Transactions on Sustainable Computing 7:4 (749-760). DOI: 10.1109/TSUSC.2021.3049447. Online publication date: 1-Oct-2022
      • (2022) ASCENT: Communication Scheduling for SDF on Bufferless Software-Defined NoC. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41:10 (3266-3275). DOI: 10.1109/TCAD.2021.3128445. Online publication date: 1-Oct-2022
      • (2022) Optimizing data placement and size configuration for morphable NVM based SPM in embedded multicore systems. Future Generation Computer Systems 135 (270-282). DOI: 10.1016/j.future.2022.05.005. Online publication date: Oct-2022
      • (2022) Lesser Evil: Embracing Failure to Protect Overall System Availability. Distributed Applications and Interoperable Systems (57-73). DOI: 10.1007/978-3-031-16092-9_5. Online publication date: 6-Sep-2022
      • (2021) The Deep Learning Solutions on Lossless Compression Methods for Alleviating Data Load on IoT Nodes in Smart Cities. Sensors 21:12 (4223). DOI: 10.3390/s21124223. Online publication date: 20-Jun-2021
      • (2021) Design Space Optimization of Shared Memory Architecture in Accelerator-rich Systems. ACM Transactions on Design Automation of Electronic Systems 26:4 (1-31). DOI: 10.1145/3446001. Online publication date: 13-Mar-2021
      • (2021) ParalOS: A Scheduling & Memory Management Framework for Heterogeneous VPUs. 2021 24th Euromicro Conference on Digital System Design (DSD) (221-228). DOI: 10.1109/DSD53832.2021.00043. Online publication date: Sep-2021
