research-article

An instruction-scheduling-aware data partitioning technique for coarse-grained reconfigurable architectures

Published: 11 April 2011

Abstract

In this paper, we propose a data partitioning technique for a memory subsystem that consists of a multi-ported scratchpad memory (SPM) unit and a single-ported data cache in a coarse-grained reconfigurable array (CGRA) architecture. The embedded reconfigurable processor achieves high performance by switching between Non-VLIW and VLIW modes according to the type of code region being executed. The VLIW mode targets code regions with high ILP, which require high memory bandwidth; the Non-VLIW mode targets regions with low ILP, which require low memory latency. Our technique partitions data between the SPM and the data cache using data interference graph reduction and profiling information. Given an SPM size, it finds the optimal data partitions by taking the VLIW instruction schedule into consideration. We evaluate the technique for the CGRA architecture with three representative multimedia applications.
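
To make the underlying problem concrete, the following is a minimal sketch of capacity-constrained partitioning between an SPM and a data cache. It is not the paper's algorithm: the abstract describes data interference graph reduction guided by the VLIW instruction schedule, whereas this sketch substitutes a plain 0/1 knapsack over profiled benefits. The function name `partition_for_spm`, the object names, and the sizes and benefit values are all hypothetical.

```python
from typing import Dict, List, Tuple


def partition_for_spm(
    objects: Dict[str, Tuple[int, int]],  # name -> (size in bytes, profiled benefit)
    spm_size: int,
) -> Tuple[List[str], List[str]]:
    """Split data objects into (SPM, cache) sets with a 0/1 knapsack.

    Assumption: each object's benefit of living in the SPM is a single
    profiled number, independent of the other objects. The paper's
    schedule-aware interference analysis is NOT modeled here.
    """
    names = list(objects)
    # best[c] = (total benefit, chosen names) using at most c bytes of SPM.
    best: List[Tuple[int, List[str]]] = [(0, [])] * (spm_size + 1)
    for name in names:
        size, benefit = objects[name]
        # Walk capacities downward so each object is placed at most once.
        for cap in range(spm_size, size - 1, -1):
            candidate = best[cap - size][0] + benefit
            if candidate > best[cap][0]:
                best[cap] = (candidate, best[cap - size][1] + [name])
    in_spm = best[spm_size][1]
    in_cache = [n for n in names if n not in in_spm]
    return in_spm, in_cache


# Hypothetical profile: sizes in bytes, benefits in arbitrary units.
profile = {
    "coeff": (256, 900),
    "frame": (1024, 1200),
    "table": (512, 700),
    "tmp": (768, 300),
}
spm, cache = partition_for_spm(profile, spm_size=1024)
print(spm, cache)  # ['coeff', 'table'] ['frame', 'tmp']
```

Under these made-up numbers, placing `coeff` and `table` in the SPM (768 bytes, benefit 1600) beats placing `frame` alone (1024 bytes, benefit 1200). A schedule-aware formulation would additionally account for which accesses are issued in the same VLIW cycle, which this sketch ignores.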

    Published In

    LCTES '11: Proceedings of the 2011 SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
    April 2011
    182 pages
    ISBN: 9781450305556
    DOI: 10.1145/1967677
    • ACM SIGPLAN Notices, Volume 46, Issue 5
      LCTES '10
      May 2011
      170 pages
      ISSN: 0362-1340
      EISSN: 1558-1160
      DOI: 10.1145/2016603
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. coarse grained reconfigurable arrays
    2. compilers
    3. data partitioning
    4. instruction scheduling
    5. vliw

    Qualifiers

    • Research-article

    Conference

    LCTES '11

    Acceptance Rates

    Overall Acceptance Rate 116 of 438 submissions, 26%

    Cited By

    • (2018) Customizable embedded processor array for multimedia applications. Integration, the VLSI Journal 60:C (213-223). DOI: 10.1016/j.vlsi.2017.09.009. Online publication date: 1-Jan-2018
    • (2018) Coarse-Grained Reconfigurable Array Architectures. Handbook of Signal Processing Systems (427-472). DOI: 10.1007/978-3-319-91734-4_12. Online publication date: 14-Oct-2018
    • (2016) A Bimodal Scheduler for Coarse-Grained Reconfigurable Arrays. ACM Transactions on Architecture and Code Optimization 13:2 (1-26). DOI: 10.1145/2893475. Online publication date: 6-Jun-2016
    • (2016) Intra mode power saving methodology for CGRA-based reconfigurable processor architectures. 2016 IEEE International Symposium on Circuits and Systems (ISCAS) (714-717). DOI: 10.1109/ISCAS.2016.7527340. Online publication date: May-2016
    • (2014) Retargetable automatic generation of compound instructions for CGRA based reconfigurable processor applications. Proceedings of the 2014 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (1-9). DOI: 10.1145/2656106.2656125. Online publication date: 12-Oct-2014
    • (2013) Hybrid compile and run-time memory management for a 3D-stacked reconfigurable accelerator. Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems (1-10). DOI: 10.5555/2555729.2555739. Online publication date: 29-Sep-2013
    • (2013) BilRC. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 21:7 (1285-1298). DOI: 10.1109/TVLSI.2012.2207748. Online publication date: 1-Jul-2013
    • (2013) Hybrid compile and run-time memory management for a 3D-stacked reconfigurable accelerator. 2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES) (1-10). DOI: 10.1109/CASES.2013.6662514. Online publication date: Sep-2013
    • (2012) Function inlining and loop unrolling for loop acceleration in reconfigurable processors. Proceedings of the 2012 international conference on Compilers, architectures and synthesis for embedded systems (101-110). DOI: 10.1145/2380403.2380426. Online publication date: 7-Oct-2012
