DOI: 10.1145/2228360.2228598
research-article

Unrolling and retiming of stream applications onto embedded multicore processors

Published: 03 June 2012
  • Abstract

    In recent years, stream applications have become prevalent in many embedded domains. Stream applications differ from traditional sequential programs in having well-defined independent actors, explicit data communication, and stable code and data access patterns. To achieve high performance and low power, today's embedded multicore processors have introduced scratchpad memory (SPM). Programming SPM-based architectures is both challenging and time-consuming. In this paper, we address the problem of automatically compiling stream applications onto SPM-based embedded multicore processors through unrolling and retiming. In our technique, code overlay and data overlay are implemented to overcome the limited SPM capacity. Smart double buffering and code prefetching are introduced to amortize memory access delays. We evaluated the efficiency of our technique by compiling several stream applications onto the IBM Cell processor and comparing their performance with existing approaches.
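
    To make the idea of amortizing memory access delays concrete, the following is a minimal sketch of a generic double-buffering cost model. It is not the paper's algorithm: the function names, the two-buffer assumption, the single transfer engine, and the per-block timings are all hypothetical and chosen only for illustration. With one buffer, each block is fetched and then processed in sequence. With two buffers, the fetch of block i+1 can overlap the computation on block i, so each step costs the larger of the two.

```python
# Illustrative cost model only (an assumption, not the paper's method).
# fetch[i]   = hypothetical time to transfer block i into the scratchpad
# compute[i] = hypothetical time for an actor to process block i


def single_buffered_time(fetch, compute):
    """Total time when every fetch must finish before its compute starts
    and the next fetch cannot begin until that compute is done."""
    return sum(f + c for f, c in zip(fetch, compute))


def double_buffered_time(fetch, compute):
    """Total time with two buffers and one transfer engine: the fetch of
    block i+1 runs concurrently with the computation on block i."""
    if not fetch:
        return 0
    total = fetch[0]  # the first fetch cannot be hidden behind any compute
    for i, c in enumerate(compute):
        # The last block has no successor to prefetch.
        next_fetch = fetch[i + 1] if i + 1 < len(fetch) else 0
        total += max(c, next_fetch)
    return total


if __name__ == "__main__":
    fetch = [2, 2, 2, 2]
    compute = [3, 3, 3, 3]
    print(single_buffered_time(fetch, compute))  # 20
    print(double_buffered_time(fetch, compute))  # 14
```

    In this model the transfers are fully hidden only when each compute time is at least as long as the next fetch time. When transfers dominate, as with `fetch = [4, 4]` and `compute = [1, 1]`, double buffering saves little (9 versus 10 time units).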

      Published In

      DAC '12: Proceedings of the 49th Annual Design Automation Conference
      June 2012
      1357 pages
      ISBN:9781450311991
      DOI:10.1145/2228360
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. SPM
      2. multicore
      3. overlay
      4. retiming
      5. stream
      6. unrolling

      Qualifiers

      • Research-article

      Conference

      DAC '12: The 49th Annual Design Automation Conference 2012
      June 3 - 7, 2012
      San Francisco, California

      Acceptance Rates

      Overall Acceptance Rate 1,770 of 5,499 submissions, 32%

      Article Metrics

      • Downloads (last 12 months): 1
      • Downloads (last 6 weeks): 0
      Reflects downloads up to 11 Aug 2024.

      Cited By

      • (2021) Parallel Scheduling of Multiple SDF Graphs Onto Heterogeneous Processors. IEEE Access 9, 20493-20507. DOI: 10.1109/ACCESS.2021.3054725. Online publication date: 2021
      • (2019) Mapping and scheduling techniques in NoC: A survey of the state of the art. 2019 International Conference on Networking and Advanced Systems (ICNAS), 1-6. DOI: 10.1109/ICNAS.2019.8807815. Online publication date: Jun-2019
      • (2018) Partition Based Product Term Retiming for Reliable Low Power Logic Structure. Advances in Information and Communication Networks, 178-189. DOI: 10.1007/978-3-030-03402-3_13. Online publication date: 6-Dec-2018
      • (2016) Multiconstraint Static Scheduling of Synchronous Dataflow Graphs Via Retiming and Unfolding. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35:6, 905-918. DOI: 10.1109/TCAD.2015.2495167. Online publication date: Jun-2016
      • (2015) Scheduling stream programs with improving arithmetic unit usage on NoC-based VLIW multi-core architectures. Proceedings of the 12th ACM International Conference on Computing Frontiers, 1-8. DOI: 10.1145/2742854.2742872. Online publication date: 6-May-2015
      • (2015) Runtime Resource Allocation for Software Pipelines. ACM Transactions on Parallel Computing 2:1, 1-23. DOI: 10.1145/2742347. Online publication date: 21-May-2015
      • (2013) Mapping on multi/many-core systems. Proceedings of the 50th Annual Design Automation Conference, 1-10. DOI: 10.1145/2463209.2488734. Online publication date: 29-May-2013
