DOI: 10.1145/1815961.1816020

Data marshaling for multi-core architectures

Published: 19 June 2010

Abstract

Previous research has shown that Staged Execution (SE), i.e., dividing a program into segments and executing each segment at the core that has the data and/or functionality to best run that segment, can improve performance and save power. However, SE's benefit is limited because most segments access inter-segment data, i.e., data generated by the previous segment. When consecutive segments run on different cores, accesses to inter-segment data incur cache misses, thereby reducing performance. This paper proposes Data Marshaling (DM), a new technique to eliminate cache misses to inter-segment data. DM uses profiling to identify instructions that generate inter-segment data, and adds only 96 bytes/core of storage overhead. We show that DM significantly improves the performance of two promising Staged Execution models, Accelerated Critical Sections and producer-consumer pipeline parallelism, on both homogeneous and heterogeneous multi-core systems. In both models, DM can achieve almost all of the potential of ideally eliminating cache misses to inter-segment data. DM's performance benefit increases with the number of cores.
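
To make the cache-miss argument concrete, here is a minimal toy sketch in Python. It is not the paper's hardware design: the abstract does not spell out the mechanism, so the sketch assumes that marshaling means pushing the cache lines written by the previous segment into the next core's private cache before the next segment starts. The names `Core` and `count_consumer_misses` are hypothetical, and each cache is modeled as an unbounded set of line addresses.

```python
# Toy model of Staged Execution with two cores (illustrative assumption,
# not the mechanism described in the paper).

class Core:
    """A core with a private cache, modeled as a set of line addresses."""

    def __init__(self, name):
        self.name = name
        self.cache = set()
        self.misses = 0

    def write(self, addr):
        # Writing a line brings it into this core's private cache.
        self.cache.add(addr)

    def read(self, addr):
        # A read to a line held elsewhere counts as a miss, then fills the cache.
        if addr not in self.cache:
            self.misses += 1
            self.cache.add(addr)


def count_consumer_misses(inter_segment_lines, marshal):
    """Run two consecutive segments on different cores and return the
    number of misses the second core takes on inter-segment data."""
    lines = list(inter_segment_lines)
    producer, consumer = Core("core0"), Core("core1")

    # Segment 1: the producer generates the inter-segment data.
    for addr in lines:
        producer.write(addr)

    if marshal:
        # Assumed marshaling step: move the generated lines to the core
        # that will run the next segment, before that segment begins.
        for addr in lines:
            producer.cache.discard(addr)
            consumer.cache.add(addr)

    # Segment 2: the consumer reads the data generated by segment 1.
    for addr in lines:
        consumer.read(addr)
    return consumer.misses


if __name__ == "__main__":
    print(count_consumer_misses(range(8), marshal=False))
    print(count_consumer_misses(range(8), marshal=True))
```

Without the marshaling step, every inter-segment line misses in the second core's cache; with it, the second segment finds all of them locally. The model ignores cache capacity, coherence traffic, and the cost of the transfer itself.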




    Published In

    ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture
    June 2010
    520 pages
    ISBN: 9781450300537
    DOI: 10.1145/1815961

    ACM SIGARCH Computer Architecture News, Volume 38, Issue 3 (ISCA '10)
    June 2010
    508 pages
    ISSN: 0163-5964
    DOI: 10.1145/1816038

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    In-Cooperation

    • IEEE CS

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. cmp
    2. critical sections
    3. pipelining
    4. staged execution

    Qualifiers

    • Research-article

    Conference

    ISCA '10
    Acceptance Rates

    Overall Acceptance Rate 543 of 3,203 submissions, 17%

    Cited By

    • (2023) ALP: Alleviating CPU-Memory Data Movement Overheads in Memory-Centric Systems. IEEE Transactions on Emerging Topics in Computing 11:2 (388-403). DOI: 10.1109/TETC.2022.3226132. Online publication date: 1-Apr-2023
    • (2021) SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (263-276). DOI: 10.1109/HPCA51647.2021.00031. Online publication date: Feb-2021
    • (2021) DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks. IEEE Access 9 (134457-134502). DOI: 10.1109/ACCESS.2021.3110993. Online publication date: 2021
    • (2021) Migration in Hardware Transactional Memory on Asymmetric Multiprocessor. IEEE Access 9 (69346-69364). DOI: 10.1109/ACCESS.2021.3077539. Online publication date: 2021
    • (2018) CDPM: Context-Directed Pattern Matching Prefetching to Improve Coarse-Grained Reconfigurable Array Performance. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37:6 (1171-1184). DOI: 10.1109/TCAD.2017.2748026. Online publication date: Jun-2018
    • (2016) μC-States. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation (17-30). DOI: 10.1145/2967938.2967941. Online publication date: 11-Sep-2016
    • (2016) Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation (31-44). DOI: 10.1145/2967938.2967940. Online publication date: 11-Sep-2016
    • (2016) A Survey of Techniques for Architecting and Managing Asymmetric Multicore Processors. ACM Computing Surveys 48:3 (1-38). DOI: 10.1145/2856125. Online publication date: 8-Feb-2016
    • (2014) Research Problems and Opportunities in Memory Systems. Supercomputing Frontiers and Innovations: an International Journal 1:3 (19-55). DOI: 10.14529/jsfi140302. Online publication date: 12-Oct-2014
    • (2014) The benefit of SMT in the multi-core era. ACM SIGARCH Computer Architecture News 42:1 (591-606). DOI: 10.1145/2654822.2541954. Online publication date: 24-Feb-2014
