Data marshaling for multi-core architectures

Authors:

M. Aater Suleman,

Yale N. Patt

ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture

Pages 441 - 450

https://doi.org/10.1145/1815961.1816020

Published: 19 June 2010

Abstract

Previous research has shown that Staged Execution (SE), i.e., dividing a program into segments and executing each segment at the core that has the data and/or functionality to best run that segment, can improve performance and save power. However, SE's benefit is limited because most segments access inter-segment data, i.e., data generated by the previous segment. When consecutive segments run on different cores, accesses to inter-segment data incur cache misses, thereby reducing performance. This paper proposes Data Marshaling (DM), a new technique to eliminate cache misses to inter-segment data. DM uses profiling to identify instructions that generate inter-segment data, and adds only 96 bytes/core of storage overhead. We show that DM significantly improves the performance of two promising Staged Execution models, Accelerated Critical Sections and producer-consumer pipeline parallelism, on both homogeneous and heterogeneous multi-core systems. In both models, DM can achieve almost all of the potential of ideally eliminating cache misses to inter-segment data. DM's performance benefit increases with the number of cores.
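
As a rough illustration of the problem the abstract describes, the toy model below counts consumer-side misses to inter-segment data in a two-stage producer-consumer pipeline, with and without a step that pushes the producer's output to the consumer's cache before the next segment starts. This is a sketch under simplifying assumptions, not the paper's mechanism: each private cache is an unbounded set of line addresses, profiling and the hardware that tracks the data are not modeled, and the names `Core`, `run_pipeline`, and `marshal` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Toy core whose private cache is an unbounded set of line addresses."""
    cache: set = field(default_factory=set)
    misses: int = 0

    def write(self, line: int) -> None:
        # Writing a line brings it into this core's cache.
        self.cache.add(line)

    def read(self, line: int) -> None:
        # A read of a line that is not present locally counts as one miss.
        if line not in self.cache:
            self.misses += 1
            self.cache.add(line)


def run_pipeline(num_items: int, lines_per_item: int, marshal: bool) -> int:
    """Run a two-stage pipeline and return the consumer-side miss count.

    Stage 1 (producer core) writes `lines_per_item` cache lines per work item.
    Stage 2 (consumer core) reads the same lines: the inter-segment data.
    """
    producer, consumer = Core(), Core()
    for item in range(num_items):
        lines = [item * lines_per_item + i for i in range(lines_per_item)]
        for line in lines:
            producer.write(line)
        if marshal:
            # Assumed marshaling step: move the lines the producer generated
            # into the consumer's cache before the consumer segment begins.
            for line in lines:
                producer.cache.discard(line)
                consumer.cache.add(line)
        for line in lines:
            consumer.read(line)
    return consumer.misses


baseline = run_pipeline(num_items=100, lines_per_item=4, marshal=False)
with_dm = run_pipeline(num_items=100, lines_per_item=4, marshal=True)
print(f"misses without marshaling: {baseline}")
print(f"misses with marshaling:    {with_dm}")
```

In this idealized model every inter-segment access misses without marshaling and none miss with it, which corresponds to the "ideal" bound the abstract mentions. A real design has finite caches and must decide which lines to push, so it would not be expected to reach zero.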

Index Terms

Data marshaling for multi-core architectures
1. Computer systems organization
  1. Architectures

Published In

ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture

June 2010

520 pages

ISBN:9781450300537

DOI:10.1145/1815961

General Chair: André Seznec (INRIA Rennes)
Program Chairs: Uri Weiser (Technion), Ronny Ronen (Intel)

ACM SIGARCH Computer Architecture News Volume 38, Issue 3
ISCA '10
June 2010
508 pages
ISSN:0163-5964
DOI:10.1145/1816038

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ISCA '10

Sponsor:

SIGARCH

ISCA '10: The 37th Annual International Symposium on Computer Architecture

June 19 - 23, 2010

Saint-Malo, France

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%
