Comparing memory systems for chip multiprocessors

Authors:

Jacob Leverich,

Hideho Arakida,

Alex Solomatnikov,

Amin Firoozshahian,

Christos Kozyrakis

ACM SIGARCH Computer Architecture News, Volume 35, Issue 2

Pages 358 - 368

https://doi.org/10.1145/1273440.1250707

Published: 09 June 2007

Abstract

There are two basic models for the on-chip memory in CMP systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and non-allocating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application's code.
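To make the bandwidth argument about non-allocating stores concrete, the following is a minimal back-of-the-envelope sketch, not the paper's methodology or results. It assumes a hypothetical kernel of the form `out[i] = f(in[i])` with no data reuse, arrays much larger than the on-chip cache, and output lines that are overwritten in full. The function name `offchip_bytes` and its parameters are illustrative only.

```python
def offchip_bytes(n_elems, elem_size=4, write_allocate=True):
    """Estimate off-chip traffic (bytes) for out[i] = f(in[i]) with no reuse.

    Toy model, assuming arrays far larger than the cache and output
    lines that are fully overwritten.
    """
    # Every input element is read from off-chip memory once.
    input_reads = n_elems * elem_size
    # Every output element is eventually written back once.
    output_writebacks = n_elems * elem_size
    # A write-allocate cache first fetches each output line it is about
    # to overwrite; a non-allocating store skips this refill.
    allocate_refills = n_elems * elem_size if write_allocate else 0
    return input_reads + allocate_refills + output_writebacks


if __name__ == "__main__":
    n = 1 << 20  # hypothetical element count
    with_alloc = offchip_bytes(n, write_allocate=True)
    no_alloc = offchip_bytes(n, write_allocate=False)
    print(with_alloc, no_alloc, with_alloc / no_alloc)
```

Under these assumptions the write-allocate case moves 1.5 times the bytes of the non-allocating case, which is the kind of superfluous refill traffic that non-allocating stores are meant to avoid in a cache-based design.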

Published In

ACM SIGARCH Computer Architecture News Volume 35, Issue 2

May 2007

527 pages

ISSN:0163-5964

DOI:10.1145/1273440

ISCA '07: Proceedings of the 34th annual international symposium on Computer architecture
June 2007
542 pages
ISBN:9781595937063
DOI:10.1145/1250662
General Chair: Dean Tullsen, University of California, San Diego
Program Chair: Brad Calder, Microsoft & University of California, San Diego

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 June 2007

Published in SIGARCH Volume 35, Issue 2
