The effect of sharing on the cache and bus performance of parallel programs

Authors: S. J. Eggers and R. H. Katz

ACM SIGARCH Computer Architecture News, Volume 17, Issue 2

Pages 257 - 270

https://doi.org/10.1145/68182.68206

Published: 01 April 1989

Abstract

Bus bandwidth ultimately limits the performance, and therefore the scale, of bus-based, shared memory multiprocessors. Previous studies have extrapolated from uniprocessor measurements and simulations to estimate the performance of these machines. In this study, we use traces of parallel programs to evaluate the cache and bus performance of shared memory multiprocessors, in which coherency is maintained by a write-invalidate protocol. In particular, we analyze the effect of sharing overhead on cache miss ratio and bus utilization.

Our studies show that parallel programs incur substantially higher miss ratios and bus utilization than comparable uniprocessor programs. The sharing component of these metrics proportionally increases with both cache and block size, and for some cache configurations determines both their magnitude and trend. The amount of overhead depends on the memory reference pattern to the shared data. Programs that exhibit good per-processor-locality perform better than those with fine-grain-sharing. This suggests that parallel software writers and better compiler technology can improve program performance through better memory organization of shared data.
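
The contrast between per-processor locality and fine-grain sharing can be illustrated with a small sketch. It is not the authors' simulator or their traces: it assumes infinite per-processor caches, a simplified write-invalidate rule (a write removes the block from every other cache), and a synthetic write-only trace. The names `simulate`, `make_trace`, and `miss_ratio` are hypothetical, and the "interleaved" layout, where each process writes only its own words but the words of different processes share a block, is a simplified stand-in for fine-grain sharing.

```python
from collections import defaultdict


def simulate(trace, block_size):
    """Count misses for infinite caches kept coherent by write-invalidate.

    trace: iterable of (cpu, op, addr), op is "r" or "w", addr is a word index.
    block_size: block size in words (an assumption made for this sketch).
    """
    cached = defaultdict(set)  # cpu -> blocks currently valid in its cache
    seen = defaultdict(set)    # cpu -> blocks it has loaded at least once
    stats = {"refs": 0, "cold": 0, "invalidation": 0}
    for cpu, op, addr in trace:
        block = addr // block_size
        stats["refs"] += 1
        if block not in cached[cpu]:
            # A miss on a previously loaded block can only come from an
            # invalidation here, because the caches never evict.
            kind = "invalidation" if block in seen[cpu] else "cold"
            stats[kind] += 1
            cached[cpu].add(block)
            seen[cpu].add(block)
        if op == "w":
            # Write-invalidate: every other cached copy of the block is dropped.
            for other in list(cached):
                if other != cpu:
                    cached[other].discard(block)
    return stats


def make_trace(n_cpus, words_per_cpu, rounds, layout):
    """Synthetic trace: in each round every cpu writes each of its own words."""
    trace = []
    for _ in range(rounds):
        for i in range(words_per_cpu):
            for cpu in range(n_cpus):
                if layout == "blocked":
                    addr = cpu * words_per_cpu + i  # each cpu's data is contiguous
                else:
                    addr = i * n_cpus + cpu         # cpus' words are interleaved
                trace.append((cpu, "w", addr))
    return trace


def miss_ratio(stats):
    """Total misses (cold plus invalidation) divided by references."""
    return (stats["cold"] + stats["invalidation"]) / stats["refs"]


if __name__ == "__main__":
    for layout in ("blocked", "interleaved"):
        for block_size in (1, 4):
            s = simulate(make_trace(4, 8, 3, layout), block_size)
            print(layout, block_size, s, round(miss_ratio(s), 3))
```

In this toy setting the blocked layout incurs no invalidation misses at either block size, while the interleaved layout incurs none with one-word blocks and many with four-word blocks. That is consistent in direction with the abstract's observation that the sharing component grows with block size, though the magnitudes say nothing about the real workloads.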

Reviews

Reviewer: Andrew Robert Huber

How does the sharing resulting from writing an application program as a set of parallel processes affect cache performance? The authors investigate this question for shared-memory multiprocessors with a single bus. They use trace-driven simulation to examine the performance of four applications written explicitly for parallel execution. The parallel programming model used is single-program-multiple-data: N processes each execute identical instructions on their own part of the shared data. This corresponds to many real-world applications written for some small number of processors, with each process dedicated to its own processor. The applications are actual CAD programs written for N = 5, 11, 12, and 12 processors. The hardware simulated is RISC-like.

The unsurprising answer is an unequivocal “it depends” on the sharing the application does. Applications whose processes exhibit locality (multiple consecutive writes to shared data within a cache block) behave much like nonparallel programs. Applications with fine-grain sharing (where multiple processes contend for shared data within cache blocks) do not. In either case, cache miss ratios and bus utilization are higher than in nonparallel programs because of extra misses caused by the cache invalidations necessary to maintain cache consistency. For programs with locality, this shows up as a smaller improvement in the miss ratio as cache block size or total cache size increases. For programs with fine-grain sharing, the extra misses can be sufficient to increase the miss ratio for large block or cache size. The results for bus utilization are similar.

The paper is competently organized and presented. The usual caveats apply since the model and applications used, while representative, are limited, and the traces include only application references. It would have been interesting to see how the metrics varied with the number of processes. The results will be of interest to cache designers of shared memory multiprocessors and to programmers interested enough in performance to reorganize applications to take cache parameters into account.

Published In

ACM SIGARCH Computer Architecture News Volume 17, Issue 2

Special issue: Proceedings of ASPLOS-III: the third international conference on architecture support for programming languages and operating systems

April 1989

291 pages

ISSN:0163-5964

DOI:10.1145/68182

Editor: Joel Emer

ASPLOS III: Proceedings of the third international conference on Architectural support for programming languages and operating systems
April 1989
303 pages
ISBN:0897913000
DOI:10.1145/70082
Chairman: Joel Emer
General Chair: John Hennessy, Stanford University

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
