Working sets, cache sizes, and node granularity issues for large-scale multiprocessors

Authors:

Edward Rothberg,

Jaswinder Pal Singh,

Anoop Gupta

ACM SIGARCH Computer Architecture News, Volume 21, Issue 2

Pages 14 - 26

https://doi.org/10.1145/173682.165126

Published: 01 May 1993

Abstract

The distribution of resources among processors, memory and caches is a crucial question faced by designers of large-scale parallel machines. If a machine is to solve problems with a certain data set size, should it be built with a large number of processors each with a small amount of memory, or a smaller number of processors each with a large amount of memory? How much cache memory should be provided per processor for cost-effectiveness? And how do these decisions change as larger problems are run on larger machines?

In this paper, we explore the above questions based on the characteristics of five important classes of large-scale parallel scientific applications. We first show that all the applications have a hierarchy of well-defined per-processor working sets, whose size, performance impact and scaling characteristics can help determine how large different levels of a multiprocessor's cache hierarchy should be. Then, we use these working sets together with certain other important characteristics of the applications—such as communication to computation ratios, concurrency, and load balancing behavior—to reflect upon the broader question of the granularity of processing nodes in high-performance multiprocessors.

We find that very small caches whose sizes do not increase with the problem or machine size are adequate for all but two of the application classes. Even in the two exceptions, the working sets scale quite slowly with problem size, and the cache sizes needed for problems that will be run in the foreseeable future are small. We also find that relatively fine-grained machines, with large numbers of processors and quite small amounts of memory per processor, are appropriate for all the applications.
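
As a rough illustration of what a "well-defined working set" means in practice (this is not the authors' methodology or data), the sketch below simulates a fully associative LRU cache on a hypothetical toy access trace and reports the miss rate at several cache sizes. The function names `lru_miss_rate` and `toy_trace`, the trace shape, and all sizes are assumptions chosen only to make the effect visible: the miss rate drops sharply once the cache is large enough to hold each level of the working-set hierarchy and stays flat in between.

```python
from collections import OrderedDict


def lru_miss_rate(trace, capacity):
    """Miss rate of a fully associative LRU cache holding `capacity` blocks."""
    cache = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return misses / len(trace)


def toy_trace(hot=4, big=64, sweeps=50):
    """Hypothetical trace: repeated sweeps over `big` blocks, touching a
    small set of `hot` blocks after every element of the sweep."""
    trace = []
    for _ in range(sweeps):
        for j in range(big):
            trace.append(("big", j))
            trace.extend(("hot", i) for i in range(hot))
    return trace


if __name__ == "__main__":
    trace = toy_trace()
    for capacity in (2, 4, 5, 16, 67, 68, 128):
        print(f"{capacity:4d} blocks: miss rate {lru_miss_rate(trace, capacity):.4f}")
```

In this toy trace the first knee falls at 5 blocks (the hot set plus the current sweep element) and the second at 68 blocks (the hot set plus the whole swept array). Cache sizes between the knees buy nothing, which is the kind of reasoning the abstract applies to sizing each level of a cache hierarchy.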


Published In

ACM SIGARCH Computer Architecture News Volume 21, Issue 2

Special Issue: Proceedings of the 20th annual international symposium on Computer architecture (ISCA '93)

May 1993

348 pages

ISSN:0163-5964

DOI:10.1145/173682

Editor:
Doug DeGroot
Texas Instruments Inc., Dallas, TX

ISCA '93: Proceedings of the 20th annual international symposium on computer architecture
June 1993
361 pages
ISBN:0818638109
DOI:10.1145/165123
Chairman:
Alan Jay Smith

Publisher

Association for Computing Machinery

New York, NY, United States
