SoftFLASH: analyzing the performance of clustered distributed virtual shared memory

Published: 01 September 1996

Abstract

One potentially attractive way to build large-scale shared-memory machines is to use small-scale to medium-scale shared-memory machines as clusters that are interconnected with an off-the-shelf network. To create a shared-memory programming environment across the clusters, it is possible to use a virtual shared-memory software layer. Because of the low latency and high bandwidth of the interconnect available within each cluster, there are clear advantages in making the clusters as large as possible. The critical question then becomes whether the latency and bandwidth of the top-level network and the software system are sufficient to support the communication demands generated by the clusters.

To explore these questions, we have built an aggressive kernel implementation of a virtual shared-memory system using SGI multiprocessors and 100 Mbyte/sec HIPPI interconnects. The system obtains speedups on 32 processors (four nodes, eight processors per node, plus additional reserved protocol processors) that range from 6.9 on the communication-intensive FFT program to 21.6 on Ocean (both from the SPLASH-2 suite). In general, clustering is effective in reducing internode miss rates, but as the cluster size increases, increases in the remote latency, mostly due to increased TLB synchronization cost, offset the advantages. For communication-intensive applications, such as FFT, the overhead of sending out network requests, the limited network bandwidth, and the long network latency prevent the achievement of good performance. Overall, this approach still appears promising, but our results indicate that large, low-latency networks may be needed to make cluster-based virtual shared-memory machines broadly useful as large-scale shared-memory multiprocessors.
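To put the reported numbers in perspective, the short Python sketch below restates the quoted speedups as parallel efficiency on the 32 application processors and computes a lower bound on the time needed to move a single page across the 100 Mbyte/sec HIPPI interconnect. It is only an illustration of the arithmetic, not code from SoftFLASH, and the 4 KB page size is an assumption introduced here for the transfer-time estimate.

    # Back-of-the-envelope restatement of figures quoted in the abstract.
    # The 4 KB page size is an assumption added purely for illustration;
    # it is not taken from the paper.
    APP_PROCESSORS = 32          # four nodes x eight application processors
    HIPPI_BANDWIDTH = 100e6      # bytes/sec, as stated in the abstract
    PAGE_SIZE = 4096             # bytes (assumed)

    # Parallel efficiency = speedup / number of application processors.
    for program, speedup in {"FFT": 6.9, "Ocean": 21.6}.items():
        print(f"{program}: speedup {speedup:.1f}, "
              f"efficiency {speedup / APP_PROCESSORS:.0%}")

    # Lower bound on the wire time to move one page, ignoring the request
    # overhead and network latency that the abstract identifies as the
    # dominant costs for communication-intensive programs such as FFT.
    transfer_us = PAGE_SIZE / HIPPI_BANDWIDTH * 1e6
    print(f"one page needs at least {transfer_us:.0f} microseconds on the wire")

Even this bound ignores the request overhead and end-to-end latency that dominate for FFT, which is consistent with the abstract's conclusion that lower-latency top-level networks are needed for such workloads.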

References

[1]
Anant Agarwal, R. Bianchini, D. Chaiken, K. Johnson, D. Kranz, J. Kubiatowicz, Beng-Hong Lim, K. Mackenzie, and D. Yeung. The MIT Alewife Machine: Architecture and Performance, In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pp. 2-13, June 1995.
[2]
Brian Bershad and Matthew J. Zekauskas. Midway: Shared Memory Parallel Programming with Entry Consistency for Distributed Memory Multiprocessors, Carnegie Mellon University Technical Report No. CMU-CS-91-170, September 1991.
[3]
J. B. Carter. Design of the Munin Distributed Shared Memory System, Journal of Parallel and Distributed Computing, 29(2):219-27, September 1995.
[4]
A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, and W. Zwaenepoel. Software versus Hardware Shared-Memory Implementation: A Case Study, In Proceedings of the 21st Annual International Symposium on Computer Architecture, pp. 106-17, April 1994.
[5]
Rohit Chandra, K. Gharachorloo, V. Soundararajan, and A. Gupta. Performance Evaluation of Hybrid Hardware and Software Distributed Shared Memory Protocols, In Proceedings of the International Conference on Supercomputing '94, pp. 274-288, July 1994.
[6]
Jeffrey Chase, F. Amador, E. Lazowska, H. Levy, and R. Littlefield. The Amber System: Parallel Programming on a Network of Multiprocessors, In Proceedings of the Twelfth ACM Symposium on Operating System Principles, pp. 147-158, December 1989.
[7]
D. R. Cheriton, H. Goosen, and P. Boyle. Multi-level Shared Caching Techniques for Scalability in VMP-MC, In Proceedings of the 16th International Symposium on Computer Architecture, pp. 16-24, May 1989.
[8]
M. Dubois, J. C. Wang, L. A. Barroso, K. L. Lee, and Y. Chen. Delayed Consistency and its Effect on the Miss Rate of Parallel Programs, In Proceedings of Supercomputing '91, pp. 197-206, November 1991.
[9]
Andrew Erlichson, Basem Nayfeh, Jaswinder P. Singh, and Kunle Olukotun. The Benefits of Clustering in Shared Address Space Multiprocessors: An Applications-Driven Investigation, In Proceedings of Supercomputing '95, December 1995.
[10]
Ewing Lusk. Portable Programs for Parallel Processors, Holt, Rinehart, and Winston, New York, 1987.
[11]
K. Gharachorloo, Dan Lenoski, James Laudon, P. Gibbons, Anoop Gupta, and John Hennessy. Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors, In Proceedings of the 17th International Symposium on Computer Architecture, pp. 15-26, May 1990.
[12]
Chris Holt and Jaswinder Pal Singh. Hierarchical N-Body Methods on Shared Address Space Multiprocessors, In Proceedings of the Seventh SIAM International Conference on Parallel Processing for Scientific Computing, pp. 313-18, February 1995.
[13]
Kirk Johnson, M. F. Kaashoek, and D. Wallach. CRL: High-Performance All-Software Distributed Shared Memory, In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, pp. 213-28, December 1995.
[14]
Magnus Karlsson and Per Stenstrom. Performance Evaluation of a Cluster-Based Multiprocessor Built from ATM Switches and Bus-Based Multiprocessor Servers, In Proceedings of the Second International Symposium on High-Performance Computer Architecture, pp. 4-13, February 1996.
[15]
Peter Keleher. Lazy Release Consistency for Distributed Shared Memory, PhD Thesis, Rice University, Houston, January 1995.
[16]
Pete Keleher, Alan L. Cox, and Willy Zwaenepoel. Lazy Release Consistency for Software Distributed Shared Memory, In Proceedings of the 19th Annual International Symposium on Computer Architecture, pp. 13-21, May 1992.
[17]
P. Keleher, Alan Cox, S. Dwarkadas, and W. Zwaenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems, In Proceedings of the USENIX Winter 1994 Conference, pp. 115-32, January 1994.
[18]
Jeff Kuskin, David Ofelt, Mark Heinrich, John Heinlein, Richard Simoni, K. Gharachorloo, J. Chapin, David Nakahira, Joel Baxter, Mark Horowitz, Anoop Gupta, Mendel Rosenblum, and John Hennessy. The Stanford FLASH Multiprocessor, In Proceedings of the 21st International Symposium on Computer Architecture, pp. 18-21, April 1994.
[19]
W. Leler. System-level Parallel Programming Based on Linda, In Proceedings of the Third North American Transputer Users Group, pp. 175-9, April 1990.
[20]
Kai Li and Paul Hudak. Memory Coherence in Shared Virtual Memory Systems, ACM Transactions on Computer Systems, 7(4):321-359, November 1989.
[21]
Ron Minnich. Mether-NFS: A Modified NFS Which Supports Virtual Shared Memory, In Proceedings of the Symposium on Experiences with Distributed and Multiprocessor Systems IV, pp. 89-107, September 1993.
[22]
Bryan S. Rosenburg. Low-Synchronization Translation Lookaside Buffer Consistency in Large-Scale Shared-Memory Multiprocessors, In Proceedings of the Twelfth ACM Symposium on Operating System Principles, pp. 147-158, December 1989.
[23]
Dan Scales and Monica Lam. The Design and Evaluation of a Shared Object System for Distributed Memory Machines, In Proceedings of the First Symposium on Operating Systems Design and Implementation, pp. 101-14, November 1994.
[24]
Michael Y. Thompson, J. M. Barton, T. Jermoluk, and J. Wagner. Translation Lookaside Buffer Synchronization in a Multiprocessor System, In Proceedings of the USENIX Association Winter Conference, pp. 297-302, February 1988.
[25]
Steven Cameron Woo, Jaswinder Pal Singh, and John L. Hennessy. The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors, In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), pp. 219-229, October 1994.
[26]
Steven Cameron Woo, Jaswinder Pal Singh, and John L. Hennessy. The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors, Stanford University Technical Report No. CSL-TR-93-593, December 1993.
[27]
Steven Cameron Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations, In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pp. 24-36, June 1995.
[28]
Donald Yeung, John Kubiatowicz, and Anant Agarwal. MGS: A Multi-Grain Shared Memory System, In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pp. 44-55, April 1996.

Published In

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 September 1996
Published in SIGPLAN Volume 31, Issue 9
