Scaling application performance on a cache-coherent multiprocessor

Published: 01 May 1999

Abstract

Hardware-coherent, distributed shared address space systems are increasingly successful at moderate scale. However, it is unclear whether, or with how much difficulty, the performance of a load-store shared address space programming model scales to large processor counts on real applications. We examine this question using an aggressive case-study machine, the SGI Origin2000, up to 128 processors. We show for the first time that scalable performance can indeed be achieved in this programming model on a wide range of applications, including challenging kernels like FFT. However, this does not come easily, even for applications considered to be already highly optimized, and is very often not simply a matter of increasing problem size. Rather, substantial further application restructuring is often needed, which is usually quite algorithmic in nature. We examine how the restructurings compare with those needed for performance portability to shared virtual memory on clusters, and we comment on common programming guidelines for performance portability and scalability as well as on how the programming difficulty compares with that of explicit message passing. We also examine where applications spend their time on this large machine, the impact of special hardware features that the machine provides, and the impact of mapping to the network topology.


Published In

ACM SIGARCH Computer Architecture News, Volume 27, Issue 2
Special Issue: Proceedings of the 26th annual international symposium on Computer architecture (ISCA '99)
May 1999
298 pages
ISSN:0163-5964
DOI:10.1145/307338
  • ACM Conferences
    ISCA '99: Proceedings of the 26th annual international symposium on Computer architecture
    May 1999
    317 pages
    ISBN:0769501702

Publisher

Association for Computing Machinery

New York, NY, United States

