Scaling application performance on a cache-coherent multiprocessor

Authors:

Dongming Jiang,

Jaswinder Pal Singh

ACM SIGARCH Computer Architecture News, Volume 27, Issue 2

Pages 305 - 316

https://doi.org/10.1145/307338.301005

Published: 01 May 1999

Abstract

Hardware-coherent, distributed shared address space systems are increasingly successful at moderate scale. However, it is unclear whether, or with how much difficulty, the performance of a load-store shared address space programming model scales to large processor counts on real applications. We examine this question using an aggressive case-study machine, the SGI Origin2000, up to 128 processors. We show for the first time that scalable performance can indeed be achieved in this programming model on a wide range of applications, including challenging kernels like FFT. However, this does not come easily, even for applications considered to be already highly optimized, and is very often not simply a matter of increasing problem size. Rather, substantial further application restructuring is often needed, which is usually quite algorithmic in nature. We examine how the restructurings compare with those needed for performance portability to shared virtual memory on clusters, and we comment on common programming guidelines for performance portability and scalability as well as on how the programming difficulty compares with that of explicit message passing. We also examine where applications spend their time on this large machine, the impact of special hardware features that the machine provides, and the impact of mapping to the network topology.

Index Terms

1. Hardware
  1. Hardware validation
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management

Published In

ACM SIGARCH Computer Architecture News Volume 27, Issue 2

Special Issue: Proceedings of the 26th annual international symposium on Computer architecture (ISCA '99)

May 1999

298 pages

ISSN:0163-5964

DOI:10.1145/307338

Chairman:
William J. Dally
Stanford Univ.

ISCA '99: Proceedings of the 26th annual international symposium on Computer architecture
May 1999
317 pages
ISBN:0769501702
Chairmen:
Allan Gottlieb (NYU and NEC Research Institute),
William J. Dally (Stanford Univ.)

Copyright © 1999 Authors.

Publisher

Association for Computing Machinery

New York, NY, United States
