Article

Fast Address Translation Techniques for Distributed Shared Memory Compilers

Authors:

Francois Cantonnet,

Tarek A. El-Ghazawi,

Pascal Lorenz,

Jaafer Gaber

IPDPS '05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01

Page 52.2

https://doi.org/10.1109/IPDPS.2005.219

Published: 04 April 2005

Abstract

The Distributed Shared Memory (DSM) model is designed to leverage the ease of programming of the shared memory paradigm, while enabling high performance by expressing locality as in the message-passing model. Experience, however, has shown that DSM programming languages, such as UPC, may be unable to deliver the expected high level of performance. Initial investigations have shown that one of the major reasons is the overhead of translating from the UPC memory model to the virtual address space of the target architecture, which can be very costly. Experimental measurements have shown this overhead increasing execution time by up to three orders of magnitude. Previous work has also shown that some of this overhead can be avoided by hand-tuning, which, on the other hand, can significantly reduce UPC's ease of use. In addition, such tuning can only improve the performance of local shared accesses, not remote shared accesses. Therefore, a new technique that resembles Translation Lookaside Buffers (TLBs) is proposed here. This technique, called the Memory Model Translation Buffer (MMTB), has been implemented in the GCC-UPC compiler using two alternative strategies, full-table (FT) and reduced-table (RT). It will be shown that the MMTB strategies can lead to a performance boost of up to 700%, preserving ease of programming while performing similarly to hand-tuned UPC and MPI codes.
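To make the translation cost concrete, the following plain-C sketch shows the index arithmetic implied by UPC's block-cyclic layout, followed by a table lookup that turns a shared index into an ordinary pointer. It is an illustration only: the abstract does not describe how the FT and RT strategies are built, so the per-thread base-address table, the names `locate`, `translate`, and `base_table`, and the constants `THREADS`, `BLOCK`, and `NELEMS` are all assumptions made for this example, not the authors' implementation.

```c
#include <assert.h>
#include <stddef.h>

#define THREADS 4    /* hypothetical number of UPC threads */
#define BLOCK   3    /* hypothetical block size, in elements */
#define NELEMS  24   /* hypothetical shared array length */

/* Where a shared element lives: owning thread and index in its local part. */
typedef struct {
    size_t thread;
    size_t local_index;
} shared_loc;

/* Block-cyclic layout arithmetic: blocks of BLOCK elements are dealt
   round-robin to threads. The divisions and modulo operations here are
   the kind of per-access work a translation buffer aims to avoid. */
static shared_loc locate(size_t i) {
    shared_loc loc;
    loc.thread = (i / BLOCK) % THREADS;
    loc.local_index = (i / (BLOCK * THREADS)) * BLOCK + (i % BLOCK);
    return loc;
}

/* Stand-in for each thread's portion of the shared array (assumption:
   all portions are visible in one address space, as on a NUMA machine). */
static int segments[THREADS][NELEMS / THREADS];

/* Hypothetical lookup table: one base virtual address per thread. */
static int *base_table[THREADS];

static void init_table(void) {
    for (size_t t = 0; t < THREADS; ++t) {
        base_table[t] = segments[t];
    }
}

/* Translate a shared index to a plain pointer via the table. */
static int *translate(size_t i) {
    assert(i < NELEMS);
    shared_loc loc = locate(i);
    return base_table[loc.thread] + loc.local_index;
}
```

With these constants, element 7 falls in the third block, so `locate(7)` yields thread 2 and local index 1, and `translate(7)` returns the address of `segments[2][1]` once `init_table()` has run. A table indexed by thread is the simplest possible design; it trades a small amount of memory for replacing address computation with a load.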

Index Terms

Fast Address Translation Techniques for Distributed Shared Memory Compilers
1. General and reference
  1. Cross-computing tools and techniques
    1. Measurement
    2. Performance
2. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Operational analysis
  2. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Communications management
        Buffering
        Memory management
        Distributed memory


Information & Contributors

Information

Published In

IPDPS '05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01

April 2005

ISBN: 0769523129

Publisher

IEEE Computer Society

United States


Cited By

Serres O, Kayi A, Anbar A, El-Ghazawi T (2015). Enabling PGAS Productivity with Hardware Support for Shared Address Mapping. ACM Transactions on Architecture and Code Optimization 12:4 (1-26). Online publication date: 22-Dec-2015
https://dl.acm.org/doi/10.1145/2842686
Kayraklioglu E, El-Ghazawi T, Balaji P, Xu C (2015). Assessing memory access performance of chapel through synthetic benchmarks. Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (1147-1150). Online publication date: 4-May-2015
https://dl.acm.org/doi/10.1109/CCGrid.2015.157
Bakhouya M, Serres O, El-Ghazawi T (2008). A PGAS-Based Algorithm for the Longest Common Subsequence Problem. Proceedings of the 14th international Euro-Par conference on Parallel Processing (654-664). Online publication date: 26-Aug-2008
https://dl.acm.org/doi/10.1007/978-3-540-85451-7_70
Bakhouya M, Gaber J, El-Ghazawi T (2007). Towards a complexity model for design and analysis of PGAS-based algorithms. Proceedings of the Third international conference on High Performance Computing and Communications (672-682). Online publication date: 26-Sep-2007
https://dl.acm.org/doi/10.5555/2401945.2402020
Barton C, Casçaval C, Almási G, Zheng Y, Farreras M, Chatterje S, Amaral J, Schwartzbach M, Ball T (2006). Shared memory programming for large scale machines. Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation (108-117). Online publication date: 11-Jun-2006
https://dl.acm.org/doi/10.1145/1133981.1133995
Barton C, Casçaval C, Almási G, Zheng Y, Farreras M, Chatterje S, Amaral J (2006). Shared memory programming for large scale machines. ACM SIGPLAN Notices 41:6 (108-117). Online publication date: 11-Jun-2006
https://dl.acm.org/doi/10.1145/1133255.1133995
Brown J, Wen Z (2005). Toward an application support layer. Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics (912-919). Online publication date: 11-Sep-2005
https://dl.acm.org/doi/10.1007/11752578_110