Open access

Enabling PGAS Productivity with Hardware Support for Shared Address Mapping: A UPC Case Study

Authors:

Olivier Serres,

Abdullah Kayi,

Ahmad Anbar,

Tarek El-Ghazawi

ACM Transactions on Architecture and Code Optimization (TACO), Volume 12, Issue 4

Article No.: 52, Pages 1 - 26

https://doi.org/10.1145/2842686

Published: 22 December 2015

Abstract

Due to its rich memory model, the partitioned global address space (PGAS) parallel programming model strikes a balance between locality-awareness and the ease of use of the global address space model. Although locality-awareness can lead to high performance, supporting the PGAS memory model is associated with penalties that can hinder PGAS’s potential for scalability and speed of execution. This is because mapping the PGAS memory model to the underlying system is done in software, which introduces substantial overhead for shared accesses even when they are local. Compiler optimizations have not been sufficient to offset this overhead. Manual code optimizations can help, but they eliminate the productivity edge of PGAS. This article proposes a processor microarchitecture extension that can perform such address mapping in hardware with nearly no performance overhead. These extensions are then made available to compilers through extensions to the processor instructions. Thus, the need for manual optimizations is eliminated and the productivity of PGAS languages is unleashed. Using Unified Parallel C (UPC), a PGAS language, we present a case study of a prototype compiler and architecture support. Two different implementations of the system were realized. The first uses a full-system simulator, gem5, to evaluate the overall performance gain of the new hardware support. The second uses an FPGA Leon3 soft-core processor to verify implementation feasibility and to parameterize the cost of the new hardware. The new instructions show promising results on all tested codes, including the NAS Parallel Benchmark kernels in UPC. Performance improvements of up to 5.5× for unmodified codes, sometimes surpassing hand-optimized performance, were demonstrated. We also show that our four-core FPGA prototype requires less than 2.4% of the overall chip’s area.
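
To make the software mapping cost concrete, the following C sketch translates an element index of a block-cyclic shared array into the owning thread and a local index on that thread. It assumes the usual UPC block-cyclic layout (consecutive blocks dealt round-robin to threads). The names `shared_loc` and `map_shared_index` are hypothetical and are not taken from the article; this is an illustration of the kind of arithmetic involved, not the authors' runtime, compiler output, or proposed instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: where a shared array element lives. */
typedef struct {
    size_t thread;      /* thread with affinity to the element */
    size_t local_index; /* element index within that thread's local part */
} shared_loc;

/*
 * Map global element index i of a block-cyclic shared array to its
 * owner and local index. Assumes blocks of `block` elements are dealt
 * round-robin across `threads` threads, starting at thread 0.
 */
static shared_loc map_shared_index(size_t i, size_t block, size_t threads)
{
    assert(block > 0 && threads > 0);

    shared_loc loc;
    size_t block_id = i / block;          /* which block holds element i */
    loc.thread = block_id % threads;      /* blocks are dealt round-robin */
    loc.local_index = (block_id / threads) * block  /* earlier local blocks */
                      + (i % block);                /* offset inside block */
    return loc;
}
```

For example, with a block size of 2 and 4 threads, element 5 falls in block 2, so it belongs to thread 2 at local index 1. In a software-only scheme, divisions and remainders of this kind may be needed on shared accesses even when the element turns out to be local, which is the sort of overhead the abstract describes.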
