research-article
Open access

Enabling PGAS Productivity with Hardware Support for Shared Address Mapping: A UPC Case Study

Published: 22 December 2015

    Abstract

    Due to its rich memory model, the partitioned global address space (PGAS) parallel programming model strikes a balance between locality-awareness and the ease of use of the global address space model. Although locality-awareness can lead to high performance, supporting the PGAS memory model carries penalties that can hinder PGAS’s potential for scalability and speed of execution. This is because mapping the PGAS memory model onto the underlying system is done in software, which introduces substantial overhead for shared accesses even when they are local. Compiler optimizations have not been sufficient to offset this overhead. Manual code optimizations can help, but they eliminate the productivity edge of PGAS. This article proposes a processor microarchitecture extension that performs such address mapping in hardware with nearly no performance overhead. The new hardware is then exposed to compilers through extensions to the processor instruction set. Thus, the need for manual optimizations is eliminated and the productivity of PGAS languages is unleashed. Using Unified Parallel C (UPC), a PGAS language, we present a case study of a prototype compiler and architecture support. Two different implementations of the system were realized. The first uses a full-system simulator, gem5, to evaluate the overall performance gain of the new hardware support. The second uses an FPGA Leon3 soft-core processor to verify implementation feasibility and to parameterize the cost of the new hardware. The new instructions show promising results on all tested codes, including the NAS Parallel Benchmark kernels in UPC. Performance improvements of up to 5.5× for unmodified codes, sometimes surpassing hand-optimized performance, were demonstrated. We also show that our four-core FPGA prototype requires less than 2.4% of the overall chip’s area.
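
    To make the software mapping cost concrete, here is a minimal sketch of the index arithmetic implied by UPC's block-cyclic layout for a shared array declared with a block size. It is an illustration only, not the authors' implementation or any UPC runtime's API: the names `SharedAddress` and `map_shared_index` are hypothetical, and Python is used purely for readability. The integer divisions and remainders below are the kind of per-access work that the abstract describes as software overhead, and that the proposed hardware support is meant to absorb.

```python
from typing import NamedTuple


class SharedAddress(NamedTuple):
    """Hypothetical decomposition of a shared-array index (illustration only)."""
    thread: int       # thread that has affinity to the element
    phase: int        # position of the element inside its block
    local_index: int  # element index inside that thread's local portion


def map_shared_index(i: int, block_size: int, threads: int) -> SharedAddress:
    """Map index i of a block-cyclic shared array to (thread, phase, local index).

    Blocks of `block_size` consecutive elements are dealt out to threads in
    round-robin order, as in a UPC declaration `shared [B] T A[N]`.
    """
    if i < 0 or block_size <= 0 or threads <= 0:
        raise ValueError("index must be >= 0; block_size and threads must be > 0")
    block = i // block_size            # global block number holding element i
    thread = block % threads           # blocks are distributed round-robin
    phase = i % block_size             # offset of the element within its block
    # Each thread stores its blocks contiguously: full rounds come first.
    local_index = (block // threads) * block_size + phase
    return SharedAddress(thread, phase, local_index)


if __name__ == "__main__":
    # Walk a small array with block size 4 on 3 threads (assumed values).
    for i in range(16):
        print(i, map_shared_index(i, block_size=4, threads=3))
```

    Even when the resulting thread is the calling thread, software has to run this arithmetic before it can issue an ordinary load or store, which is why purely local shared accesses still pay the mapping cost.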

    Cited By

    • (2021) The NAS parallel benchmarks for evaluating C++ parallel programming frameworks on shared-memory architectures. Future Generation Computer Systems. DOI: 10.1016/j.future.2021.07.021. Online publication date: Jul-2021
    • (2017) HPC-Oriented Toolchain for Hardware Simulators. 2017 IEEE International Conference on Cluster Computing (CLUSTER), 653-654. DOI: 10.1109/CLUSTER.2017.108. Online publication date: Sep-2017
    • (2016) PGAS Access Overhead Characterization in Chapel. 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 1568-1577. DOI: 10.1109/IPDPSW.2016.193. Online publication date: May-2016

          Published In

          ACM Transactions on Architecture and Code Optimization  Volume 12, Issue 4
          January 2016
          848 pages
          ISSN:1544-3566
          EISSN:1544-3973
          DOI:10.1145/2836331
          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          Published: 22 December 2015
          Accepted: 01 October 2015
          Revised: 01 September 2015
          Received: 01 May 2015
          Published in TACO Volume 12, Issue 4

          Author Tags

          1. PGAS
          2. UPC
          3. address translation
          4. hardware support
          5. productivity

          Qualifiers

          • Research-article
          • Research
          • Refereed

          Funding Sources

          • National Science Foundation
