Research article (Open access)

LAPPS: Locality-Aware Productive Prefetching Support for PGAS

Published: 28 August 2018

Abstract

    Prefetching is a well-known technique for mitigating scalability challenges in the Partitioned Global Address Space (PGAS) model. It has been studied either as an automated compiler optimization or as a manual programmer optimization. Leveraging the locality awareness inherent to PGAS, we define a hybrid point in this tradeoff. Specifically, we introduce Locality-Aware Productive Prefetching Support for PGAS (LAPPS). Our novel, user-driven approach strikes a balance between the ease of use of compiler-based automated prefetching and the high performance of laborious manual prefetching. Our prototype implementation in Chapel shows that significant scalability and performance improvements can be achieved with minimal effort in common applications.
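    The core idea of user-driven prefetching described in the abstract can be sketched as follows: the programmer supplies a one-line hint that a block of a remote array will soon be needed, the runtime fetches that block in a single bulk transfer (possibly overlapped with computation), and subsequent element accesses hit a local replica instead of paying a per-element communication round-trip. This is a minimal Python sketch of that pattern; all names here (`RemoteArray`, `prefetch`) are hypothetical illustrations, not the paper's actual Chapel API.

```python
import threading
import time

class RemoteArray:
    """Simulates a remotely hosted array: each element access pays a
    fixed communication latency unless the element was prefetched
    into a local replica beforehand."""

    def __init__(self, data, latency=0.001):
        self._data = list(data)
        self._latency = latency
        self._cache = {}          # locally replicated elements
        self.remote_gets = 0      # count of fine-grained remote accesses

    def prefetch(self, indices):
        """User-supplied hint: fetch a whole block in one bulk transfer,
        asynchronously, so the caller can overlap it with other work."""
        def worker():
            time.sleep(self._latency)           # one bulk round-trip
            for i in indices:
                self._cache[i] = self._data[i]
        t = threading.Thread(target=worker)
        t.start()
        return t

    def __getitem__(self, i):
        if i in self._cache:
            return self._cache[i]               # local, fast path
        self.remote_gets += 1
        time.sleep(self._latency)               # per-element round-trip
        return self._data[i]

arr = RemoteArray(range(100))
arr.prefetch(range(100)).join()                 # one bulk transfer
total = sum(arr[i] for i in range(100))         # every access is local
```

    Without the `prefetch` hint, the loop would issue 100 fine-grained remote gets; with it, communication collapses to a single bulk transfer, which is the tradeoff the abstract contrasts against fully automated (compiler) and fully manual prefetching.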



    Published In

    ACM Transactions on Architecture and Code Optimization  Volume 15, Issue 3
    September 2018
    322 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/3274266
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 August 2018
    Accepted: 01 June 2018
    Revised: 01 March 2018
    Received: 01 November 2017
    Published in TACO Volume 15, Issue 3


    Author Tags

    1. Chapel
    2. PGAS
    3. prefetching
    4. runtime system

    Qualifiers

    • Research-article
    • Research
    • Refereed


    Cited By

    • (2024) Adaptive Prefetching for Fine-grain Communication in PGAS Programs. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 740--751. DOI: 10.1109/IPDPS57955.2024.00071. Online publication date: 27-May-2024.
    • (2023) Extending OpenSHMEM with Aggregation Support for Improved Message Rate Performance. Euro-Par 2023: Parallel Processing, 32--46. DOI: 10.1007/978-3-031-39698-4_3. Online publication date: 28-Aug-2023.
    • (2022) Cost-aware Programming on Page-based Distributed Shared Memory. Journal of Information Processing 30, 464--475. DOI: 10.2197/ipsjjip.30.464. Online publication date: 2022.
    • (2022) SPAMeR: Speculative Push for Anticipated Message Requests in Multi-Core Systems. Proceedings of the 51st International Conference on Parallel Processing, 1--12. DOI: 10.1145/3545008.3545044. Online publication date: 29-Aug-2022.
    • (2021) A Machine-Learning-Based Framework for Productive Locality Exploitation. IEEE Transactions on Parallel and Distributed Systems 32, 6, 1409--1424. DOI: 10.1109/TPDS.2021.3051348. Online publication date: 1-Jun-2021.
    • (2021) Locality-Based Optimizations in the Chapel Compiler. Languages and Compilers for Parallel Computing, 3--17. DOI: 10.1007/978-3-030-99372-6_1. Online publication date: 13-Oct-2021.
    • (2020) An Automated Machine Learning Approach for Data Locality Optimizations in Chapel. 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 671. DOI: 10.1109/IPDPSW50202.2020.00113. Online publication date: May-2020.
    • (2019) A Machine Learning Approach for Productive Data Locality Exploitation in Parallel Computing Systems. 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 361--370. DOI: 10.1109/CCGRID.2019.00050. Online publication date: May-2019.
    • (2018) Chapel Aggregation Library (CAL). 2018 IEEE/ACM Parallel Applications Workshop, Alternatives To MPI (PAW-ATM), 34--43. DOI: 10.1109/PAW-ATM.2018.00009. Online publication date: Nov-2018.
