Research article (Open access)

LAPPS: Locality-Aware Productive Prefetching Support for PGAS

Published: 28 August 2018

Abstract

    Prefetching is a well-known technique for mitigating scalability challenges in the Partitioned Global Address Space (PGAS) model. It has been studied either as an automated compiler optimization or as a manual programmer optimization. Leveraging the locality awareness inherent to PGAS, we define a hybrid point in this tradeoff. Specifically, we introduce Locality-Aware Productive Prefetching Support for PGAS (LAPPS). Our novel, user-driven approach strikes a balance between the ease of use of compiler-based automated prefetching and the high performance of laborious manual prefetching. Our prototype implementation in Chapel shows that significant scalability and performance improvements can be achieved with minimal effort in common applications.
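    The core idea of user-driven prefetching described in the abstract can be sketched as follows: the programmer supplies a one-line hint that a block of a remote array will soon be needed, the runtime fetches that block in a single bulk transfer (possibly overlapped with computation), and subsequent element accesses hit a local replica instead of paying a per-element communication round-trip. This is a minimal Python sketch of that pattern; all names here (`RemoteArray`, `prefetch`) are hypothetical illustrations, not the paper's actual Chapel API.

```python
import threading
import time

class RemoteArray:
    """Simulates a remotely hosted array: each element access pays a
    fixed communication latency unless the element was prefetched
    into a local replica beforehand."""

    def __init__(self, data, latency=0.001):
        self._data = list(data)
        self._latency = latency
        self._cache = {}          # locally replicated elements
        self.remote_gets = 0      # count of fine-grained remote accesses

    def prefetch(self, indices):
        """User-supplied hint: fetch a whole block in one bulk transfer,
        asynchronously, so the caller can overlap it with other work."""
        def worker():
            time.sleep(self._latency)           # one bulk round-trip
            for i in indices:
                self._cache[i] = self._data[i]
        t = threading.Thread(target=worker)
        t.start()
        return t

    def __getitem__(self, i):
        if i in self._cache:
            return self._cache[i]               # local, fast path
        self.remote_gets += 1
        time.sleep(self._latency)               # per-element round-trip
        return self._data[i]

arr = RemoteArray(range(100))
arr.prefetch(range(100)).join()                 # one bulk transfer
total = sum(arr[i] for i in range(100))         # every access is local
```

    Without the `prefetch` hint, the loop would issue 100 fine-grained remote gets; with it, communication collapses to a single bulk transfer, which is the tradeoff the abstract contrasts against fully automated (compiler) and fully manual prefetching.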



    Published In

    ACM Transactions on Architecture and Code Optimization  Volume 15, Issue 3
    September 2018
    322 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/3274266
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 August 2018
    Accepted: 01 June 2018
    Revised: 01 March 2018
    Received: 01 November 2017
    Published in TACO Volume 15, Issue 3


    Author Tags

    1. Chapel
    2. PGAS
    3. prefetching
    4. runtime system

    Qualifiers

    • Research-article
    • Research
    • Refereed


    Cited By

    • (2024) Adaptive Prefetching for Fine-grain Communication in PGAS Programs. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 740--751. DOI: 10.1109/IPDPS57955.2024.00071. Online publication date: 27-May-2024.
    • (2023) Extending OpenSHMEM with Aggregation Support for Improved Message Rate Performance. Euro-Par 2023: Parallel Processing, 32--46. DOI: 10.1007/978-3-031-39698-4_3. Online publication date: 28-Aug-2023.
    • (2022) Cost-aware Programming on Page-based Distributed Shared Memory. Journal of Information Processing 30, 464--475. DOI: 10.2197/ipsjjip.30.464. Online publication date: 2022.
    • (2022) SPAMeR: Speculative Push for Anticipated Message Requests in Multi-Core Systems. Proceedings of the 51st International Conference on Parallel Processing, 1--12. DOI: 10.1145/3545008.3545044. Online publication date: 29-Aug-2022.
    • (2021) A Machine-Learning-Based Framework for Productive Locality Exploitation. IEEE Transactions on Parallel and Distributed Systems 32, 6, 1409--1424. DOI: 10.1109/TPDS.2021.3051348. Online publication date: 1-Jun-2021.
    • (2021) Locality-Based Optimizations in the Chapel Compiler. Languages and Compilers for Parallel Computing, 3--17. DOI: 10.1007/978-3-030-99372-6_1. Online publication date: 13-Oct-2021.
    • (2020) An Automated Machine Learning Approach for Data Locality Optimizations in Chapel. 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 671. DOI: 10.1109/IPDPSW50202.2020.00113. Online publication date: May-2020.
    • (2019) A Machine Learning Approach for Productive Data Locality Exploitation in Parallel Computing Systems. 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 361--370. DOI: 10.1109/CCGRID.2019.00050. Online publication date: May-2019.
    • (2018) Chapel Aggregation Library (CAL). 2018 IEEE/ACM Parallel Applications Workshop, Alternatives To MPI (PAW-ATM), 34--43. DOI: 10.1109/PAW-ATM.2018.00009. Online publication date: Nov-2018.
