research-article

Open access

Exploiting Hierarchical Locality in Deep Parallel Architectures

Authors:

Ahmad Anbar,

Olivier Serres,

Engin Kayraklioglu,

Abdel-Hameed A. Badawy,

Tarek El-GhazawiAuthors Info & Claims

ACM Transactions on Architecture and Code Optimization (TACO), Volume 13, Issue 2

Article No.: 16, Pages 1 - 25

https://doi.org/10.1145/2897783

Published: 14 June 2016 Publication History

PDF eReader

Abstract

Parallel computers are becoming deeply hierarchical. Locality-aware programming models allow programmers to control locality at one level through establishing affinity between data and executing activities. This, however, does not enable locality exploitation at other levels. Therefore, we must conceive an efficient abstraction of hierarchical locality and develop techniques to exploit it. Techniques applied directly by programmers, beyond the first level, burden the programmer and hinder productivity. In this article, we propose the Parallel Hierarchical Locality Abstraction Model for Execution (PHLAME). PHLAME is an execution model to abstract and exploit machine hierarchical properties through locality-aware programming and a runtime that takes into account machine characteristics, as well as a data sharing and communication profile of the underlying application. This article presents and experiments with concepts and techniques that can drive such runtime system in support of PHLAME. Our experiments show that our techniques scale up and achieve performance gains of up to 88%.

References

[1]

Ahmad Anbar, Abdel-Hameed Badawy, Olivier Serres, and Tarek El-Ghazawi. 2014. Where should the threads go? Leveraging hierarchical data locality to solve the thread affinity dilemma. In Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS’14). IEEE. http://hpcl.seas.gwu.edu/eprints/38/.

Crossref

Google Scholar

[2]

Ahmad Anbar, Olivier Serres, Engin Kayraklioglu, Abdel-Hameed Badawy, and Tarek El-Ghazawi. 2015. PHLAME: Hierarchical locality exploitation using the PGAS model. In Proceedings of the 2015 9th International Conference on Partitioned Global Address Space Programming Models (PGAS’15). IEEE Computer Society, Washington, DC, 82--89.

Digital Library

Google Scholar

[3]

D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS parallel benchmarks&Mdash;summary and preliminary results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing’91). ACM, New York, NY, 158--165.

Digital Library

Google Scholar

[4]

Dan Bonachea. 2002. GASNet Specification, V1.1. Technical Report. Berkeley, CA, USA.

Digital Library

Google Scholar

[5]

Francois Broquedis, Jérôme Clet-Ortega, Stéphanie Moreaud, Nathalie Furmento, Brice Goglin, Guillaume Mercier, Samuel Thibault, and Raymond Namyst. 2010. Hwloc: A generic framework for managing hardware affinities in HPC applications. In Proceedings of the 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing (PDP’10). IEEE Computer Society, Washington, DC, 180--186.

Digital Library

Google Scholar

[6]

Jehoshua Bruck, Ching-Tien Ho, Eli Upfal, Shlomo Kipnis, and Derrick Weathersby. 1997. Efficient algorithms for all-to-all communications in multiport message-passing systems. IEEE Trans. Parallel Distrib. Syst. 8, 11 (Nov. 1997), 1143--1156.

Digital Library

Google Scholar

[7]

B. L. Chamberlain, D. Callahan, and H. P. Zima. 2007. Parallel programmability and the Chapel language. Int. J. High Perform. Comput. Appl. 21, 3 (Aug. 2007), 291--312. 1094342007078442

Digital Library

Google Scholar

[8]

P. Conway, N. Kalyanasundharam, G. Donley, K. Lepak, and B. Hughes. 2010. Cache hierarchy and memory subsystem of the AMD opteron processor. IEEE Micro 30, 2 (March 2010), 16--29. 10.1109/MM.2010.31

Digital Library

Google Scholar

[9]

E. H. M. Cruz, M. Diener, and P. O. A. Navaux. 2012. Using the translation lookaside buffer to map threads in parallel applications based on shared memory. In Proceedings of the 2012 IEEE 26th International Parallel Distributed Processing Symposium (IPDPS). 532--543.

Digital Library

Google Scholar

[10]

Leonardo Dagum and Ramesh Menon. 1998. OpenMP: An industry-standard API for shared-memory programming. IEEE Comput. Sci. Eng. 5, 1 (Jan. 1998), 46--55.

Digital Library

Google Scholar

[11]

Karen Devine, Erik Boman, Robert Heaphy, Bruce Hendrickson, and Courtenay Vaughan. 2002. Zoltan data management services for parallel dynamic applications. Comput. Sci. Eng. 4, 2 (2002), 90--97.

Digital Library

Google Scholar

[12]

Tarek El-Ghazawi and François Cantonnet. 2002. UPC performance and potential: A NPB experimental study. In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing. IEEE Computer Society Press, Los Alamitos, CA, 1--26.

Digital Library

Google Scholar

[13]

Tarek El-Ghazawi, William Carlson, Thomas Sterling, and Katherine Yelick. 2003. UPC: Distributed Shared-Memory Programming. Wiley-Interscience, New York, NY.

Digital Library

Google Scholar

[14]

Tarek El-Ghazawi, W. Carlson, T. Sterling, and Katherine Yelick. 2005. UPC: Distributed Shared Memory Programming.

Digital Library

Google Scholar

[15]

Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. 2004. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Proceedings, 11th European PVM/MPI Users’ Group Meeting. Budapest, Hungary, 97--104.

Google Scholar

[16]

Habanero-C. 2015. Homepage. Retrieved from https://wiki.rice.edu/confluence/display/HABANERO/ Habanero-C.

Google Scholar

[17]

Nikhil Jain, Abhinav Bhatele, Xiang Ni, Nicholas J. Wright, and Laxmikant V. Kale. 2014. Maximizing throughput on a dragonfly network. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’14). IEEE Press, Piscataway, NJ, 336--347.

Digital Library

Google Scholar

[18]

E. Jeannot, G. Mercier, and F. Tessier. 2014. Process placement in multicore clusters: Algorithmic issues and practical techniques. IEEE Trans. Parallel Distrib. Syst. 25, 4 (April 2014), 993--1002.

Digital Library

Google Scholar

[19]

George Karypis and Vipin Kumar. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20, 1 (Dec. 1998), 359--392.

Digital Library

Google Scholar

[20]

R. Lucas, J. Ang, K. Bergman, S. Borkar, W. Carlson, L. Carrington, G. Chiu, R. Colwell, W. Dally, J. Dongarra, and others. 2014. Top ten exascale research challenges. DOE ASCAC Subcommittee Report (February 2014).

Google Scholar

[21]

Peter S. Magnusson, Magnus Christensson, Jesper Eskilson, Daniel Forsgren, Gustav Hållberg, Johan Högberg, Fredrik Larsson, Andreas Moestedt, and Bengt Werner. 2002. Simics: A full system simulation platform. Computer 35, 2 (Feb. 2002), 50--58.

Digital Library

Google Scholar

[22]

Deepak Majeti, Rajkishore Barik, Jisheng Zhao, Max Grossman, and Vivek Sarkar. 2014. Compiler-driven data layout transformation for heterogeneous platforms. In Euro-Par 2013: Parallel Processing Workshops. Springer, Berlin, 188--197.

Google Scholar

[23]

E. H. Molina da Cruz, M. A. Zanata Alves, A. Carissimi, P. O. A. Navaux, C. P. Ribeiro, and J. Mehaut. 2011. Using memory access traces to map threads and data on hierarchical multi-core platforms. In 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW). 551--558.

Digital Library

Google Scholar

[24]

PGAS. 2015. Partitioned Global Address Space Languages. (2015). http://pgas.org Accessed: 7/2015.

Google Scholar

[25]

Stephen W. Poole, Oscar Hernandez, Jeffery A. Kuehn, Galen M. Shipman, Anthony Curtis, and Karl Feind. 2011. OpenSHMEM - Toward a unified RMA model. In Encyclopedia of Parallel Computing, David Padua (Ed.). Springer, Berlin, 1379--1391.

Crossref

Google Scholar

[26]

Bogdan Prisacari, German Rodriguez, Cyriel Minkenberg, and Torsten Hoefler. 2013. Bandwidth-optimal all-to-all exchanges in fat tree networks. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing (ICS’13). ACM, New York, NY 139--148.

Digital Library

Google Scholar

[27]

Paul Sack and William Gropp. 2012. Faster topology-aware collective algorithms through non-minimal communication. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’12). ACM, New York, NY 45--54.

Digital Library

Google Scholar

[28]

Sameer S. Shende and Allen D. Malony. 2006. The tau parallel performance system. Int. J. High Perform. Comput. Appl. 20, 2 (May 2006), 287--311.

Digital Library

Google Scholar

[29]

Hung-Hsun Su, M. Billingsley, and A. D. George. 2008. Parallel performance wizard: A performance analysis tool for partitioned global-address-space programming. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, 2008. IPDPS 2008. 1--8. 10.1109/IPDPS.2008.4536476

Crossref

Google Scholar

[30]

H. Subramoni, K. Kandalla, J. Vienne, S. Sur, B. Barth, K. Tomko, R. Mclay, K. Schulz, and D. K. Panda. 2011. Design and evaluation of network topology-/speed- aware broadcast algorithms for InfiniBand clusters. In Proceedings of the 2011 IEEE International Conference on Cluster Computing (CLUSTER’11). IEEE Computer Society, Washington, DC, 317--325.

Digital Library

Google Scholar

[31]

Jingjin Wu, Zhiling Lan, Xuanxing Xiong, Nickolay Y. Gnedin, and Andrey V. Kravtsov. 2012. Hierarchical task mapping of cell-based AMR cosmology simulations. In Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC’12). IEEE Computer Society, Washington, DC, 1--10.

Digital Library

Google Scholar

[32]

Katherine Yelick, Vivek Sarkar, John Mellor-Crummey, James Demmel, Krste Asanovi, Armando Fox, Mattan Erez, Dan Quinlan, Surendra Byna, Marc Day, Tony Drummond, Paul Hargrove, Steven Hofmeyr, Costin Iancu, Khaled Ibrahim, Frank Mueller, Leonid Oliker, Eric Roman, John Shalf, David Skinner, Erich Strohmaier, Brian Van Straalen, Samuel Williams, and Yili Zheng. 2015. DEGAS: Dynamic Exascale Global Address Space; slides available online, retrieved 5/2015, http://goo.gl/IrfsIs.

Google Scholar

[33]

Wei Zheng, Lu Tang, and R. Sakellariou. 2015. A priority-based scheduling heuristic to maximize parallelism of ready tasks for DAG applications. In Proceedings of the 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). 596--605.

Digital Library

Google Scholar

Cited By

View all

Cruz EDiener MPilla LNavaux P(2021)Online Thread and Data Mapping Using a Sharing-Aware Memory Management UnitACM Transactions on Modeling and Performance Evaluation of Computing Systems10.1145/34336875:4(1-28)Online publication date: 21-Jan-2021
https://dl.acm.org/doi/10.1145/3433687
Kayraklioglu EFavry EEl-Ghazawi T(2021)A Machine-Learning-Based Framework for Productive Locality ExploitationIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.305134832:6(1409-1424)Online publication date: 1-Jun-2021
https://doi.org/10.1109/TPDS.2021.3051348
Cruz EDiener MPilla LNavaux P(2019)EagerMapACM Transactions on Parallel Computing10.1145/33097115:4(1-24)Online publication date: 8-Mar-2019
https://dl.acm.org/doi/10.1145/3309711
Show More Cited By

Index Terms

Exploiting Hierarchical Locality in Deep Parallel Architectures
1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent programming languages
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
    2. General programming languages
      1. Language types
        Concurrent programming languages

Recommendations

PHLAME: Hierarchical Locality Exploitation Using the PGAS Model
PGAS '15: Proceedings of the 2015 9th International Conference on Partitioned Global Address Space Programming Models

Parallel computers are becoming deeply hierarchical. Locality aware programming models allow programmers to control locality at one level through establishing affinity between data and executing activities. This, however, does not enable locality ...
Asynchronous PGAS runtime for Myrinet networks
PGAS '10: Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model

PGAS languages aim to enhance productivity for large scale systems. The IBM Asynchronous PGAS runtime (APGAS) supports various high productivity programming languages including UPC, X10 and CAF. The runtime has been designed for scalability and ...
Exploiting reuse locality on inclusive shared last-level caches
Special Issue on High-Performance Embedded Architectures and Compilers

Optimization of the replacement policy used for Shared Last-Level Cache (SLLC) management in a Chip-MultiProcessor (CMP) is critical for avoiding off-chip accesses. Temporal locality, while being exploited by first levels of private cache memories, is ...

Reviews

Reviewer: Maulik A Dave

Locality awareness in programs can be used to improve their execution performance on parallel computers. Modern parallel computers have many levels of parallelism; many cores on a chip and many chips in a node are examples. Locality awareness is the affinity of parallel threads to the distributed data. The paper describes a runtime system that takes locality awareness information expressed in the program and maps it to a multilevel parallel computer to improve execution times of the program. It describes the internals of the system and performance improvement results of various experiments. The paper describes the parallel hierarchical locality abstraction model (PHLAME). The first section presents bandwidth graphs for a modern parallel computer, showing bandwidths at various levels of its organization. The motivation of the work is to reduce communications among the threads executing on different cores/processors. The second section gives a survey of similar projects in the past few years. In the third section, the internals of the PHLAME runtime system are described. The PHLAME implementation model consists of a locality-aware programming model, a mappings evaluation mechanism, mapping strategies, descriptive models, and a runtime systems mapping. The formalism contains a machine description, application profiling, fitness of integrating threads, partitioning of thread interaction graphs, and a PHLAME adaptive selection test algorithm. Partitioning algorithms such as clustering, restricted splitting, and nonrestricted splitting are described. The adaptive selection algorithm quickly chooses the best partitioning algorithm. The next section describes the performances of these algorithms on a real-life multilevel parallel computer. The benchmarks chosen for performance measuring experiments are network-attached storage (NAS) parallel benchmarks written in message passing interface (MPI) and unified parallel C (UPC). The last section discusses the improvement in performance. The performance gains are shown to vary from two to 80 percent, which means that this optimization approach can be useful along with other optimizations. The paper claims it is unique because it extracts multilevel communications improvement from single-level locality-aware programs. This means that resources in rewriting existing algorithms can be saved. On the other hand, the approach opens a new application for graph partitioning algorithms and packages. The formalism in the paper is dedicated to explaining the setup of the approach. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Architecture and Code Optimization

ACM Transactions on Architecture and Code Optimization Volume 13, Issue 2

June 2016

200 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2952301

Editor:
Koen De Bosschere
Ghent University

Issue’s Table of Contents

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 June 2016

Accepted: 01 February 2016

Revised: 01 January 2016

Received: 01 September 2015

Published in TACO Volume 13, Issue 2

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

9
Total Citations
View Citations
514
Total Downloads

Downloads (Last 12 months)63
Downloads (Last 6 weeks)16

Reflects downloads up to 09 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

View all

Cruz EDiener MPilla LNavaux P(2021)Online Thread and Data Mapping Using a Sharing-Aware Memory Management UnitACM Transactions on Modeling and Performance Evaluation of Computing Systems10.1145/34336875:4(1-28)Online publication date: 21-Jan-2021
https://dl.acm.org/doi/10.1145/3433687
Kayraklioglu EFavry EEl-Ghazawi T(2021)A Machine-Learning-Based Framework for Productive Locality ExploitationIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.305134832:6(1409-1424)Online publication date: 1-Jun-2021
https://doi.org/10.1109/TPDS.2021.3051348
Cruz EDiener MPilla LNavaux P(2019)EagerMapACM Transactions on Parallel Computing10.1145/33097115:4(1-24)Online publication date: 8-Mar-2019
https://dl.acm.org/doi/10.1145/3309711
Kayraklioglu EFavry EEl-Ghazawi T(2019)A Machine Learning Approach for Productive Data Locality Exploitation in Parallel Computing Systems2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)10.1109/CCGRID.2019.00050(361-370)Online publication date: May-2019
https://doi.org/10.1109/CCGRID.2019.00050
Kayraklioglu EFerguson MEl-Ghazawi T(2018)LAPPSACM Transactions on Architecture and Code Optimization10.1145/323329915:3(1-26)Online publication date: 28-Aug-2018
https://dl.acm.org/doi/10.1145/3233299
Khaleghzadeh HDeldari HReddy RLastovetsky A(2018)Hierarchical multicore thread mapping via estimation of remote communicationThe Journal of Supercomputing10.1007/s11227-017-2176-674:3(1321-1340)Online publication date: 1-Mar-2018
https://dl.acm.org/doi/10.1007/s11227-017-2176-6
Diener MWhite SKale LCampbell MBodony DFreund JPeña ABalaji PGropp WThakur R(2017)Improving the memory access locality of hybrid MPI applicationsProceedings of the 24th European MPI Users' Group Meeting10.1145/3127024.3127038(1-10)Online publication date: 25-Sep-2017
https://dl.acm.org/doi/10.1145/3127024.3127038
Kayraklioglu EChang WEl-Ghazawi T(2017)Comparative Performance and Optimization of Chapel in Modern Manycore Architectures2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW.2017.126(1105-1114)Online publication date: May-2017
https://doi.org/10.1109/IPDPSW.2017.126
Diener MCruz EAlves MNavaux PKoren I(2016)Affinity-Based Thread and Data Mapping in Shared Memory SystemsACM Computing Surveys10.1145/300638549:4(1-38)Online publication date: 5-Dec-2016
https://dl.acm.org/doi/10.1145/3006385

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Abstract

References

Cited By

Index Terms

Recommendations

PHLAME: Hierarchical Locality Exploitation Using the PGAS Model

Asynchronous PGAS runtime for Myrinet networks

Exploiting reuse locality on inclusive shared last-level caches

Reviews

Access critical reviews of Computing literature here

Comments

Information

Published In

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

PDF

eReader

Get Access

Login options

Full Access

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations