research-article
Open access

Exploiting Hierarchical Locality in Deep Parallel Architectures

Published: 14 June 2016

Abstract

Parallel computers are becoming deeply hierarchical. Locality-aware programming models allow programmers to control locality at one level by establishing affinity between data and executing activities. This, however, does not enable locality exploitation at other levels. Therefore, we must conceive an efficient abstraction of hierarchical locality and develop techniques to exploit it. Techniques that programmers apply directly, beyond the first level, burden the programmer and hinder productivity. In this article, we propose the Parallel Hierarchical Locality Abstraction Model for Execution (PHLAME). PHLAME is an execution model that abstracts and exploits the hierarchical properties of a machine through locality-aware programming and a runtime that takes into account machine characteristics as well as the data sharing and communication profile of the underlying application. This article presents and experiments with concepts and techniques that can drive such a runtime system in support of PHLAME. Our experiments show that our techniques scale up and achieve performance gains of up to 88%.




Reviews

Maulik A Dave

Locality awareness in programs can be used to improve their execution performance on parallel computers. Modern parallel computers have many levels of parallelism; many cores on a chip and many chips in a node are examples. Locality awareness is the affinity of parallel threads to the distributed data. The paper describes a runtime system that takes the locality information expressed in a program and maps it onto a multilevel parallel computer to reduce the program's execution time. It describes the internals of the system and reports the performance improvements measured in various experiments.

The paper presents the Parallel Hierarchical Locality Abstraction Model for Execution (PHLAME). The first section presents bandwidth graphs for a modern parallel computer, showing the bandwidth at each level of its organization. The motivation of the work is to reduce communication among threads executing on different cores or processors. The second section surveys similar projects from the past few years. The third section describes the internals of the PHLAME runtime system. The PHLAME implementation model consists of a locality-aware programming model, a mapping evaluation mechanism, mapping strategies, descriptive models, and runtime system mapping. The formalism covers the machine description, application profiling, the fitness of integrating threads, the partitioning of thread interaction graphs, and the PHLAME adaptive selection test algorithm. Partitioning algorithms such as clustering, restricted splitting, and nonrestricted splitting are described. The adaptive selection algorithm quickly chooses the best partitioning algorithm. The next section describes the performance of these algorithms on a real multilevel parallel computer. The benchmarks chosen for the experiments are the NAS Parallel Benchmarks, written in Message Passing Interface (MPI) and Unified Parallel C (UPC).

The last section discusses the performance improvements. The gains are shown to vary from two to 80 percent, which means that this optimization approach can be useful alongside other optimizations. The paper claims to be unique in extracting multilevel communication improvements from single-level locality-aware programs. This means that the effort of rewriting existing algorithms can be saved. In addition, the approach opens a new application for graph partitioning algorithms and packages. The formalism in the paper is dedicated to explaining the setup of the approach.

Online Computing Reviews Service
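
To make the idea of mapping threads onto a hierarchical machine more concrete, the following is a minimal sketch and not the PHLAME implementation. It assumes a hypothetical two-level machine (cores grouped into sockets, sockets grouped into nodes), a made-up thread communication matrix `comm`, and illustrative per-level cost weights `level_cost`. The greedy grouping shown here is a simple stand-in for the clustering and splitting strategies mentioned in the review; every function name is invented for illustration.

```python
from itertools import combinations


def shared_level(core_a, core_b, cores_per_socket, sockets_per_node):
    """Return 0 if the cores share a socket, 1 if they share a node, else 2."""
    if core_a // cores_per_socket == core_b // cores_per_socket:
        return 0
    cores_per_node = cores_per_socket * sockets_per_node
    if core_a // cores_per_node == core_b // cores_per_node:
        return 1
    return 2


def mapping_cost(mapping, comm, level_cost, cores_per_socket, sockets_per_node):
    """Sum the communication volume of each thread pair, weighted by distance.

    mapping[t] is the core assigned to thread t (an assumed representation).
    """
    total = 0
    for i, j in combinations(range(len(mapping)), 2):
        level = shared_level(mapping[i], mapping[j],
                             cores_per_socket, sockets_per_node)
        total += comm[i][j] * level_cost[level]
    return total


def greedy_cluster(comm, group_size):
    """Group threads greedily: seed a group with the lowest-numbered unplaced
    thread, then keep adding the unplaced thread that communicates most with
    the threads already in the group."""
    unplaced = list(range(len(comm)))
    groups = []
    while unplaced:
        group = [unplaced.pop(0)]
        while len(group) < group_size and unplaced:
            best = max(unplaced, key=lambda t: sum(comm[t][g] for g in group))
            unplaced.remove(best)
            group.append(best)
        groups.append(group)
    return groups


def groups_to_mapping(groups, n_threads):
    """Place each group on consecutive cores so that a group shares a socket."""
    mapping = [0] * n_threads
    core = 0
    for group in groups:
        for thread in group:
            mapping[thread] = core
            core += 1
    return mapping


# Hypothetical profile: threads 0 and 2 talk heavily, as do threads 1 and 3.
comm = [
    [0, 1, 10, 1],
    [1, 0, 1, 10],
    [10, 1, 0, 1],
    [1, 10, 1, 0],
]
level_cost = [1, 4, 16]  # assumed relative cost: same socket, same node, remote

naive = [0, 1, 2, 3]  # thread t runs on core t
groups = greedy_cluster(comm, group_size=2)
clustered = groups_to_mapping(groups, len(comm))

print(mapping_cost(naive, comm, level_cost, 2, 2))      # 90
print(mapping_cost(clustered, comm, level_cost, 2, 2))  # 36
```

In this toy profile, placing the heavily communicating pairs on the same socket lowers the weighted cost from 90 to 36. A real runtime would derive the machine description and the communication profile from measurement rather than from hard-coded values.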


Published In

ACM Transactions on Architecture and Code Optimization, Volume 13, Issue 2
June 2016
200 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/2952301
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 June 2016
Accepted: 01 February 2016
Revised: 01 January 2016
Received: 01 September 2015
Published in TACO Volume 13, Issue 2


Author Tags

  1. PGAS
  2. PHAST
  3. PHLAME
  4. hierarchical locality exploitation
  5. productivity

Qualifiers

  • Research-article
  • Research
  • Refereed

Article Metrics

  • Downloads (last 12 months): 63
  • Downloads (last 6 weeks): 16

Reflects downloads up to 09 Nov 2024

Cited By

  • (2021) Online Thread and Data Mapping Using a Sharing-Aware Memory Management Unit. ACM Transactions on Modeling and Performance Evaluation of Computing Systems 5:4 (1-28). DOI: 10.1145/3433687. Online publication date: 21-Jan-2021
  • (2021) A Machine-Learning-Based Framework for Productive Locality Exploitation. IEEE Transactions on Parallel and Distributed Systems 32:6 (1409-1424). DOI: 10.1109/TPDS.2021.3051348. Online publication date: 1-Jun-2021
  • (2019) EagerMap. ACM Transactions on Parallel Computing 5:4 (1-24). DOI: 10.1145/3309711. Online publication date: 8-Mar-2019
  • (2019) A Machine Learning Approach for Productive Data Locality Exploitation in Parallel Computing Systems. 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID) (361-370). DOI: 10.1109/CCGRID.2019.00050. Online publication date: May-2019
  • (2018) LAPPS. ACM Transactions on Architecture and Code Optimization 15:3 (1-26). DOI: 10.1145/3233299. Online publication date: 28-Aug-2018
  • (2018) Hierarchical multicore thread mapping via estimation of remote communication. The Journal of Supercomputing 74:3 (1321-1340). DOI: 10.1007/s11227-017-2176-6. Online publication date: 1-Mar-2018
  • (2017) Improving the memory access locality of hybrid MPI applications. Proceedings of the 24th European MPI Users' Group Meeting (1-10). DOI: 10.1145/3127024.3127038. Online publication date: 25-Sep-2017
  • (2017) Comparative Performance and Optimization of Chapel in Modern Manycore Architectures. 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (1105-1114). DOI: 10.1109/IPDPSW.2017.126. Online publication date: May-2017
  • (2016) Affinity-Based Thread and Data Mapping in Shared Memory Systems. ACM Computing Surveys 49:4 (1-38). DOI: 10.1145/3006385. Online publication date: 5-Dec-2016
