Programming for parallelism and locality with hierarchically tiled arrays

Authors:

Ganesh Bikshandi,

Daniel Hoeflinger,

Gheorghe Almasi,

Basilio B. Fraguela,

María J. Garzarán,

Christoph von Praun

PPoPP '06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming

Pages 48 - 57

https://doi.org/10.1145/1122971.1122981

Published: 29 March 2006

Abstract

Tiling has proven to be an effective mechanism for developing high-performance implementations of algorithms. It can be used to organize computations so that communication costs in parallel programs are reduced and locality in sequential codes, or in the sequential components of parallel programs, is enhanced. This paper introduces a data type, the Hierarchically Tiled Array (HTA), that facilitates the direct manipulation of tiles. HTA operations are overloaded array operations. We argue that implementing HTAs in sequential object-oriented languages transforms these languages into powerful tools for developing high-performance parallel codes and codes with a high degree of locality. To support this claim, we discuss our experience implementing HTAs for MATLAB and C++ and rewriting the NAS benchmarks and a few other programs in HTA-based parallel form.
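As a rough illustration of what "direct manipulation of tiles" through overloaded array operations could look like, here is a minimal single-level sketch in Python. The class name `TiledArray` and all of its methods are hypothetical, invented for this example; they are not the interface of the authors' MATLAB or C++ HTA libraries. The sketch also leaves out the hierarchy (tiles that themselves contain tiles) and any distribution of tiles across processors.

```python
# Hypothetical illustration only: TiledArray and its methods are invented
# for this sketch and are not the HTA library's actual interface.


class TiledArray:
    """A 2-D array stored as a grid of equally sized tiles (one tiling level)."""

    def __init__(self, rows, cols, tile_rows, tile_cols, fill=0):
        if rows % tile_rows or cols % tile_cols:
            raise ValueError("tile shape must evenly divide the array shape")
        self.shape = (rows, cols)
        self.tile_shape = (tile_rows, tile_cols)
        self.grid = (rows // tile_rows, cols // tile_cols)
        # tiles[(i, j)] is a small dense block held as a list of row lists.
        self.tiles = {
            (i, j): [[fill] * tile_cols for _ in range(tile_rows)]
            for i in range(self.grid[0])
            for j in range(self.grid[1])
        }

    @classmethod
    def from_rows(cls, data, tile_rows, tile_cols):
        """Partition a list of row lists into tiles."""
        out = cls(len(data), len(data[0]), tile_rows, tile_cols)
        for r, row in enumerate(data):
            for c, value in enumerate(row):
                out[r, c] = value
        return out

    def _locate(self, r, c):
        """Map a global (row, col) index to (tile index, index inside tile)."""
        tr, tc = self.tile_shape
        return (r // tr, c // tc), (r % tr, c % tc)

    def __getitem__(self, index):
        (ti, tj), (lr, lc) = self._locate(*index)
        return self.tiles[ti, tj][lr][lc]

    def __setitem__(self, index, value):
        (ti, tj), (lr, lc) = self._locate(*index)
        self.tiles[ti, tj][lr][lc] = value

    def tile(self, i, j):
        """Return tile (i, j) itself, so callers can work on whole tiles."""
        return self.tiles[i, j]

    def __add__(self, other):
        """Overloaded '+': add two conformable arrays one tile at a time."""
        if self.shape != other.shape or self.tile_shape != other.tile_shape:
            raise ValueError("operands must have the same shape and tiling")
        out = TiledArray(*self.shape, *self.tile_shape)
        for key, block in self.tiles.items():
            out.tiles[key] = [
                [x + y for x, y in zip(row_a, row_b)]
                for row_a, row_b in zip(block, other.tiles[key])
            ]
        return out

    def to_rows(self):
        """Flatten the tiles back into an ordinary list of row lists."""
        rows, cols = self.shape
        return [[self[r, c] for c in range(cols)] for r in range(rows)]


# Usage: a 4x4 array split into four 2x2 tiles.
matrix = TiledArray.from_rows(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]], 2, 2
)
upper_right = matrix.tile(0, 1)  # the block covering rows 0-1, columns 2-3
doubled = matrix + matrix        # tile-by-tile addition via the overloaded '+'
```

In this sketch each tile-wise addition is independent of the others, which is the property a parallel implementation could exploit by assigning tiles to different processors; the sketch itself runs sequentially.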

Index Terms

Programming for parallelism and locality with hierarchically tiled arrays
1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language features
      2. Language types
        1. Concurrent programming languages

Published In

PPoPP '06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming

March 2006

258 pages

ISBN:1595931899

DOI:10.1145/1122971

General Chair: Josep Torrellas, University of Illinois

Program Chair: Siddhartha Chatterjee, IBM Research

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

PPoPP06

Sponsor:

PPoPP06: ACM SIGPLAN 2006 Symposium on Principles and Practice of Parallel Programming 2006

March 29 - 31, 2006

New York, New York, USA

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Bibliometrics

Article Metrics

Total Citations: 80
Total Downloads: 932
Downloads (Last 12 months): 15
Downloads (Last 6 weeks): 0

Reflects downloads up to 20 Jan 2025

Cited By

Steil T, Reza T, Priest B, Pearce R, Mohror K, Arnold D, Badia R (2023). Embracing Irregular Parallelism in HPC with YGM. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-13. Online publication date: 12-Nov-2023.
https://dl.acm.org/doi/10.1145/3581784.3607103
Dykes T, Foyer C, Richardson H, Svedin M, Podobas A, Jansson N, Markidis S, Tate A, McIntosh-Smith S (2021). Mamba: Portable Array-based Abstractions for Heterogeneous High-Performance Systems. 2021 International Workshop on Performance, Portability and Productivity in HPC (P3HPC), pages 10-21. Online publication date: Nov-2021.
https://doi.org/10.1109/P3HPC54578.2021.00005
Phaosawasdi A, Rodrigues C, Chen L, Wu P (2021). CubeGen: Code Generation for Accelerated GEMM-Based Convolution with Tiling. Languages and Compilers for Parallel Computing, pages 147-163. Online publication date: 26-Mar-2021.
https://doi.org/10.1007/978-3-030-72789-5_11
Devarajan H, Kougkas A, Bateman K, Sun X (2020). HCL: Distributing Parallel Data Structures in Extreme Scales. 2020 IEEE International Conference on Cluster Computing (CLUSTER), pages 248-258. Online publication date: Sep-2020.
https://doi.org/10.1109/CLUSTER49012.2020.00035
Marton F, Agus M, Gobbetti E (2019). A framework for GPU-accelerated exploration of massive time-varying rectilinear scalar volumes. Computer Graphics Forum, 38:3, pages 53-66. Online publication date: 10-Jul-2019.
https://doi.org/10.1111/cgf.13671
Bachan J, Baden S, Hofmeyr S, Jacquelin M, Kamil A, Bonachea D, Hargrove P, Ahmed H (2019). UPC++: A High-Performance Communication Framework for Asynchronous Computation. 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 963-973. Online publication date: May-2019.
https://doi.org/10.1109/IPDPS.2019.00104
Cho H, Kwon O, Midkiff S (2019). HDArray: Parallel Array Interface for Distributed Heterogeneous Devices. Languages and Compilers for Parallel Computing, pages 176-184. Online publication date: 13-Nov-2019.
https://doi.org/10.1007/978-3-030-34627-0_13
Yang C, Pichel J, Padua D (2019). Dataflow Execution of Hierarchically Tiled Arrays. Euro-Par 2019: Parallel Processing, pages 304-316. Online publication date: 26-Aug-2019.
https://dl.acm.org/doi/10.1007/978-3-030-29400-7_22
Walker D (2018). Morton ordering of 2D arrays for efficient access to hierarchical memory. International Journal of High Performance Computing Applications, 32:1, pages 189-203. Online publication date: 1-Jan-2018.
https://dl.acm.org/doi/10.5555/3195474.3195485
Fürlinger K, Kowalewski R, Fuchs T, Lehmann B, Endo T, Yokokawa M, Hanawa T, Tatebe O (2018). Investigating the performance and productivity of DASH using the Cowichan problems. Proceedings of Workshops of HPC Asia, pages 11-20. Online publication date: 31-Jan-2018.
https://dl.acm.org/doi/10.1145/3176364.3176366
