research-article

Fusion of Parallel Array Operations

Authors:

Mads R.B. Kristensen,

Simon A.F. Lund,

James Avery

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation

Pages 71 - 85

https://doi.org/10.1145/2967938.2967945

Published: 11 September 2016

Abstract

We address the problem of fusing array operations based on criteria such as shape compatibility, data reuse, and minimizing [...]. For data reuse, the fusion problem has been formulated as a static weighted graph partitioning problem (known as the Weighted Loop Fusion problem). We show that this scheme cannot accurately track data reuse between multiple independent loops, since it overestimates total data reuse in certain cases. Our formulation in terms of partitions allows the use of realistic cost functions that can track resource usage accurately. We give correctness proofs, and prove that WSP can maximize data reuse in programs exactly, in contrast to prior work.
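The overestimation described above can be illustrated with a toy example. The sketch below is not the paper's cost model. It assumes a hypothetical program of three independent operations that all read the same array `A`, a static edge weight equal to the number of arrays two operations share, and a partition cost equal to the number of distinct arrays each fused block touches. All names (`ops`, `static_graph_saving`, `partition_cost`) are made up for illustration.

```python
from itertools import combinations

# Hypothetical toy program: three independent operations that all read "A".
ops = {
    "op1": {"A", "B"},  # reads A, writes B
    "op2": {"A", "C"},  # reads A, writes C
    "op3": {"A", "D"},  # reads A, writes D
}

def static_graph_saving(block):
    """Saving predicted by static pairwise edge weights:
    sum, over every pair in the block, of the number of shared arrays."""
    return sum(len(ops[u] & ops[v]) for u, v in combinations(sorted(block), 2))

def partition_cost(partition):
    """Assumed cost of a partition: each block accesses every distinct
    array it touches exactly once."""
    return sum(len(set().union(*(ops[o] for o in block))) for block in partition)

unfused = [{"op1"}, {"op2"}, {"op3"}]
fused = [{"op1", "op2", "op3"}]

# Three edges, each of weight 1 (the shared array A).
estimated_saving = static_graph_saving(fused[0])

# Unfused: 2 + 2 + 2 = 6 accesses. Fused: A, B, C, D = 4 accesses.
actual_saving = partition_cost(unfused) - partition_cost(fused)

print(estimated_saving, actual_saving)
```

Under these assumptions the pairwise weights predict a saving of 3, while only 2 accesses of `A` are actually avoided, because the third edge counts reuse that the first two have already accounted for. A cost function evaluated on whole partition blocks does not have this problem.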

For the exact optimal solution, which is NP-hard to find, we present a branch-and-bound algorithm together with a polynomial-time preconditioner that reduces the problem size significantly in practice. We further present a polynomial-time greedy approximation that is fast enough to use for JIT-compilation and gives near-optimal results in practice.
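To give a feel for what a polynomial-time greedy heuristic over partitions can look like, here is a minimal sketch. It is not the algorithm from the paper: it assumes a cost equal to the number of distinct arrays each block touches, reduces all legality rules (shape compatibility and similar constraints) to a hypothetical `fusible(u, v)` predicate, and ignores dependency ordering between blocks entirely. The names `block_cost` and `greedy_partition` are illustrative.

```python
from itertools import combinations

def block_cost(block, accesses):
    """Assumed cost of one fused block: number of distinct arrays it touches."""
    arrays = set()
    for op in block:
        arrays |= accesses[op]
    return len(arrays)

def greedy_partition(accesses, fusible):
    """Repeatedly merge the pair of blocks that gives the largest cost
    reduction, until no legal merge reduces the cost any further."""
    blocks = [frozenset([op]) for op in accesses]
    while True:
        best, best_saving = None, 0
        for a, b in combinations(blocks, 2):
            # A merge is legal only if every cross pair of operations is fusible.
            if not all(fusible(u, v) for u in a for v in b):
                continue
            saving = (block_cost(a, accesses) + block_cost(b, accesses)
                      - block_cost(a | b, accesses))
            if saving > best_saving:
                best, best_saving = (a, b), saving
        if best is None:
            return blocks
        a, b = best
        blocks = [blk for blk in blocks if blk not in (a, b)] + [a | b]

# Hypothetical input: arrays touched by each operation.
accesses = {
    "op1": {"A", "B"},
    "op2": {"A", "C"},
    "op3": {"C", "D"},
    "op4": {"E"},
}

# Hypothetical legality rule: op1 and op3 may not share a block.
def fusible(u, v):
    return {u, v} != {"op1", "op3"}

result = greedy_partition(accesses, fusible)
total_cost = sum(block_cost(b, accesses) for b in result)
print(result, total_cost)
```

Each round inspects every pair of blocks and the number of blocks shrinks by one per merge, so the loop runs in polynomial time. The result is not guaranteed to be optimal: here `op1` and `op2` are merged first, which then blocks `op3` from joining them.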

All algorithms have been implemented in the automatic parallelization platform Bohrium, run on a set of benchmarks, and compared to existing methods from the literature.

Published In

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation

September 2016

474 pages

ISBN:9781450341219

DOI:10.1145/2967938

General Chairs: Ayal Zaks (Intel, Israel), Bilha Mendelson (Optitura, Israel)

Program Chairs: Lawrence Rauchwerger (Texas A&M University, USA), Wen-mei W. Hwu (University of Illinois at Urbana-Champaign, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

IFIP WG 10.3: IFIP WG 10.3
IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE CS TCPP: IEEE Computer Society Technical Committee on Parallel Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

PACT '16

PACT '16: International Conference on Parallel Architectures and Compilation

September 11 - 15, 2016

Haifa, Israel

Acceptance Rates

PACT '16 Paper Acceptance Rate 31 of 119 submissions, 26%;

Overall Acceptance Rate 121 of 471 submissions, 26%
