DOI: 10.1145/2967938.2967945

Fusion of Parallel Array Operations

Published: 11 September 2016

Abstract

We address the problem of fusing array operations based on criteria such as shape compatibility, data reuse, and minimizing [...]. For data reuse, the fusion problem has been formulated as a static weighted graph partitioning problem (known as the Weighted Loop Fusion problem). We show that this scheme cannot accurately track data reuse between multiple independent loops, since it overestimates the total data reuse in certain cases. Our formulation in terms of partitions allows the use of realistic cost functions that track resource usage accurately. We give correctness proofs and prove that WSP can maximize data reuse in programs exactly, in contrast to prior work.
For the exact optimal solution, which is NP-hard to find, we present a branch-and-bound algorithm together with a polynomial-time preconditioner that reduces the problem size significantly in practice. We further present a polynomial-time greedy approximation that is fast enough to use for JIT-compilation and gives near-optimal results in practice.
All algorithms have been implemented in the automatic parallelization platform Bohrium, run on a set of benchmarks, and compared to existing methods from the literature.
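
The overestimate described in the abstract can be illustrated with a toy case. The sketch below is not the paper's algorithm or Bohrium's implementation; the loop names, array sizes, and both cost functions are assumptions made for illustration. Three independent loops read the same array: a pairwise edge-weight estimate credits the shared array once per fused pair, while a cost function evaluated on the partition itself counts each distinct array once per fused block.

```python
from itertools import combinations

# Hypothetical toy program: each loop (array operation) lists the arrays it reads.
# Names and sizes are illustrative assumptions, not taken from the paper.
READS = {
    "loop1": {"A"},
    "loop2": {"A"},
    "loop3": {"A"},
}
SIZE = {"A": 1000}  # number of elements per array


def pairwise_edge_saving(block):
    """Edge-weight style estimate: add one weight per pair of fused loops,
    where the weight is the total size of the arrays the pair shares."""
    return sum(
        SIZE[a]
        for u, v in combinations(sorted(block), 2)
        for a in READS[u] & READS[v]
    )


def partition_cost(partition):
    """Partition-based cost: a fused block reads each distinct array once."""
    return sum(
        SIZE[a]
        for block in partition
        for a in set().union(*(READS[op] for op in block))
    )


def partition_saving(partition):
    """Reads avoided relative to running every loop unfused."""
    unfused = [{op} for op in READS]
    return partition_cost(unfused) - partition_cost(partition)


fused = [{"loop1", "loop2", "loop3"}]
print(pairwise_edge_saving(fused[0]))  # 3000: three pairs, each credited with A
print(partition_saving(fused))         # 2000: three reads of A reduced to one
```

In this toy case the pairwise estimate reports 3000 saved element reads, although only 2000 can actually be saved, because the single array A is credited once for each of the three fused pairs.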

Published In

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
September 2016
474 pages
ISBN:9781450341219
DOI:10.1145/2967938
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. bohrium
  2. high-productivity
  3. hpc
  4. numpy
  5. python

Qualifiers

  • Research-article

Conference

PACT '16
Sponsor:
  • IFIP WG 10.3
  • IEEE TCCA
  • SIGARCH
  • IEEE CS TCPP

Acceptance Rates

PACT '16 Paper Acceptance Rate 31 of 119 submissions, 26%;
Overall Acceptance Rate 121 of 471 submissions, 26%

Article Metrics

  • Downloads (last 12 months): 19
  • Downloads (last 6 weeks): 0

Reflects downloads up to 18 Aug 2024

Cited By

  • (2021) Tuplex. Proceedings of the 2021 International Conference on Management of Data, pages 1718-1731. DOI: 10.1145/3448016.3457244. Online publication date: 9-Jun-2021
  • (2019) Using an evolutionary approach based on shortest common supersequence problem for loop fusion. Soft Computing. DOI: 10.1007/s00500-019-04338-z. Online publication date: 21-Sep-2019
  • (2018) Array streaming for array programming. International Journal of Computational Science and Engineering, 17:3, pages 263-282. DOI: 10.5555/3292750.3292752. Online publication date: 1-Jan-2018
  • (2018) Veros v0.1 – a fast and versatile ocean simulator in pure Python. Geoscientific Model Development, 11:8, pages 3299-3312. DOI: 10.5194/gmd-11-3299-2018. Online publication date: 16-Aug-2018
  • (2018) A compiler approach to map algebra. Geoinformatica, 22:2, pages 211-235. DOI: 10.1007/s10707-017-0312-3. Online publication date: 1-Apr-2018
  • (2017) A Simple Yet Effective Balanced Edge Partition Model for Parallel Computing. ACM SIGMETRICS Performance Evaluation Review, 45:1, pages 6-6. DOI: 10.1145/3143314.3078520. Online publication date: 5-Jun-2017
  • (2017) A Simple Yet Effective Balanced Edge Partition Model for Parallel Computing. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1:1, pages 1-21. DOI: 10.1145/3084451. Online publication date: 13-Jun-2017
  • (2017) A Simple Yet Effective Balanced Edge Partition Model for Parallel Computing. Proceedings of the 2017 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, pages 6-6. DOI: 10.1145/3078505.3078520. Online publication date: 5-Jun-2017
  • (2017) From Python Scripting to Parallel Spatial Modeling: Cellular Automata Simulations of Land Use, Hydrology and Pest Dynamics. 2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), pages 511-518. DOI: 10.1109/PDP.2017.18. Online publication date: 2017
  • (2017) Acyclic Partitioning of Large Directed Acyclic Graphs. Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pages 371-380. DOI: 10.1109/CCGRID.2017.101. Online publication date: 14-May-2017
