DOI: 10.1145/3460945.3464955

Predictive data locality optimization for higher-order tensor computations

Published: 20 June 2021

Abstract

Automating locality optimization is still an open problem for compiler writers. Compiler-based approaches guided by analytical cost models have achieved some success in matching high-performance libraries on a restricted set of computations such as general matrix multiply (GEMM). Library-based approaches, on the other hand, may raise scalability concerns. Recent developments in convolutional neural networks have produced an explosion of models, each with a different combination of parameters, and manually tuning each of these configurations can take many development months. Further, these operations are called many times during machine learning training, which necessitates highly optimized implementations. 2D convolutional operators are unique in that they consist of 7-deep loop nests in which different loops carry reuse for different tensors, making it hard to identify an optimal loop ordering. We devise a machine learning-based compiler that learns a regression model correlating performance with loop order. We integrate this model with traditional compiler analyses for transformations such as loop unrolling and vectorization, relying on the Multi-Level Intermediate Representation (MLIR) compiler framework. We achieve average speedups of 1.67x and 1.41x over oneDNN for 2D convolution forward and weight update kernels, respectively. We also reach 0.88x and 0.96x of the performance of oneDNN’s best-performing implementation, which applies additional data layout transformations.
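
To make the 7-deep loop nest concrete, here is a minimal Python sketch of a direct 2D convolution forward kernel whose loop order is a parameter. It is an illustration only, not the paper's MLIR-based implementation: the function name `conv2d_forward`, the loop names `n, k, c, h, w, r, s`, the tensor layouts, and the stride-1, no-padding setting are all assumptions made for the example. The output is indexed by `(n, k, h, w)`, the input by `(n, c, h + r, w + s)`, and the weights by `(k, c, r, s)`, so each tensor is reused along a different subset of loops, which is why the choice of ordering affects locality.

```python
from itertools import product


def conv2d_forward(inp, wgt, loop_order=("n", "k", "c", "h", "w", "r", "s")):
    """Direct 2D convolution (stride 1, no padding) as a 7-deep loop nest.

    inp is laid out as [N][C][H + R - 1][W + S - 1] and wgt as [K][C][R][S]
    (assumed layouts); the result is out[N][K][H][W].
    loop_order lists the seven loops from outermost to innermost.
    """
    N, C = len(inp), len(inp[0])
    K, R, S = len(wgt), len(wgt[0][0]), len(wgt[0][0][0])
    H = len(inp[0][0]) - R + 1
    W = len(inp[0][0][0]) - S + 1
    extent = {"n": N, "k": K, "c": C, "h": H, "w": W, "r": R, "s": S}
    assert sorted(loop_order) == sorted(extent), "need each loop exactly once"

    out = [[[[0] * W for _ in range(H)] for _ in range(K)] for _ in range(N)]
    # product() varies its last range fastest, so it behaves like a loop nest
    # whose nesting follows loop_order.
    for idx in product(*(range(extent[name]) for name in loop_order)):
        i = dict(zip(loop_order, idx))
        n, k, c, h, w, r, s = (i[x] for x in "nkchwrs")
        # out is reused across c, r, s; inp across k; wgt across n, h, w.
        out[n][k][h][w] += inp[n][c][h + r][w + s] * wgt[k][c][r][s]
    return out


inp = [[[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]]   # N=1, C=1, 3x3 image
wgt = [[[[1, 0], [0, 1]]]]                     # K=1, C=1, 2x2 filter
default_order = conv2d_forward(inp, wgt)
reversed_order = conv2d_forward(
    inp, wgt, loop_order=("s", "r", "w", "h", "c", "k", "n")
)
print(default_order == reversed_order)
```

Every permutation of the seven loops computes the same result on integer data (with floating point, the summation order can change rounding), but the memory access pattern differs. A learned performance model of the kind the abstract describes would rank such permutations instead of timing all 5040 of them.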

Cited By

  • (2023) Design and Implementation of Deep Learning 2D Convolutions on Modern CPUs. IEEE Transactions on Parallel and Distributed Systems 34:12 (3104-3116). https://doi.org/10.1109/TPDS.2023.3322037. Online publication date: 1-Dec-2023

Published In

MAPS 2021: Proceedings of the 5th ACM SIGPLAN International Symposium on Machine Programming
June 2021
52 pages
ISBN:9781450384674
DOI:10.1145/3460945
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Compilers
  2. Convolutional neural networks
  3. Loop transformations
  4. Machine learning

Qualifiers

  • Research-article

Conference

PLDI '21