DOI: 10.1145/3453483.3454106 (research article)

AKG: automatic kernel generation for neural processing units using polyhedral transformations

Published: 18 June 2021

Abstract

Existing tensor compilers have proven effective at deploying deep neural networks on general-purpose hardware such as CPUs and GPUs, but optimizing for neural processing units (NPUs) remains challenging due to their heterogeneous compute units and complicated memory hierarchies.
In this paper, we present AKG, a tensor compiler for NPUs. AKG first lowers the tensor expression language to a polyhedral representation, which is used to automate the memory management of NPUs. Unlike existing approaches that resort to manually written schedules, AKG leverages polyhedral schedulers to perform a much wider class of transformations, and it extends the semantics of the polyhedral representation to combine complex tiling techniques with hierarchical fusion strategies. We also implement domain-specific optimizations for convolution in AKG. Moreover, to achieve optimal performance, we introduce complementary optimizations in code generation, followed by an auto-tuner.
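To make the tiling idea concrete, here is an illustrative sketch in plain Python (not AKG's actual representation or API; the function names and the tile size `t` are hypothetical) of the kind of loop tiling a polyhedral scheduler derives automatically:

```python
def matmul_naive(A, B, n):
    # Baseline triple loop: poor cache locality for large n.
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, n, t):
    # Tiled version: iterate over t x t x t blocks so each block's
    # working set stays in fast memory, mirroring the loop tiling a
    # polyhedral scheduler would derive.
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, t):
        for jj in range(0, n, t):
            for kk in range(0, n, t):
                for i in range(ii, min(ii + t, n)):
                    for j in range(jj, min(jj + t, n)):
                        for k in range(kk, min(kk + t, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

Both loop nests compute the same product; tiling only reorders the iterations, which is exactly the class of legality-preserving transformation the polyhedral model reasons about.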
We conduct extensive experiments on benchmarks ranging from single operators to end-to-end networks. The results show that AKG outperforms both manual scheduling approaches and vendor-provided libraries. We believe AKG will shed light on follow-up compiler work for NPUs.
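As a hedged illustration of the auto-tuning step (this is not AKG's tuner; `tune_tile_size`, `run_kernel`, and `candidates` are hypothetical names), an empirical tuner simply times each candidate configuration and keeps the fastest:

```python
import time

def tune_tile_size(run_kernel, candidates, trials=3):
    # Hypothetical auto-tuner sketch: run the kernel under each candidate
    # tile size a few times, accumulate wall-clock time, and return the
    # configuration with the lowest total time.
    best, best_time = None, float("inf")
    for t in candidates:
        elapsed = 0.0
        for _ in range(trials):
            start = time.perf_counter()
            run_kernel(t)
            elapsed += time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = t, elapsed
    return best
```

Real tuners prune the search space with cost models and search strategies rather than exhaustive timing, but the measure-and-select loop is the core mechanism.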



Published In

PLDI 2021: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, June 2021, 1341 pages. ISBN: 9781450383912. DOI: 10.1145/3453483

Publisher

Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. auto-tuning
    2. code generation
    3. neural networks
    4. neural processing units
    5. polyhedral model


    Funding Sources

    • National Natural Science Foundation of China

    Conference

    PLDI '21

    Acceptance Rates

    Overall Acceptance Rate 406 of 2,067 submissions, 20%


