DOI: 10.1145/3503222.3507723

AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures

Published: 22 February 2022

Abstract

This work reveals that memory-intensive computation is a rising performance-critical factor in recent machine learning models. Due to a unique set of new challenges, existing ML optimizing compilers cannot perform efficient fusion under complex two-level dependencies combined with just-in-time demand; they face a dilemma between costly fusion with heavy redundant computation and skipping fusion at the cost of a massive number of kernels. Furthermore, they often suffer from low parallelism because they lack support for the irregular tensor shapes common in real-world production workloads. To address these rising challenges, we propose AStitch, a machine learning optimizing compiler that opens a new multi-dimensional optimization space for memory-intensive ML computations. It systematically abstracts four operator-stitching schemes under multi-dimensional optimization objectives, tackles complex computation-graph dependencies with novel hierarchical data reuse, and efficiently handles diverse tensor shapes via adaptive thread mapping. Finally, AStitch provides just-in-time support, incorporating our proposed optimizations, for both ML training and inference. Although AStitch is a stand-alone compiler engine portable to any version of TensorFlow, its basic ideas apply generally to other ML frameworks and optimizing compilers. Experimental results show that AStitch achieves an average 1.84x speedup (up to 2.73x) over Google's state-of-the-art XLA solution across five production workloads. We have also deployed AStitch on a production ML cluster with thousands of GPUs; the system has been in operation for more than 10 months and saves about 20,000 GPU hours across 70,000 tasks per week.
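The fusion dilemma described in the abstract can be made concrete with a small sketch (illustrative only, written in plain Python rather than GPU code; the function names and the three-op chain are hypothetical and do not reproduce AStitch's actual stitching schemes). Executed op by op, a chain of memory-intensive elementwise operators materializes one intermediate tensor per step, so each element is written to and re-read from memory repeatedly; a fused kernel performs the whole chain in a single pass per element, which is the memory-traffic reduction that fusion targets:

```python
# Illustrative sketch of elementwise-op fusion (hypothetical example,
# not AStitch's implementation).

def unfused(xs):
    # Each step stands in for a separate GPU kernel launch that
    # materializes an intermediate tensor in device memory.
    t1 = [v * 2.0 for v in xs]        # kernel 1: read xs, write t1
    t2 = [v + 1.0 for v in t1]        # kernel 2: read t1, write t2
    return [max(v, 0.0) for v in t2]  # kernel 3: read t2, write output

def fused(xs):
    # One pass per element: each input is read once and each output
    # written once -- a single kernel, no intermediate tensors.
    return [max(v * 2.0 + 1.0, 0.0) for v in xs]

# Both variants compute the same result; they differ only in memory traffic.
print(fused([-1.0, 0.0, 3.0]))  # [0.0, 1.0, 7.0]
```

On memory-bound workloads the unfused form pays roughly one extra round trip to memory per intermediate, which is why skipping fusion produces a massive number of slow kernels even though each kernel does little compute.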



      Published In

      ASPLOS '22: Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
      February 2022
      1164 pages
      ISBN:9781450392051
      DOI:10.1145/3503222

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. Compiler Optimization
      2. Fusion
      3. Machine Learning
      4. Memory-Intensive Computation

      Qualifiers

      • Research-article

      Conference

      ASPLOS '22

      Acceptance Rates

      Overall Acceptance Rate 535 of 2,713 submissions, 20%

      Article Metrics

• Downloads (last 12 months): 555
• Downloads (last 6 weeks): 59
Reflects downloads up to 09 Nov 2024

      Cited By

• (2024) FreeStencil: A Fine-Grained Solver Compiler with Graph and Kernel Optimizations on Structured Meshes for Modern GPUs. Proceedings of the 53rd International Conference on Parallel Processing, 1022–1031. DOI: 10.1145/3673038.3673076. Published online: 12 Aug 2024.
• (2024) BrickDL: Graph-Level Optimizations for DNNs with Fine-Grained Data Blocking on GPUs. Proceedings of the 53rd International Conference on Parallel Processing, 576–586. DOI: 10.1145/3673038.3673046. Published online: 12 Aug 2024.
• (2024) Accelerated Auto-Tuning of GPU Kernels for Tensor Computations. Proceedings of the 38th ACM International Conference on Supercomputing, 549–561. DOI: 10.1145/3650200.3656626. Published online: 30 May 2024.
• (2024) Optimizing Dynamic-Shape Neural Networks on Accelerators via On-the-Fly Micro-Kernel Polymerization. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 797–812. DOI: 10.1145/3620665.3640390. Published online: 27 Apr 2024.
• (2024) Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 286–301. DOI: 10.1145/3617232.3624858. Published online: 27 Apr 2024.
• (2024) AtRec: Accelerating Recommendation Model Training on CPUs. IEEE Transactions on Parallel and Distributed Systems, 35(6), 905–918. DOI: 10.1109/TPDS.2024.3381186. Published online: Jun 2024.
• (2024) Exploiting Structured Feature and Runtime Isolation for High-Performant Recommendation Serving. IEEE Transactions on Computers, 73(11), 2474–2487. DOI: 10.1109/TC.2024.3449749. Published online: Nov 2024.
• (2024) Enabling Resource-Efficient AIoT System With Cross-Level Optimization: A Survey. IEEE Communications Surveys & Tutorials, 26(1), 389–427. DOI: 10.1109/COMST.2023.3319952. Published online: Sep 2025.
• (2024) Using Mixed Wavefront for Accelerating Convolutional Neural Network Parallelization. 2024 6th International Conference on Communications, Information System and Computer Engineering (CISCE), 1336–1342. DOI: 10.1109/CISCE62493.2024.10653109. Published online: 10 May 2024.
• (2024) Non-Invasive, Memory Access-Triggered Near-Data Processing for DNN Training Acceleration on GPUs. IEEE Access, 12, 142651–142667. DOI: 10.1109/ACCESS.2024.3465789. Published online: 2024.
