DOI: 10.1145/3673038.3673046
Research article · Open access

BrickDL: Graph-Level Optimizations for DNNs with Fine-Grained Data Blocking on GPUs

Published: 12 August 2024

Abstract

The end-to-end performance of deep learning model inference is often limited by excess data movement on GPUs. To reduce data movement, existing deep learning frameworks apply graph-level optimizations such as operator fusion to exploit data reuse across operators in deep learning graphs. Such optimizations are limited, however: they cannot optimize arbitrary chains of compute- and time-intensive operations, including convolutions. To address these limitations, this paper presents BrickDL, a deep learning inference library that implements merged execution of a sequence of layers as an orthogonal approach to existing graph-level optimizations. BrickDL additionally employs fine-grained blocking using a brick data layout that further improves data locality on GPUs. We implement merged execution with the abstraction of bricks using two approaches, padded bricks and memoized bricks, and develop a performance model that chooses between them using static analysis. Merged execution with bricks demonstrates performance gains on well-known deep learning models compared to PyTorch JIT, TensorFlow XLA, and cuDNN baselines on an NVIDIA A100 GPU. We also characterize the performance of the proposed optimizations with microbenchmarks and gain insights into their applicability and tradeoffs.
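The abstract names two brick-based strategies for merged execution but shows no code. The following is a minimal NumPy sketch, not BrickDL's actual API, of the padded-bricks idea: a feature map is blocked into small contiguous bricks, and a merged pair of layers runs per brick so the intermediate result never returns to slow memory. The brick size, halo width, and the 3x3 box filters standing in for convolution layers are all illustrative assumptions; on a GPU the per-brick intermediate would live in shared memory or registers rather than a Python local.

    import numpy as np

    BRICK = 8  # brick edge length (illustrative, not a value from the paper)
    HALO = 2   # halo cells per side: one per 3x3 layer in the merged pair

    def to_bricks(x, brick=BRICK):
        # Fine-grained blocking: reorder an (H, W) array into
        # (H//brick, W//brick, brick, brick) so each brick is contiguous.
        H, W = x.shape
        assert H % brick == 0 and W % brick == 0
        return (x.reshape(H // brick, brick, W // brick, brick)
                 .transpose(0, 2, 1, 3)
                 .copy())

    def blur3x3(a):
        # 3x3 box filter standing in for a convolution layer;
        # the output shrinks by one cell on every side.
        out = np.zeros((a.shape[0] - 2, a.shape[1] - 2))
        for di in range(3):
            for dj in range(3):
                out += a[di:di + out.shape[0], dj:dj + out.shape[1]]
        return out / 9.0

    def merged_two_layers(padded_brick):
        # Merged execution: both layers run back to back on one brick,
        # so the intermediate never leaves fast (here: local) memory.
        return blur3x3(blur3x3(padded_brick))

    H = W = 32
    img = np.random.rand(H + 2 * HALO, W + 2 * HALO)  # input with halo padding

    out = np.zeros((H, W))
    for bi in range(H // BRICK):
        for bj in range(W // BRICK):
            i0, j0 = bi * BRICK, bj * BRICK
            # A "padded brick" carries the extra halo cells that the merged
            # 3x3 + 3x3 stencil consumes; redundant halo computation is the
            # price paid for avoiding a global-memory round trip.
            pb = img[i0:i0 + BRICK + 2 * HALO, j0:j0 + BRICK + 2 * HALO]
            out[i0:i0 + BRICK, j0:j0 + BRICK] = merged_two_layers(pb)

    print(to_bricks(out).shape)  # (4, 4, 8, 8): each 8x8 brick is contiguous

Under the memoized-bricks alternative named in the abstract, overlapping halo results would instead be cached and shared between neighboring bricks rather than recomputed; the paper's performance model uses static analysis to pick whichever strategy is cheaper for a given layer sequence.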


Cited By

  • (2024) Bricks: A high-performance portability layer for computations on block-structured grids. The International Journal of High Performance Computing Applications 38(6), 549–567. https://doi.org/10.1177/10943420241268288 (online publication date: 19-Aug-2024)

Published In

ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing
August 2024, 1279 pages
ISBN: 9798400717932
DOI: 10.1145/3673038
© 2024 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. Data Layout
2. Deep Learning
3. GPU
4. Graph-level Optimizations
5. Memoization
6. Performance Modeling
7. Runtime

Conference

ICPP '24

Acceptance Rates

Overall acceptance rate: 91 of 313 submissions (29%)
