Orchestrating Multiple Mixed Precision Models on a Shared Precision-Scalable NPU

Published: 20 June 2024
Abstract

Mixed-precision quantization can reduce the computational requirements of Deep Neural Network (DNN) models with minimal loss of accuracy. Because executing mixed-precision DNN models on Neural Processing Units (NPUs) incurs significant under-utilization of computational resources, Precision-Scalable NPUs (PSNPUs), which can process multiple low-precision layers simultaneously, have been proposed. However, under-utilization remains significant due to the lack of scheduling algorithms that adequately support multiple mixed-precision models on PSNPUs. In this paper, we therefore propose a dynamic programming-based scheduling algorithm for the operations of multiple mixed-precision models. Our algorithm finds the optimal execution plan, exploiting the precision-scalable MACs to improve the end-to-end inference latency of mixed-precision models. We evaluate it in terms of hardware utilization, inference latency, and schedule search time against baseline scheduling algorithms. The experimental results show a 1.23× improvement in inference latency over the baselines within a schedule search time budget of minutes.
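The abstract describes the algorithm only at a high level. As an illustration of the kind of dynamic program such a scheduler might solve for two models, here is a minimal Python sketch: each state (i, j) records how many layers of each model have completed, and each transition either runs the next layer of one model on the full MAC array or co-runs one layer from each model on a partitioned array. The BitFusion-style (B/w)² precision-scaling cost model, the half-and-half partition for co-executed layers, and all constants are illustrative assumptions, not the paper's actual formulation.

```python
from functools import lru_cache

# Assumed hardware parameters, for illustration only.
MAC_BITS = 8   # native bit-width of one fused MAC unit
NUM_PES = 256  # number of fused MAC units on the array

def cycles(macs, bits, pes=NUM_PES):
    """Cycles to run a layer of `macs` MAC ops at `bits`-bit precision.

    Assumes BitFusion-style decomposition: one MAC_BITS-bit fused unit
    performs (MAC_BITS // bits)**2 low-precision MACs per cycle when
    both operands are quantized to `bits` bits.
    """
    per_cycle = pes * (MAC_BITS // bits) ** 2
    return -(-macs // per_cycle)  # ceiling division

def schedule(model_a, model_b):
    """Minimal-makespan interleaving of two layer sequences.

    Each model is an in-order list of (macs, bits) layers. At every
    step we either run the next layer of one model on the whole array,
    or co-run the next layer of both models on half the array each
    (a simplifying assumption; the paper's co-execution constraints
    are more detailed). Returns total cycles of the best plan.
    """
    @lru_cache(maxsize=None)
    def best(i, j):
        if i == len(model_a) and j == len(model_b):
            return 0
        options = []
        if i < len(model_a):  # run A's next layer alone
            options.append(cycles(*model_a[i]) + best(i + 1, j))
        if j < len(model_b):  # run B's next layer alone
            options.append(cycles(*model_b[j]) + best(i, j + 1))
        if i < len(model_a) and j < len(model_b):  # co-run both halves
            co = max(cycles(*model_a[i], pes=NUM_PES // 2),
                     cycles(*model_b[j], pes=NUM_PES // 2))
            options.append(co + best(i + 1, j + 1))
        return min(options)

    return best(0, 0)

# Toy mixed-precision models: (MAC count, bit-width) per layer.
model_a = [(1 << 20, 8), (1 << 21, 4), (1 << 20, 2)]
model_b = [(1 << 21, 4), (1 << 20, 4), (1 << 21, 8)]
print(schedule(model_a, model_b), "cycles")
```

With memoization the search visits O(|A|·|B|) states, one per pair of layer progress counters, which suggests why a dynamic-programming formulation can stay within a search budget of minutes even for deep models.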


        Published In

LCTES 2024: Proceedings of the 25th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems
June 2024, 182 pages
ISBN: 9798400706165
DOI: 10.1145/3652032

This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

        Author Tags

        1. Mixed-precision Model
        2. Precision-scalable NPU
        3. Scheduling
        4. Throughput

        Qualifiers

        • Research-article

        Funding Sources

        • NRF
        • Institute for Information and communications Technology Promotion

        Conference

        LCTES '24

        Acceptance Rates

        Overall Acceptance Rate 116 of 438 submissions, 26%
