Orchestrating Multiple Mixed Precision Models on a Shared Precision-Scalable NPU

Published: 20 June 2024
Abstract

Mixed-precision quantization can reduce the computational requirements of Deep Neural Network (DNN) models with minimal loss of accuracy. Because executing mixed-precision DNN models on Neural Processing Units (NPUs) incurs significant under-utilization of computational resources, Precision-Scalable NPUs (PSNPUs), which can process multiple low-precision layers simultaneously, have been proposed. However, under-utilization remains significant due to the lack of scheduling algorithms that adequately support multiple mixed-precision models on PSNPUs. In this paper, we therefore propose a dynamic programming-based scheduling algorithm for the operations of multiple mixed-precision models. Our algorithm finds the optimal execution plan, exploiting the precision-scalable MACs to improve the end-to-end inference latency of mixed-precision models. We evaluate it in terms of hardware utilization, inference latency, and schedule search time against baseline scheduling algorithms. The experimental results show a 1.23× improvement in inference latency over the baselines within a schedule search time budget of minutes.
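The abstract describes the algorithm only at a high level. As an illustration of the kind of dynamic program such a scheduler might solve for two models, here is a minimal Python sketch: each state (i, j) records how many layers of each model have completed, and each transition either runs the next layer of one model on the full MAC array or co-runs one layer from each model on a partitioned array. The BitFusion-style (B/w)² precision-scaling cost model, the half-and-half partition for co-executed layers, and all constants are illustrative assumptions, not the paper's actual formulation.

```python
from functools import lru_cache

# Assumed hardware parameters, for illustration only.
MAC_BITS = 8   # native bit-width of one fused MAC unit
NUM_PES = 256  # number of fused MAC units on the array

def cycles(macs, bits, pes=NUM_PES):
    """Cycles to run a layer of `macs` MAC ops at `bits`-bit precision.

    Assumes BitFusion-style decomposition: one MAC_BITS-bit fused unit
    performs (MAC_BITS // bits)**2 low-precision MACs per cycle when
    both operands are quantized to `bits` bits.
    """
    per_cycle = pes * (MAC_BITS // bits) ** 2
    return -(-macs // per_cycle)  # ceiling division

def schedule(model_a, model_b):
    """Minimal-makespan interleaving of two layer sequences.

    Each model is an in-order list of (macs, bits) layers. At every
    step we either run the next layer of one model on the whole array,
    or co-run the next layer of both models on half the array each
    (a simplifying assumption; the paper's co-execution constraints
    are more detailed). Returns total cycles of the best plan.
    """
    @lru_cache(maxsize=None)
    def best(i, j):
        if i == len(model_a) and j == len(model_b):
            return 0
        options = []
        if i < len(model_a):  # run A's next layer alone
            options.append(cycles(*model_a[i]) + best(i + 1, j))
        if j < len(model_b):  # run B's next layer alone
            options.append(cycles(*model_b[j]) + best(i, j + 1))
        if i < len(model_a) and j < len(model_b):  # co-run both halves
            co = max(cycles(*model_a[i], pes=NUM_PES // 2),
                     cycles(*model_b[j], pes=NUM_PES // 2))
            options.append(co + best(i + 1, j + 1))
        return min(options)

    return best(0, 0)

# Toy mixed-precision models: (MAC count, bit-width) per layer.
model_a = [(1 << 20, 8), (1 << 21, 4), (1 << 20, 2)]
model_b = [(1 << 21, 4), (1 << 20, 4), (1 << 21, 8)]
print(schedule(model_a, model_b), "cycles")
```

With memoization the search visits O(|A|·|B|) states, one per pair of layer progress counters, which suggests why a dynamic-programming formulation can stay within a search budget of minutes even for deep models.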


        Published In

LCTES 2024: Proceedings of the 25th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems
June 2024, 182 pages
ISBN: 9798400706165
DOI: 10.1145/3652032

This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

        Author Tags

        1. Mixed-precision Model
        2. Precision-scalable NPU
        3. Scheduling
        4. Throughput

        Qualifiers

        • Research-article

        Funding Sources

        • NRF
        • Institute for Information and communications Technology Promotion

        Conference

        LCTES '24

        Acceptance Rates

        Overall Acceptance Rate 116 of 438 submissions, 26%
