Auto-tuning Fixed-point Precision with TVM on RISC-V Packed SIMD Extension

Published: 22 March 2023

Abstract

Today, as deep learning (DL) is applied more and more often in daily life, processors such as CPUs and GPUs have become essential for accelerating model execution. At the same time, people increasingly rely on edge devices such as mobile phones, smart watches, and VR headsets, and a variety of DL-based technologies are gradually being deployed on them. However, DL involves a large amount of computation, so providing efficient DL solutions on edge devices is a challenging problem. In this article, we propose a flow that enables the RISC-V Packed extension (P extension) in TVM. TVM, an open deep learning compiler for neural network models, is growing into a key piece of infrastructure for DL computing. RISC-V is an open instruction set architecture (ISA) with customizable and flexible features. The Packed-SIMD (P) extension enables subword single-instruction multiple-data (SIMD) computation on RISC-V architectures and can serve as a fallback engine for AI computing. In the proposed flow, a fixed-point type, implemented with the 16-bit integer type and saturation instructions, is added to replace the original 32-bit float type. In addition, an auto-tuning method based on a uniform selector mechanism (USM) is proposed to find the binary point position for the fixed-point type. The tensorization feature of TVM is used to map computations onto hardware-specific instructions such as the subword SIMD instructions of the RISC-V P extension. In our experiments on the Spike simulator, the proposed method with the USM improves performance by approximately 2.54 to 6.15× in terms of instruction counts, with little accuracy loss.
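To make the fixed-point flow concrete, the sketch below illustrates the two ingredients the abstract describes: a 16-bit Q-format fixed-point representation with saturation (the role played in hardware by the P extension's saturating instructions), and a uniform search over candidate binary point positions in the spirit of the USM. This is a minimal NumPy sketch under assumed details; the function names (`to_fixed`, `select_binary_point`) and the round-trip error metric are illustrative, not taken from the paper.

```python
import numpy as np

def to_fixed(x, frac_bits):
    # Quantize float32 values to int16 with `frac_bits` fractional bits,
    # saturating to the int16 range (a software stand-in for the
    # saturating arithmetic of the RISC-V P extension).
    scaled = np.round(x * np.float32(1 << frac_bits))
    return np.clip(scaled, -32768, 32767).astype(np.int16)

def to_float(q, frac_bits):
    # Dequantize int16 fixed-point values back to float32.
    return q.astype(np.float32) / np.float32(1 << frac_bits)

def select_binary_point(x, candidates=range(1, 16)):
    # Uniformly evaluate each candidate binary point position and keep
    # the one with the smallest mean round-trip error on the sample
    # tensor `x` (an assumed stand-in for the USM's selection rule).
    best, best_err = None, float("inf")
    for f in candidates:
        err = np.abs(x - to_float(to_fixed(x, f), f)).mean()
        if err < best_err:
            best, best_err = f, err
    return best

# Example: pick fractional bits for a random weight tensor.
weights = np.random.randn(64).astype(np.float32)
print("chosen fractional bits:", select_binary_point(weights))
```

A larger `frac_bits` gives finer resolution but saturates a narrower dynamic range; balancing that trade-off is exactly what the binary point search is for, so that the 16-bit fixed-point type can replace float32 with little accuracy loss.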



Published In

ACM Transactions on Design Automation of Electronic Systems, Volume 28, Issue 3
May 2023, 456 pages
ISSN: 1084-4309
EISSN: 1557-7309
DOI: 10.1145/3587887

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 22 March 2023
Online AM: 02 November 2022
Accepted: 12 October 2022
Revised: 04 October 2022
Received: 27 February 2022
Published in TODAES Volume 28, Issue 3


Author Tags

  1. TVM
  2. LLVM
  3. fixed-point
  4. RISC-V P extension
  5. Subword SIMD

Qualifiers

  • Research-article

Funding Sources

  • Taiwan NSTC


Cited By

  • (2024) Rewriting and Optimizing Vector Length Agnostic Intrinsics from Arm SVE to RVV. Workshop Proceedings of the 53rd International Conference on Parallel Processing, 38–47. DOI: 10.1145/3677333.3678151. Online publication date: 12-Aug-2024.
  • (2024) Fixed-point Encoding and Architecture Exploration for Residue Number Systems. ACM Transactions on Architecture and Code Optimization 21, 3, 1–27. DOI: 10.1145/3664923. Online publication date: 14-May-2024.
  • (2024) Deep-Learning-Based Pre-Layout Parasitic Capacitance Prediction on SRAM Designs. Proceedings of the Great Lakes Symposium on VLSI 2024, 440–445. DOI: 10.1145/3649476.3658754. Online publication date: 12-Jun-2024.
  • (2024) Efficient Adder Designs for Realizing Addition Subset of RISC-V P-SIMD Instructions. 2024 IEEE Recent Advances in Intelligent Computational Systems (RAICS), 1–6. DOI: 10.1109/RAICS61201.2024.10689726. Online publication date: 16-May-2024.
  • (2023) Overflow-free Compute Memories for Edge AI Acceleration. ACM Transactions on Embedded Computing Systems 22, 5s, 1–23. DOI: 10.1145/3609387. Online publication date: 9-Sep-2023.
  • (2023) Hardware-Aware Quantization and Performance Evaluation for Tensor Accelerator VTA. 2023 China Automation Congress (CAC), 8198–8203. DOI: 10.1109/CAC59555.2023.10450767. Online publication date: 17-Nov-2023.
  • (2023) Accelerating AI performance with the incorporation of TVM and MediaTek NeuroPilot. Connection Science 35, 1. DOI: 10.1080/09540091.2023.2272586. Online publication date: 30-Oct-2023.
