Auto-tuning Fixed-point Precision with TVM on RISC-V Packed SIMD Extension

Published: 22 March 2023

Abstract

Today, as deep learning (DL) is applied more and more often in daily life, processors such as CPUs and GPUs have become essential for accelerating model execution. At the same time, people increasingly rely on edge devices such as mobile phones, smart watches, and VR headsets, and a variety of DL-based technologies are gradually being deployed on them. However, DL involves a large amount of computation, so providing efficient DL solutions on edge devices is a challenging problem. In this article, we propose a flow that enables the RISC-V Packed extension (P extension) in TVM. TVM, an open deep learning compiler for neural network models, is growing into a key piece of infrastructure for DL computing. RISC-V is an open instruction set architecture (ISA) with customizable and flexible features. The Packed-SIMD (P) extension enables subword single-instruction multiple-data (SIMD) computation on RISC-V architectures and can serve as a fallback engine for AI computing. In the proposed flow, a fixed-point type, implemented with the 16-bit integer type and saturation instructions, is added to replace the original 32-bit float type. In addition, an auto-tuning method based on a uniform selector mechanism (USM) is proposed to find the binary point position for the fixed-point type. The tensorization feature of TVM is used to map computations onto hardware-specific instructions such as the subword SIMD instructions of the RISC-V P extension. In our experiments on the Spike simulator, the proposed method with the USM improves performance by approximately 2.54 to 6.15× in terms of instruction counts, with little accuracy loss.
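To make the fixed-point flow concrete, the sketch below illustrates the two ingredients the abstract describes: a 16-bit Q-format fixed-point representation with saturation (the role played in hardware by the P extension's saturating instructions), and a uniform search over candidate binary point positions in the spirit of the USM. This is a minimal NumPy sketch under assumed details; the function names (`to_fixed`, `select_binary_point`) and the round-trip error metric are illustrative, not taken from the paper.

```python
import numpy as np

def to_fixed(x, frac_bits):
    # Quantize float32 values to int16 with `frac_bits` fractional bits,
    # saturating to the int16 range (a software stand-in for the
    # saturating arithmetic of the RISC-V P extension).
    scaled = np.round(x * np.float32(1 << frac_bits))
    return np.clip(scaled, -32768, 32767).astype(np.int16)

def to_float(q, frac_bits):
    # Dequantize int16 fixed-point values back to float32.
    return q.astype(np.float32) / np.float32(1 << frac_bits)

def select_binary_point(x, candidates=range(1, 16)):
    # Uniformly evaluate each candidate binary point position and keep
    # the one with the smallest mean round-trip error on the sample
    # tensor `x` (an assumed stand-in for the USM's selection rule).
    best, best_err = None, float("inf")
    for f in candidates:
        err = np.abs(x - to_float(to_fixed(x, f), f)).mean()
        if err < best_err:
            best, best_err = f, err
    return best

# Example: pick fractional bits for a random weight tensor.
weights = np.random.randn(64).astype(np.float32)
print("chosen fractional bits:", select_binary_point(weights))
```

A larger `frac_bits` gives finer resolution but saturates a narrower dynamic range; balancing that trade-off is exactly what the binary point search is for, so that the 16-bit fixed-point type can replace float32 with little accuracy loss.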



Published In

ACM Transactions on Design Automation of Electronic Systems, Volume 28, Issue 3
May 2023, 456 pages
ISSN: 1084-4309
EISSN: 1557-7309
DOI: 10.1145/3587887

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 22 March 2023
Online AM: 02 November 2022
Accepted: 12 October 2022
Revised: 04 October 2022
Received: 27 February 2022
Published in TODAES Volume 28, Issue 3


Author Tags

  1. TVM
  2. LLVM
  3. fixed-point
  4. RISC-V P extension
  5. Subword SIMD

Qualifiers

  • Research-article

Funding Sources

  • Taiwan NSTC


Cited By

  • (2024) Rewriting and Optimizing Vector Length Agnostic Intrinsics from Arm SVE to RVV. Workshop Proceedings of the 53rd International Conference on Parallel Processing, 38–47. DOI: 10.1145/3677333.3678151. Online publication date: 12-Aug-2024.
  • (2024) Fixed-point Encoding and Architecture Exploration for Residue Number Systems. ACM Transactions on Architecture and Code Optimization 21, 3, 1–27. DOI: 10.1145/3664923. Online publication date: 14-May-2024.
  • (2024) Deep-Learning-Based Pre-Layout Parasitic Capacitance Prediction on SRAM Designs. Proceedings of the Great Lakes Symposium on VLSI 2024, 440–445. DOI: 10.1145/3649476.3658754. Online publication date: 12-Jun-2024.
  • (2024) Efficient Adder Designs for Realizing Addition Subset of RISC-V P-SIMD Instructions. 2024 IEEE Recent Advances in Intelligent Computational Systems (RAICS), 1–6. DOI: 10.1109/RAICS61201.2024.10689726. Online publication date: 16-May-2024.
  • (2023) Overflow-free Compute Memories for Edge AI Acceleration. ACM Transactions on Embedded Computing Systems 22, 5s, 1–23. DOI: 10.1145/3609387. Online publication date: 9-Sep-2023.
  • (2023) Hardware-Aware Quantization and Performance Evaluation for Tensor Accelerator VTA. 2023 China Automation Congress (CAC), 8198–8203. DOI: 10.1109/CAC59555.2023.10450767. Online publication date: 17-Nov-2023.
  • (2023) Accelerating AI performance with the incorporation of TVM and MediaTek NeuroPilot. Connection Science 35, 1. DOI: 10.1080/09540091.2023.2272586. Online publication date: 30-Oct-2023.
