TCX: A RISC Style Tensor Computing Extension and a Programmable Tensor Processor

Published: 19 April 2023
Abstract

    Neural network processors and accelerators are domain-specific architectures deployed to meet the high computational requirements of deep learning algorithms. This article proposes TCX, a new instruction set extension for tensor computing that augments Reduced Instruction Set Computer (RISC) instructions with variable-length tensor extensions. It features a multi-dimensional register file, dimension registers, and fully generic tensor instructions. It can be seamlessly integrated into existing RISC instruction set architectures and provides software compatibility across scalable hardware implementations. We present a tensor accelerator implementation of the tensor extensions using an out-of-order RISC microarchitecture. The tensor accelerator scales from several hundred to tens of thousands of computation units. An optimized register renaming mechanism is described that allows many physical tensor registers without requiring architectural support for large tensor register names. We describe new tensor load and store instructions that reduce bandwidth requirements by using tensor dimension registers. Implementations may balance data bandwidth and computation utilization for different types of tensor computations, such as element-wise, depthwise, and matrix multiplication. We characterize the computation precision of tensor operations to balance area, generality, and accuracy loss for several well-known neural networks. The TCX processor runs at 1 GHz and sustains 8.2 tera-operations per second (TOPS) using a 4,096 multiply-accumulate (MAC) compute unit. It occupies 12.8 mm² and dissipates 0.46 W/TOPS in TSMC 28-nm technology.
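
    The mechanisms named in the abstract, dimension registers that describe tensor shapes and generic tensor instructions operating on a multi-dimensional register file, can be illustrated with a small behavioral model. The Python sketch below is a hypothetical illustration, not the TCX encoding defined in the article: the names (TensorMachine, DimensionRegister, tld, tmul) and their semantics are assumptions made for exposition.

    ```python
    # Hypothetical behavioral model of a TCX-style tensor ISA.
    # Illustration only: names and semantics are assumptions, not the
    # paper's actual instruction encoding.
    import numpy as np

    class DimensionRegister:
        """Holds a tensor shape so data-movement instructions need not encode it."""
        def __init__(self, *dims):
            self.dims = dims

    class TensorMachine:
        """A multi-dimensional tensor register file plus dimension registers."""
        def __init__(self, num_tregs=8, num_dregs=4):
            self.treg = [None] * num_tregs  # tensor registers holding whole tiles
            self.dreg = [None] * num_dregs  # dimension registers holding shapes

        def set_dim(self, d, *dims):
            self.dreg[d] = DimensionRegister(*dims)

        def tld(self, t, memory, offset, d):
            # Tensor load: one instruction moves an entire tile whose shape
            # comes from dimension register d, cutting instruction bandwidth.
            shape = self.dreg[d].dims
            n = int(np.prod(shape))
            self.treg[t] = memory[offset:offset + n].reshape(shape)

        def tmul(self, dst, a, b):
            # Generic tensor multiply: matrix-multiplies two tensor registers.
            self.treg[dst] = self.treg[a] @ self.treg[b]

    # Usage: load two 4x4 tiles with single instructions, then multiply them.
    mem = np.arange(32, dtype=np.int32)
    m = TensorMachine()
    m.set_dim(0, 4, 4)    # dreg0 describes a 4x4 tile
    m.tld(0, mem, 0, 0)   # treg0 <- 4x4 tile at element offset 0
    m.tld(1, mem, 16, 0)  # treg1 <- 4x4 tile at element offset 16
    m.tmul(2, 0, 1)       # treg2 <- treg0 @ treg1
    print(m.treg[2])
    ```

    Because the tile shape lives in a dimension register rather than in each load instruction, the same opcode can serve element-wise, depthwise, and matrix-multiply tiles; this is the property the abstract credits for reducing bandwidth requirements and balancing bandwidth against compute utilization.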


    Cited By

    • (2023) HIPU: A Hybrid Intelligent Processing Unit With Fine-Grained ISA for Real-Time Deep Neural Network Inference Applications. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 31, 12 (2023), 1980–1993. DOI: 10.1109/TVLSI.2023.3327110. Online publication date: 8 November 2023.

    Published In

    ACM Transactions on Embedded Computing Systems, Volume 22, Issue 3
    May 2023
    546 pages
    ISSN: 1539-9087
    EISSN: 1558-3465
    DOI: 10.1145/3592782
    • Editor: Tulika Mitra

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 19 April 2023
    Online AM: 18 October 2022
    Accepted: 10 October 2022
    Revised: 31 August 2022
    Received: 12 February 2022
    Published in TECS Volume 22, Issue 3


    Author Tags

    1. Neural network accelerator
    2. Convolutional neural network
    3. ASIC design
    4. Tensor processor

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Scientific and Technological Innovation Foundation of Shunde Graduate School, USTB
    • Interdisciplinary research project of USTB
    • Fundamental Research Funds for the Central Universities
    • Foshan Higher Education Foundation
    • MAGICOM Platform of Beijing Advanced Innovation Center for Materials Genome Engineering
