TCX: A RISC Style Tensor Computing Extension and a Programmable Tensor Processor

Published: 19 April 2023
Abstract

    Neural network processors and accelerators are domain-specific architectures deployed to meet the high computational requirements of deep learning algorithms. This article proposes TCX, a new instruction set extension for tensor computing that augments Reduced Instruction Set Computer (RISC) instructions with variable-length tensor extensions. It features a multi-dimensional register file, dimension registers, and fully generic tensor instructions. It can be seamlessly integrated into existing RISC instruction set architectures and provides software compatibility across scalable hardware implementations. We present a tensor accelerator implementation of the tensor extensions using an out-of-order RISC microarchitecture. The tensor accelerator scales from several hundred to tens of thousands of computation units. An optimized register renaming mechanism is described that allows many physical tensor registers without requiring architectural support for large tensor register names. We describe new tensor load and store instructions that reduce bandwidth requirements by using tensor dimension registers. Implementations may balance data bandwidth and computation utilization for different types of tensor computations, such as element-wise, depthwise, and matrix multiplication. We characterize the computation precision of tensor operations to balance area, generality, and accuracy loss for several well-known neural networks. The TCX processor runs at 1 GHz and sustains 8.2 tera-operations per second (TOPS) using a 4,096 multiply-accumulate (MAC) compute unit. It occupies 12.8 mm² and dissipates 0.46 W/TOPS in TSMC 28-nm technology.
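
    The mechanisms named in the abstract, dimension registers that describe tensor shapes and generic tensor instructions operating on a multi-dimensional register file, can be illustrated with a small behavioral model. The Python sketch below is a hypothetical illustration, not the TCX encoding defined in the article: the names (TensorMachine, DimensionRegister, tld, tmul) and their semantics are assumptions made for exposition.

    ```python
    # Hypothetical behavioral model of a TCX-style tensor ISA.
    # Illustration only: names and semantics are assumptions, not the
    # paper's actual instruction encoding.
    import numpy as np

    class DimensionRegister:
        """Holds a tensor shape so data-movement instructions need not encode it."""
        def __init__(self, *dims):
            self.dims = dims

    class TensorMachine:
        """A multi-dimensional tensor register file plus dimension registers."""
        def __init__(self, num_tregs=8, num_dregs=4):
            self.treg = [None] * num_tregs  # tensor registers holding whole tiles
            self.dreg = [None] * num_dregs  # dimension registers holding shapes

        def set_dim(self, d, *dims):
            self.dreg[d] = DimensionRegister(*dims)

        def tld(self, t, memory, offset, d):
            # Tensor load: one instruction moves an entire tile whose shape
            # comes from dimension register d, cutting instruction bandwidth.
            shape = self.dreg[d].dims
            n = int(np.prod(shape))
            self.treg[t] = memory[offset:offset + n].reshape(shape)

        def tmul(self, dst, a, b):
            # Generic tensor multiply: matrix-multiplies two tensor registers.
            self.treg[dst] = self.treg[a] @ self.treg[b]

    # Usage: load two 4x4 tiles with single instructions, then multiply them.
    mem = np.arange(32, dtype=np.int32)
    m = TensorMachine()
    m.set_dim(0, 4, 4)    # dreg0 describes a 4x4 tile
    m.tld(0, mem, 0, 0)   # treg0 <- 4x4 tile at element offset 0
    m.tld(1, mem, 16, 0)  # treg1 <- 4x4 tile at element offset 16
    m.tmul(2, 0, 1)       # treg2 <- treg0 @ treg1
    print(m.treg[2])
    ```

    Because the tile shape lives in a dimension register rather than in each load instruction, the same opcode can serve element-wise, depthwise, and matrix-multiply tiles; this is the property the abstract credits for reducing bandwidth requirements and balancing bandwidth against compute utilization.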


    Cited By

    • (2023) HIPU: A Hybrid Intelligent Processing Unit With Fine-Grained ISA for Real-Time Deep Neural Network Inference Applications. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 31, 12 (2023), 1980–1993. DOI: 10.1109/TVLSI.2023.3327110. Online publication date: 8 November 2023.

    Published In

    ACM Transactions on Embedded Computing Systems, Volume 22, Issue 3
    May 2023
    546 pages
    ISSN: 1539-9087
    EISSN: 1558-3465
    DOI: 10.1145/3592782
    • Editor: Tulika Mitra

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 19 April 2023
    Online AM: 18 October 2022
    Accepted: 10 October 2022
    Revised: 31 August 2022
    Received: 12 February 2022
    Published in TECS Volume 22, Issue 3


    Author Tags

    1. Neural network accelerator
    2. Convolutional neural network
    3. ASIC design
    4. Tensor processor

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Scientific and Technological Innovation Foundation of Shunde Graduate School, USTB
    • Interdisciplinary research project of USTB
    • Fundamental Research Funds for the Central Universities
    • Foshan Higher Education Foundation
    • MAGICOM Platform of Beijing Advanced Innovation Center for Materials Genome Engineering
