research-article

Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters

Authors: Matheus Cavalcante, Domenic Wüthrich, Matteo Perotti, Luca Benini

ICCAD '22: Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design

Article No.: 22, Pages 1 - 9

https://doi.org/10.1145/3508352.3549367

Published: 22 December 2022

Abstract

While parallel architectures based on clusters of Processing Elements (PEs) sharing L1 memory are widespread, there is no consensus on how lean their PEs should be. Architecting PEs as vector processors holds the promise of greatly reducing their instruction fetch bandwidth, mitigating the Von Neumann Bottleneck (VNB). However, due to their historical association with supercomputers, classical vector machines include microarchitectural tricks to improve Instruction Level Parallelism (ILP), which increase their instruction fetch and decode energy overhead. In this paper, we explore for the first time vector processing as an option to build small and efficient PEs for large-scale shared-L1 clusters. We propose Spatz, a compact, modular 32-bit vector processing unit based on the integer embedded subset of the RISC-V Vector Extension version 1.0. A Spatz-based cluster with four Multiply-Accumulate Units (MACUs) needs only 7.9 pJ per 32-bit integer multiply-accumulate operation, 40% less energy than an equivalent cluster built with four Snitch scalar cores. We analyzed Spatz's performance by integrating it within MemPool, a large-scale many-core shared-L1 cluster. The Spatz-based MemPool system achieves up to 285 GOPS when running a 256 × 256 32-bit integer matrix multiplication, 70% more than the equivalent Snitch-based MemPool system. In terms of energy efficiency, the Spatz-based MemPool system achieves up to 266 GOPS/W when running the same kernel, more than twice the energy efficiency of the Snitch-based MemPool system, which reaches 128 GOPS/W. These results show the viability of lean vector processors as high-performance and energy-efficient PEs for large-scale clusters with tightly-coupled L1 memory.
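As a rough sanity check on how the figures in the abstract relate to each other, the sketch below derives the Snitch-side values that the quoted percentages imply. The variable names are illustrative, and the derived Snitch energy and throughput are back-of-the-envelope estimates from the rounded percentages, not numbers reported in the abstract.

```python
# Figures quoted in the abstract.
spatz_pj_per_mac = 7.9       # pJ per 32-bit integer MAC (Spatz cluster)
energy_saving = 0.40         # "40% less energy" than the Snitch cluster
spatz_gops = 285.0           # peak throughput, 256 x 256 matmul
throughput_gain = 0.70       # "70% more" than Snitch-based MemPool
spatz_gops_per_w = 266.0     # peak energy efficiency, Spatz-based MemPool
snitch_gops_per_w = 128.0    # peak energy efficiency, Snitch-based MemPool

# Derived estimates (assumptions: the percentages are exact and are
# taken relative to the Snitch baseline).
implied_snitch_pj_per_mac = spatz_pj_per_mac / (1.0 - energy_saving)
implied_snitch_gops = spatz_gops / (1.0 + throughput_gain)
efficiency_ratio = spatz_gops_per_w / snitch_gops_per_w

print(f"Implied Snitch energy per MAC: {implied_snitch_pj_per_mac:.2f} pJ")
print(f"Implied Snitch throughput:     {implied_snitch_gops:.1f} GOPS")
print(f"Energy-efficiency ratio:       {efficiency_ratio:.2f}x")
```

The efficiency ratio comes out slightly above 2, consistent with the abstract's "more than twice" claim.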

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures
      2. Single instruction, multiple data
    2. Serial architectures
      1. Pipeline computing

Published In

ICCAD '22: Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design

October 2022

1467 pages

ISBN: 9781450392174

DOI: 10.1145/3508352

Conference Chair: Tulika Mitra (National University of Singapore)

Program Chairs: Evangeline Young (The Chinese University of Hong Kong), Jinjun Xiong (University at Buffalo (UB))

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGDA: ACM Special Interest Group on Design Automation

In-Cooperation

IEEE-EDS: Electronic Devices Society
IEEE CAS
IEEE CEDA

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ICCAD '22: IEEE/ACM International Conference on Computer-Aided Design

Sponsor: SIGDA

October 30 - November 3, 2022

San Diego, California

Acceptance Rates

Overall Acceptance Rate 457 of 1,762 submissions, 26%
