DOI: 10.1145/3508352.3549367
research-article

Spatz: A Compact Vector Processing Unit for High-Performance and Energy-Efficient Shared-L1 Clusters

Published: 22 December 2022

Abstract

While parallel architectures based on clusters of Processing Elements (PEs) sharing L1 memory are widespread, there is no consensus on how lean their PEs should be. Architecting PEs as vector processors holds the promise to greatly reduce their instruction fetch bandwidth, mitigating the Von Neumann Bottleneck (VNB). However, due to their historical association with supercomputers, classical vector machines include microarchitectural tricks to improve the Instruction Level Parallelism (ILP), which increases their instruction fetch and decode energy overhead. In this paper, we explore for the first time vector processing as an option to build small and efficient PEs for large-scale shared-L1 clusters. We propose Spatz, a compact, modular 32-bit vector processing unit based on the integer embedded subset of the RISC-V Vector Extension version 1.0. A Spatz-based cluster with four Multiply-Accumulate Units (MACUs) needs only 7.9 pJ per 32-bit integer multiply-accumulate operation, 40% less energy than an equivalent cluster built with four Snitch scalar cores. We analyzed Spatz's performance by integrating it within MemPool, a large-scale many-core shared-L1 cluster. The Spatz-based MemPool system achieves up to 285 GOPS when running a 256 × 256 32-bit integer matrix multiplication, 70% more than the equivalent Snitch-based MemPool system. In terms of energy efficiency, the Spatz-based MemPool system achieves up to 266 GOPS/W when running the same kernel, more than twice the energy efficiency of the Snitch-based MemPool system, which reaches 128 GOPS/W. These results show the viability of lean vector processors as high-performance and energy-efficient PEs for large-scale clusters with tightly-coupled L1 memory.
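The relative figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is only illustrative: the function and constant names are hypothetical, and the Snitch baseline values it derives for energy per operation and throughput are implied by the rounded percentages quoted above, not numbers reported on this page.

```python
# Figures quoted in the abstract; nothing here is measured independently.
SPATZ_PJ_PER_MAC = 7.9      # pJ per 32-bit integer MAC, Spatz-based cluster
ENERGY_REDUCTION = 0.40     # "40% less energy" than the Snitch-based cluster
SPATZ_GOPS = 285.0          # peak throughput, Spatz-based MemPool
THROUGHPUT_GAIN = 0.70      # "70% more" than the Snitch-based MemPool
SPATZ_GOPS_PER_W = 266.0    # energy efficiency, Spatz-based MemPool
SNITCH_GOPS_PER_W = 128.0   # energy efficiency, Snitch-based MemPool


def implied_baseline_energy(new_value: float, reduction: float) -> float:
    """Baseline implied by 'new_value is `reduction` less than the baseline'."""
    return new_value / (1.0 - reduction)


def implied_baseline_throughput(new_value: float, gain: float) -> float:
    """Baseline implied by 'new_value is `gain` more than the baseline'."""
    return new_value / (1.0 + gain)


def efficiency_ratio(new_value: float, baseline: float) -> float:
    """How many times more efficient the new system is than the baseline."""
    return new_value / baseline


if __name__ == "__main__":
    # Derived (assumed) Snitch baselines; the quoted percentages are rounded,
    # so these are approximations only.
    snitch_pj = implied_baseline_energy(SPATZ_PJ_PER_MAC, ENERGY_REDUCTION)
    snitch_gops = implied_baseline_throughput(SPATZ_GOPS, THROUGHPUT_GAIN)
    ratio = efficiency_ratio(SPATZ_GOPS_PER_W, SNITCH_GOPS_PER_W)
    print(f"Implied Snitch energy per MAC: {snitch_pj:.1f} pJ")
    print(f"Implied Snitch throughput:     {snitch_gops:.1f} GOPS")
    print(f"Energy-efficiency ratio:       {ratio:.2f}x")
```

Under these assumptions the implied Snitch baselines come out at roughly 13.2 pJ per operation and 167.6 GOPS, and the efficiency ratio at about 2.08, which is consistent with the abstract's "more than twice" claim.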

        Published In

        ICCAD '22: Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design
        October 2022
        1467 pages
        ISBN:9781450392174
        DOI:10.1145/3508352

        Sponsors

        In-Cooperation

        • IEEE-EDS: Electronic Devices Society
        • IEEE CAS
        • IEEE CEDA

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. RISC-V vector extension
        2. SIMD
        3. many-core
        4. vector processing

        Qualifiers

        • Research-article

        Conference

        ICCAD '22: IEEE/ACM International Conference on Computer-Aided Design
        October 30 - November 3, 2022
        San Diego, California

        Acceptance Rates

        Overall Acceptance Rate 457 of 1,762 submissions, 26%

        Article Metrics

        • Downloads (last 12 months): 167
        • Downloads (last 6 weeks): 21
        Reflects downloads up to 04 Oct 2024

        Cited By

        • (2024) MX: Enhancing RISC-V's Vector ISA for Ultra-Low Overhead, Energy-Efficient Matrix Multiplication. 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1-6. DOI: 10.23919/DATE58400.2024.10546720. Online publication date: 25-Mar-2024
        • (2024) R-Blocks: an Energy-Efficient, Flexible, and Programmable CGRA. ACM Transactions on Reconfigurable Technology and Systems 17:2, 1-34. DOI: 10.1145/3656642. Online publication date: 10-May-2024
        • (2024) muRISCV-NN: Challenging Zve32x Autovectorization with TinyML Inference Library for RISC-V Vector Extension. Proceedings of the 21st ACM International Conference on Computing Frontiers: Workshops and Special Sessions, 75-78. DOI: 10.1145/3637543.3652878. Online publication date: 7-May-2024
        • (2024) Ara2: Exploring Single- and Multi-Core Vector Processing With an Efficient RVV 1.0 Compliant Open-Source Processor. IEEE Transactions on Computers 73:7, 1822-1836. DOI: 10.1109/TC.2024.3388896. Online publication date: Jul-2024
