Software-Defined Vector Processing on Manycore Fabrics

Authors:

Philip Bedoukian,

Adrian Sampson

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 392 - 406

https://doi.org/10.1145/3466752.3480099

Published: 17 October 2021

Abstract

We describe a tiled architecture that can fluidly transition between manycore (MIMD) and vector (SIMD) execution. The hardware provides a software-defined vector programming model that lets applications aggregate groups of manycore tiles into logical vector engines. In manycore mode, the machine behaves as a standard parallel processor. In vector mode, groups of tiles repurpose their functional units as vector execution lanes and scratchpads as vector memory banks. The key mechanism is an instruction forwarding network: a single tile fetches instructions and sends them to other trailing cores. Most cores disable their frontends and instruction caches, so vector groups amortize the intrinsic hardware costs of von Neumann control. Vector groups also use a decoupled access/execute scheme to centralize their memory requests and issue coalesced, wide loads.
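The instruction forwarding idea can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the paper's hardware design: the `Tile` class, the four-operation instruction set, and the choice to have the fetching tile also execute each instruction are all hypothetical simplifications, and the decoupled access/execute memory path is not modeled. The sketch counts how many instruction fetches each mode performs for the same program.

```python
from dataclasses import dataclass


@dataclass
class Tile:
    """Hypothetical tile: one accumulator register plus a local scratchpad."""
    scratchpad: list
    acc: int = 0
    fetches: int = 0  # instructions this tile fetched with its own frontend

    def execute(self, op, arg):
        if op == "load":
            self.acc = self.scratchpad[arg]
        elif op == "mul":
            self.acc *= arg
        elif op == "add":
            self.acc += arg
        elif op == "store":
            self.scratchpad[arg] = self.acc
        else:
            raise ValueError(f"unknown op: {op}")


def run_mimd(tiles, program):
    """Manycore mode: every tile fetches and executes the program itself."""
    for tile in tiles:
        for op, arg in program:
            tile.fetches += 1
            tile.execute(op, arg)


def run_vector(tiles, program):
    """Vector mode: the first tile fetches and forwards to trailing tiles."""
    leader, trailing = tiles[0], tiles[1:]
    for op, arg in program:
        leader.fetches += 1       # only the leader uses its frontend
        leader.execute(op, arg)   # assumption: the leader also acts as a lane
        for tile in trailing:     # forwarded instruction, no local fetch
            tile.execute(op, arg)


def make_tiles(n):
    # Each tile's scratchpad holds one input element and one output slot.
    return [Tile(scratchpad=[i, 0]) for i in range(n)]


# Compute 3 * x + 1 on each tile's local element.
PROGRAM = [("load", 0), ("mul", 3), ("add", 1), ("store", 1)]

mimd_tiles = make_tiles(4)
run_mimd(mimd_tiles, PROGRAM)

vector_tiles = make_tiles(4)
run_vector(vector_tiles, PROGRAM)

mimd_results = [t.scratchpad[1] for t in mimd_tiles]
vector_results = [t.scratchpad[1] for t in vector_tiles]
mimd_fetches = sum(t.fetches for t in mimd_tiles)
vector_fetches = sum(t.fetches for t in vector_tiles)
```

In this toy model both modes produce the same per-tile results, while the vector group performs one fetch per instruction instead of one per instruction per tile. That is the amortization of frontend work the abstract describes, shown here only as a count and not as an energy estimate.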

We augment an existing RISC-V manycore design with a minimal hardware extension to implement software-defined vectors. Cycle-level simulation results show that software-defined vectors improve performance by an average of 1.7× over standard MIMD execution while saving 22% of the energy. Compared to a similarly configured GPU, the architecture improves performance by 1.9×.

Published In

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture

October 2021

1322 pages

ISBN:9781450385572

DOI:10.1145/3466752

Copyright © 2021 ACM.

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

MICRO '21

Sponsor:

SIGMICRO

MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture

October 18 - 22, 2021

Virtual Event, Greece

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%
