
Occamy: Elastically Sharing a SIMD Co-processor across Multiple CPU Cores

Published: 25 March 2023

Abstract

SIMD extensions are widely adopted in multi-core processors to exploit data-level parallelism. However, when workloads co-run on different cores, compute-intensive workloads cannot take advantage of the underutilized SIMD lanes allocated to memory-intensive workloads, reducing overall performance. This paper proposes Occamy, a SIMD co-processor that can be shared by multiple CPU cores so that their co-running workloads can spatially share its SIMD lanes. The key idea is to enable elastic spatial sharing by dynamically partitioning all the SIMD lanes across different workloads based on their phase behaviors, so that each workload may execute in a variable-length SIMD mode. We also introduce the Occamy compiler, which supports such variable-length vectorization by analyzing these phase behaviors and generating vectorized code that works with varying vector lengths. We demonstrate that Occamy improves SIMD utilization, and consequently performance, over three representative SIMD architectures with negligible chip area cost.
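
To give a rough sense of what the abstract's "variable-length SIMD mode" implies for generated code, the sketch below shows a strip-mined loop in plain C that re-queries the number of available lanes on every pass. This is only an illustration under stated assumptions, not the Occamy compiler's actual output: the occamy_get_vl() helper, its signature, and the maximum lane grant of 8 are hypothetical stand-ins for however the hardware exposes its elastic lane partition.

    #include <stddef.h>

    /* Hypothetical stand-in for a query of the SIMD lane count currently
     * granted to this core. Occamy's real elastic partitioning happens in
     * hardware and is not modeled here; this stub simply caps the grant at
     * an assumed maximum of 8 lanes. */
    static size_t occamy_get_vl(size_t remaining)
    {
        const size_t max_lanes = 8;
        return remaining < max_lanes ? remaining : max_lanes;
    }

    /* Strip-mined SAXPY written against a variable vector length: each pass
     * processes however many lanes are granted at that moment, so the same
     * loop keeps working if the lane allocation changes between passes. */
    void saxpy_vla(size_t n, float a, const float *x, float *y)
    {
        size_t i = 0;
        while (i < n) {
            size_t vl = occamy_get_vl(n - i);  /* re-queried every pass */
            for (size_t j = 0; j < vl; j++)    /* stands in for one vector op */
                y[i + j] += a * x[i + j];
            i += vl;
        }
    }

In this style, nothing in the loop body assumes a fixed lane count, which is what allows the hardware to repartition lanes among co-running workloads without recompilation.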

Cited By

  • (2024) ChameSC: Virtualizing Superscalar Core of a SIMD Architecture for Vector Memory Access. In 2024 IEEE 42nd International Conference on Computer Design (ICCD), 52-59. https://doi.org/10.1109/ICCD63220.2024.00019 (online publication date: 18 November 2024)
  • (2024) Spatzformer: An Efficient Reconfigurable Dual-Core RISC-V V Cluster for Mixed Scalar-Vector Workloads. In 2024 IEEE 35th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 172-173. https://doi.org/10.1109/ASAP61560.2024.00042 (online publication date: 24 July 2024)

    Published In

    ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3
    March 2023
    820 pages
    ISBN: 9781450399180
    DOI: 10.1145/3582016

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Architecture
    2. Auto Vectorization
    3. SIMD

    Qualifiers

    • Research-article

    Conference

    ASPLOS '23

    Acceptance Rates

    Overall Acceptance Rate 535 of 2,713 submissions, 20%

    Article Metrics

    • Downloads (Last 12 months): 658
    • Downloads (Last 6 weeks): 46
    Reflects downloads up to 31 Jan 2025
