
Occamy: Elastically Sharing a SIMD Co-processor across Multiple CPU Cores

Published: 25 March 2023

Abstract

SIMD extensions are widely adopted in multi-core processors to exploit data-level parallelism. However, when workloads co-run on different cores, compute-intensive workloads cannot take advantage of the underutilized SIMD lanes allocated to memory-intensive workloads, reducing overall performance. This paper proposes Occamy, a SIMD co-processor that can be shared by multiple CPU cores so that their co-running workloads can spatially share its SIMD lanes. The key idea is to enable elastic spatial sharing by dynamically partitioning all the SIMD lanes across different workloads based on their phase behaviors, so that each workload may execute in a variable-length SIMD mode. We also introduce the Occamy compiler, which supports such variable-length vectorization by analyzing these phase behaviors and generating vectorized code that works with varying vector lengths. We demonstrate that Occamy improves SIMD utilization, and consequently performance, over three representative SIMD architectures with negligible chip area cost.
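
To give a rough sense of what the abstract's "variable-length SIMD mode" implies for generated code, the sketch below shows a strip-mined loop in plain C that re-queries the number of available lanes on every pass. This is only an illustration under stated assumptions, not the Occamy compiler's actual output: the occamy_get_vl() helper, its signature, and the maximum lane grant of 8 are hypothetical stand-ins for however the hardware exposes its elastic lane partition.

    #include <stddef.h>

    /* Hypothetical stand-in for a query of the SIMD lane count currently
     * granted to this core. Occamy's real elastic partitioning happens in
     * hardware and is not modeled here; this stub simply caps the grant at
     * an assumed maximum of 8 lanes. */
    static size_t occamy_get_vl(size_t remaining)
    {
        const size_t max_lanes = 8;
        return remaining < max_lanes ? remaining : max_lanes;
    }

    /* Strip-mined SAXPY written against a variable vector length: each pass
     * processes however many lanes are granted at that moment, so the same
     * loop keeps working if the lane allocation changes between passes. */
    void saxpy_vla(size_t n, float a, const float *x, float *y)
    {
        size_t i = 0;
        while (i < n) {
            size_t vl = occamy_get_vl(n - i);  /* re-queried every pass */
            for (size_t j = 0; j < vl; j++)    /* stands in for one vector op */
                y[i + j] += a * x[i + j];
            i += vl;
        }
    }

In this style, nothing in the loop body assumes a fixed lane count, which is what allows the hardware to repartition lanes among co-running workloads without recompilation.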

Cited By

  • (2024) ChameSC: Virtualizing Superscalar Core of a SIMD Architecture for Vector Memory Access. In 2024 IEEE 42nd International Conference on Computer Design (ICCD), 52-59. https://doi.org/10.1109/ICCD63220.2024.00019 (online publication date: 18 November 2024)
  • (2024) Spatzformer: An Efficient Reconfigurable Dual-Core RISC-V V Cluster for Mixed Scalar-Vector Workloads. In 2024 IEEE 35th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 172-173. https://doi.org/10.1109/ASAP61560.2024.00042 (online publication date: 24 July 2024)

    Published In

    ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3
    March 2023
    820 pages
    ISBN: 9781450399180
    DOI: 10.1145/3582016

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Architecture
    2. Auto Vectorization
    3. SIMD

    Qualifiers

    • Research-article

    Conference

    ASPLOS '23

    Acceptance Rates

    Overall Acceptance Rate 535 of 2,713 submissions, 20%

    Article Metrics

    • Downloads (Last 12 months): 658
    • Downloads (Last 6 weeks): 46
    Reflects downloads up to 31 Jan 2025
