Cohmeleon: Learning-Based Orchestration of Accelerator Coherence in Heterogeneous SoCs

Authors: Joseph Zuckerman, Paolo Mantovani, and Luca P. Carloni

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture

October 2021

Pages 350 - 365

https://doi.org/10.1145/3466752.3480065

Published: 17 October 2021

Abstract

One of the most critical aspects of integrating loosely-coupled accelerators in heterogeneous SoC architectures is orchestrating their interactions with the memory hierarchy, especially in terms of navigating the various cache-coherence options: from accelerators accessing off-chip memory directly, bypassing the cache hierarchy, to accelerators having their own private cache. By running real-size applications on FPGA-based prototypes of many-accelerator multi-core SoCs, we show that the best cache-coherence mode for a given accelerator varies at runtime, depending on the accelerator’s characteristics, the workload size, and the overall SoC status.

Cohmeleon applies reinforcement learning to select the best coherence mode for each accelerator dynamically at runtime, as opposed to statically at design time. It makes these selections adaptively, by continuously observing the system and measuring its performance. Cohmeleon is accelerator-agnostic, architecture-independent, and it requires minimal hardware support. Cohmeleon is also transparent to application programmers and has a negligible software overhead. FPGA-based experiments show that our runtime approach offers, on average, a 38% speedup with a 66% reduction of off-chip memory accesses compared to state-of-the-art design-time approaches. Moreover, it can match runtime solutions that are manually tuned for the target architecture.
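The runtime selection described above can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a simplified tabular, bandit-style value update with epsilon-greedy exploration, chosen here only for illustration. The mode names, the `CoherenceSelector` class, the workload-size states, the latency numbers, and the hyperparameters are all assumptions, not details taken from the paper.

```python
import random
from collections import defaultdict

# Hypothetical mode names: the abstract describes only the two extremes
# (direct off-chip access that bypasses caches, and a private cache).
MODES = ("non_coherent_dma", "private_cache")


class CoherenceSelector:
    """Epsilon-greedy selector with one value estimate per (state, mode)."""

    def __init__(self, modes=MODES, alpha=0.5, epsilon=0.1, seed=0):
        self.modes = tuple(modes)
        self.alpha = alpha      # learning rate (assumed value)
        self.epsilon = epsilon  # exploration probability (assumed value)
        self.rng = random.Random(seed)
        self.q = defaultdict(lambda: {m: 0.0 for m in self.modes})

    def greedy(self, state):
        """Return the mode with the highest current estimate for this state."""
        values = self.q[state]
        return max(self.modes, key=lambda m: values[m])

    def choose(self, state):
        """Explore with probability epsilon, otherwise pick the greedy mode."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.modes)
        return self.greedy(state)

    def update(self, state, mode, reward):
        """Move the estimate for (state, mode) toward the observed reward."""
        old = self.q[state][mode]
        self.q[state][mode] = old + self.alpha * (reward - old)


def toy_latency(state, mode):
    """Made-up latencies: caching helps small workloads, hurts large ones."""
    table = {
        ("small", "non_coherent_dma"): 4.0,
        ("small", "private_cache"): 1.0,
        ("large", "non_coherent_dma"): 2.0,
        ("large", "private_cache"): 5.0,
    }
    return table[(state, mode)]


def train_toy(steps=200):
    """Alternate between two workload sizes and learn from measured latency."""
    selector = CoherenceSelector()
    for step in range(steps):
        state = "small" if step % 2 == 0 else "large"
        mode = selector.choose(state)
        # Reward is negative latency, so faster invocations score higher.
        selector.update(state, mode, reward=-toy_latency(state, mode))
    return selector
```

In this toy setup, calling `train_toy()` and then `greedy("small")` or `greedy("large")` on the result should show the selector settling on a different mode for each workload size. That mirrors the abstract's observation that the best mode varies with the workload. A real system would build its state from observed SoC status and its reward from measured performance.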

Published In

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture

October 2021

1322 pages

ISBN: 9781450385572

DOI: 10.1145/3466752

Copyright © 2021 ACM.

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

NSF (National Science Foundation)
DARPA

Conference

MICRO '21

Sponsor:

SIGMICRO

MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture

October 18 - 22, 2021

Virtual Event, Greece

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%
