DOI: 10.1145/3466752.3480065

Cohmeleon: Learning-Based Orchestration of Accelerator Coherence in Heterogeneous SoCs

Published: 17 October 2021

Abstract

    One of the most critical aspects of integrating loosely-coupled accelerators in heterogeneous SoC architectures is orchestrating their interactions with the memory hierarchy, especially in terms of navigating the various cache-coherence options: from accelerators accessing off-chip memory directly, bypassing the cache hierarchy, to accelerators having their own private cache. By running real-size applications on FPGA-based prototypes of many-accelerator multi-core SoCs, we show that the best cache-coherence mode for a given accelerator varies at runtime, depending on the accelerator’s characteristics, the workload size, and the overall SoC status.
    Cohmeleon applies reinforcement learning to select the best coherence mode for each accelerator dynamically at runtime, as opposed to statically at design time. It makes these selections adaptively, by continuously observing the system and measuring its performance. Cohmeleon is accelerator-agnostic, architecture-independent, and it requires minimal hardware support. Cohmeleon is also transparent to application programmers and has a negligible software overhead. FPGA-based experiments show that our runtime approach offers, on average, a 38% speedup with a 66% reduction of off-chip memory accesses compared to state-of-the-art design-time approaches. Moreover, it can match runtime solutions that are manually tuned for the target architecture.
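
    To make the idea of learning a coherence mode at runtime concrete, here is a minimal sketch of a tabular, epsilon-greedy selector. It is not the paper's implementation. The mode names, the state labels ("small", "large"), the reward values, and the class and function names are all hypothetical, and the update rule is a simplified, bandit-style value update (no next-state term) rather than full Q-learning.

```python
import random

# Hypothetical mode names: the abstract only describes the two extremes
# (bypassing the cache hierarchy vs. a private accelerator cache).
MODES = ["bypass_cache_dma", "shared_cache_dma", "private_cache"]


class CoherenceSelector:
    """Toy epsilon-greedy selector with one value estimate per (state, mode)."""

    def __init__(self, modes, alpha=0.25, epsilon=0.1, seed=0):
        self.modes = list(modes)
        self.alpha = alpha        # learning rate
        self.epsilon = epsilon    # exploration probability
        self.rng = random.Random(seed)
        self.q = {}               # (state, mode) -> estimated reward

    def best(self, state):
        # Greedy choice: the mode with the highest current estimate.
        return max(self.modes, key=lambda m: self.q.get((state, m), 0.0))

    def select(self, state):
        # Explore occasionally, otherwise exploit the current estimates.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.modes)
        return self.best(state)

    def update(self, state, mode, reward):
        # Move the estimate a fraction alpha toward the observed reward.
        old = self.q.get((state, mode), 0.0)
        self.q[(state, mode)] = old + self.alpha * (reward - old)


def toy_reward(state, mode):
    # Made-up environment: small workloads favor a private cache,
    # large workloads favor bypassing the caches.
    preferred = "private_cache" if state == "small" else "bypass_cache_dma"
    return 1.0 if mode == preferred else 0.2


def train(selector, steps=2000):
    # Alternate between the two hypothetical workload states.
    for i in range(steps):
        state = "small" if i % 2 == 0 else "large"
        mode = selector.select(state)
        selector.update(state, mode, toy_reward(state, mode))
    return selector


selector = train(CoherenceSelector(MODES))
```

    After training, `selector.best(state)` returns the mode the toy environment rewards most for that state. In a real system the reward would presumably come from measured performance, as the abstract describes, rather than from a fixed table.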


    Published In

    MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture
    October 2021
    1322 pages
    ISBN:9781450385572
    DOI:10.1145/3466752
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cache coherence
    2. hardware accelerators
    3. q-learning
    4. system-on-chip

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate 484 of 2,242 submissions, 22%

    Cited By

    • (2023) Machine learning in run-time control of multicore processor systems. it - Information Technology 65:4-5 (164-176). DOI: 10.1515/itit-2023-0056. Online publication date: 2-Aug-2023
    • (2023) SoCurity: A Design Approach for Enhancing SoC Security. IEEE Computer Architecture Letters 22:2 (105-108). DOI: 10.1109/LCA.2023.3301448. Online publication date: 1-Jul-2023
    • (2023) Redwood: Flexible and Portable Heterogeneous Tree Traversal Workloads. 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (201-213). DOI: 10.1109/ISPASS57527.2023.00028. Online publication date: Apr-2023
    • (2023) IXIAM: ISA EXtension for Integrated Accelerator Management. IEEE Access 11 (33768-33791). DOI: 10.1109/ACCESS.2023.3264265. Online publication date: 2023
    • (2023) Characterization of a Coherent Hardware Accelerator Framework for SoCs. Embedded Computer Systems: Architectures, Modeling, and Simulation (91-106). DOI: 10.1007/978-3-031-46077-7_7. Online publication date: 2-Jul-2023
    • (2022) HEROv2: Full-Stack Open-Source Research Platform for Heterogeneous Computing. IEEE Transactions on Parallel and Distributed Systems 33:12 (4368-4382). DOI: 10.1109/TPDS.2022.3189390. Online publication date: 1-Dec-2022
    • (2022) CASPHAr: Cache-Managed Accelerator Staging and Pipelining in Heterogeneous System Architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41:11 (4325-4336). DOI: 10.1109/TCAD.2022.3197535. Online publication date: 1-Nov-2022
    • (2022) OverGen: Improving FPGA Usability through Domain-Specific Overlay Generation. Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture (35-56). DOI: 10.1109/MICRO56248.2022.00018. Online publication date: 1-Oct-2022
    • (2022) A 12nm Agile-Designed SoC for Swarm-Based Perception with Heterogeneous IP Blocks, a Reconfigurable Memory Hierarchy, and an 800MHz Multi-Plane NoC. ESSCIRC 2022 - IEEE 48th European Solid State Circuits Conference (ESSCIRC) (269-272). DOI: 10.1109/ESSCIRC55480.2022.9911456. Online publication date: 19-Sep-2022
