DOI: 10.1145/3626202.3637571
research-article

MiCache: An MSHR-inclusive Non-blocking Cache Design for FPGAs

Published: 02 April 2024

Abstract

On FPGAs, customizing data parallelism can significantly improve the performance of applications. However, many applications, such as sparse matrix multiplication, exhibit irregular memory access patterns, for which further improvements are limited by low memory access efficiency. Traditional caches struggle with such patterns because of the massive number of cache misses. To address this, prior research efforts developed non-blocking caches with Miss Status Holding Registers (MSHRs) to manage cache misses and mitigate the stalls they cause. However, existing approaches allocate dedicated Block RAMs (BRAMs) for implementing MSHRs. This introduces complexity in MSHR configuration and potential resource inefficiency, as MSHR demand is highly dynamic when solving real-world problems. In this paper, we present MiCache, an MSHR-inclusive non-blocking cache design in which cache entries and MSHR entries share the same storage space, supporting the dynamic demand for MSHRs during application execution. We design a consistent storage structure for cache and MSHR entries, ensuring a unified and efficient mechanism for cache/MSHR lookup and data access. To further improve performance, we design a parallel dual pipeline: one pipeline processes requests from processing elements, and the other processes responses from off-chip memory. We implement and evaluate our proposal on a Xilinx Alveo U280 board. Evaluation results show that, compared to the state-of-the-art non-blocking cache design on FPGAs with equivalent cache configurations, MiCache reduces BRAM consumption by up to 17%. When using the same amount of BRAM resources, MiCache achieves up to 1.56x performance improvement.
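The abstract's core idea — cache entries and MSHR entries sharing one storage structure, with separate request and response pipelines — can be illustrated with a small behavioral model. This is a hypothetical sketch under assumed semantics (direct-mapped sets, one outstanding miss per line, miss merging into the pending entry), not the authors' actual RTL design; the class and method names are invented for illustration.

```python
from collections import deque

DATA, MSHR = "data", "mshr"

class MiCacheModel:
    """Behavioral sketch of an MSHR-inclusive cache: each storage entry is
    either cached DATA or a pending MSHR for the same tag, so MSHR capacity
    grows and shrinks with demand instead of living in dedicated BRAMs."""

    def __init__(self, num_sets):
        self.num_sets = num_sets
        # One direct-mapped entry per set: None, or [state, tag, payload].
        # For DATA entries the payload is the cached value; for MSHR entries
        # it is a queue of request IDs waiting on the outstanding miss.
        self.entries = [None] * num_sets
        self.mem_requests = deque()  # addresses issued to off-chip memory

    def access(self, addr, req_id):
        """Request pipeline: returns ('hit', value) or ('miss', None)."""
        s, tag = addr % self.num_sets, addr // self.num_sets
        e = self.entries[s]
        if e is not None and e[1] == tag:
            if e[0] == DATA:               # cache hit: data already present
                return ("hit", e[2])
            e[2].append(req_id)            # secondary miss: merge into MSHR
            return ("miss", None)
        # Primary miss: the set's entry becomes an MSHR (evicting any old
        # data) and exactly one off-chip request is issued for this line.
        self.entries[s] = [MSHR, tag, deque([req_id])]
        self.mem_requests.append(addr)
        return ("miss", None)

    def fill(self, addr, value):
        """Response pipeline: convert the MSHR entry back into a DATA entry
        and release the request IDs that were waiting on the miss."""
        s, tag = addr % self.num_sets, addr // self.num_sets
        e = self.entries[s]
        assert e is not None and e[0] == MSHR and e[1] == tag
        waiters = list(e[2])
        self.entries[s] = [DATA, tag, value]
        return waiters
```

Because an entry in the MSHR state occupies the same slot the fetched line will later fill, lookup and allocation use one mechanism for both roles, which is the unified cache/MSHR structure the abstract describes; the two methods correspond loosely to the two pipelines operating in parallel.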



    Published In

    FPGA '24: Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays
    April 2024
    300 pages
    ISBN:9798400704185
    DOI:10.1145/3626202
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. field-programmable gate arrays
    2. hardware acceleration
    3. non-blocking cache

    Conference

    FPGA '24

    Acceptance Rates

    Overall Acceptance Rate 125 of 627 submissions, 20%
