
Memory Interface Design for 3D Stencil Kernels on a Massively Parallel Memory System

Published: 11 September 2015

Abstract

Massively parallel memory systems are designed to deliver high bandwidth at relatively low clock speed for memory-intensive applications implemented on programmable logic. For example, the Convey HC-1 provides 1,024 DRAM banks to each of four FPGAs through a full crossbar, presenting a peak bandwidth of 76.8 GB/s to the user logic. Such highly parallel memory systems suffer from high latency, and their effective bandwidth is highly sensitive to access ordering. To achieve high performance, the user must use a customized memory interface that combines scheduling, latency hiding, and data reuse. In this article, we describe the design of a custom memory interface for 3D stencil kernels on the Convey HC-1 that incorporates these features. Experimental results show that, compared to a naive memory interface, the proposed memory interface achieves a runtime speedup of 2.2× for the 6-point stencil and 9.5× for the 27-point stencil.
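As an illustration of why data reuse matters for these kernels, the following minimal Python sketch applies a 6-point and a 27-point 3D stencil and counts how many grid reads a straightforward implementation issues. It is not the article's hardware design: the names `make_grid`, `stencil`, `OFFSETS_6`, and `OFFSETS_27` are hypothetical, and the unweighted averaging of neighbours is an assumption made only to keep the example short.

```python
from itertools import product


def make_grid(n):
    """Return an n x n x n grid (nested lists) filled with the test pattern x + y + z."""
    return [[[float(x + y + z) for x in range(n)] for y in range(n)] for z in range(n)]


# Offsets of the six face neighbours (6-point stencil), as (dz, dy, dx).
OFFSETS_6 = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

# Every offset in the 3 x 3 x 3 cube, centre included (27-point stencil).
OFFSETS_27 = list(product((-1, 0, 1), repeat=3))


def stencil(grid, offsets):
    """Average the points selected by `offsets` around every interior point.

    Returns the output grid and the number of grid reads issued.
    Unweighted averaging is an assumption; real kernels usually use
    per-neighbour coefficients.
    """
    n = len(grid)
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    reads = 0
    for z in range(1, n - 1):
        for y in range(1, n - 1):
            for x in range(1, n - 1):
                total = 0.0
                for dz, dy, dx in offsets:
                    total += grid[z + dz][y + dy][x + dx]
                    reads += 1  # one read per neighbour, with no reuse
                out[z][y][x] = total / len(offsets)
    return out, reads


if __name__ == "__main__":
    n = 6
    grid = make_grid(n)
    _, reads_6 = stencil(grid, OFFSETS_6)
    _, reads_27 = stencil(grid, OFFSETS_27)
    print("grid values:", n ** 3)
    print("reads, 6-point:", reads_6)
    print("reads, 27-point:", reads_27)
```

For a 6 × 6 × 6 grid there are 64 interior points, so the sketch issues 384 reads for the 6-point stencil and 1,728 reads for the 27-point stencil against only 216 stored values. Each value is fetched many times unless it is buffered and reused, and the gap is wider for the 27-point case.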

Cited By

  • (2022) Efficient Homomorphic Convolution Designs on FPGA for Secure Inference. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 30:11 (1691-1704). DOI: 10.1109/TVLSI.2022.3197895. Online publication date: Nov-2022.
  • (2019) DCMI. ACM Transactions on Architecture and Code Optimization 16:4 (1-24). DOI: 10.1145/3352813. Online publication date: 11-Oct-2019.

    Published In

    ACM Transactions on Reconfigurable Technology and Systems  Volume 8, Issue 4
    October 2015
    134 pages
    ISSN:1936-7406
    EISSN:1936-7414
    DOI:10.1145/2822909
    Editor: Steve Wilton
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 11 September 2015
    Accepted: 01 June 2015
    Revised: 01 May 2015
    Received: 01 November 2014
    Published in TRETS Volume 8, Issue 4

    Author Tags

    1. 3D stencil
    2. memory latency hiding
    3. data reuse
    4. memory access scheduling
    5. memory interface

    Qualifiers

    • Research-article
    • Research
    • Refereed
