Memory Interface Design for 3D Stencil Kernels on a Massively Parallel Memory System

Authors:

Jason D. Bakos

ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 8, Issue 4

Article No.: 24, Pages 1 - 24

https://doi.org/10.1145/2800788

Published: 11 September 2015

Abstract

Massively parallel memory systems are designed to deliver high bandwidth at relatively low clock speeds for memory-intensive applications implemented on programmable logic. For example, the Convey HC-1 provides 1,024 DRAM banks to each of four FPGAs through a full crossbar, presenting a peak bandwidth of 76.8 GB/s to the user logic. Such highly parallel memory systems suffer from high latency, and their effective bandwidth is highly sensitive to access ordering. To achieve high performance, the user must use a customized memory interface that combines scheduling, latency hiding, and data reuse. In this article, we describe the design of a custom memory interface for 3D stencil kernels on the Convey HC-1 that incorporates these features. Experimental results show that the proposed memory interface achieves a runtime speedup of 2.2× for a 6-point stencil and 9.5× for a 27-point stencil when compared to a naive memory interface.
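To make the two stencil shapes named in the abstract concrete, the following is a minimal software sketch, not the article's hardware design. It assumes the 6-point stencil reads the six face neighbours of a grid point and the 27-point stencil reads the full 3×3×3 neighbourhood; the plain-average coefficients and the names `stencil_offsets` and `apply_stencil` are hypothetical and chosen only for illustration. The read counter shows how many grid values a kernel would request if nothing were reused between neighbouring output points.

```python
from itertools import product


def stencil_offsets(points):
    """Return the (dx, dy, dz) neighbour offsets for a 3D stencil.

    Assumption: 6-point = six face neighbours (centre excluded),
    27-point = every cell of the 3x3x3 cube (centre included).
    """
    cube = list(product((-1, 0, 1), repeat=3))
    if points == 6:
        return [o for o in cube if abs(o[0]) + abs(o[1]) + abs(o[2]) == 1]
    if points == 27:
        return cube
    raise ValueError("only 6-point and 27-point stencils are sketched here")


def apply_stencil(grid, points):
    """Apply an averaging stencil to the interior of a 3D nested-list grid.

    Returns (new_grid, reads), where reads counts every neighbour value
    fetched, i.e. the traffic of a kernel with no data reuse.
    Boundary cells are copied through unchanged.
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    offsets = stencil_offsets(points)
    out = [[[grid[x][y][z] for z in range(nz)]
            for y in range(ny)] for x in range(nx)]
    reads = 0
    for x in range(1, nx - 1):
        for y in range(1, ny - 1):
            for z in range(1, nz - 1):
                total = 0.0
                for dx, dy, dz in offsets:
                    total += grid[x + dx][y + dy][z + dz]
                    reads += 1
                # Hypothetical coefficients: a plain average of the neighbourhood.
                out[x][y][z] = total / len(offsets)
    return out, reads
```

On a 4×4×4 grid there are 8 interior points, so `apply_stencil(grid, 6)` counts 48 reads and `apply_stencil(grid, 27)` counts 216, even though the grid holds only 64 distinct values. That gap is the redundancy a data-reuse scheme can target.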

Cited By

X. Hu, M. Li, J. Tian, and Z. Wang. 2022. Efficient homomorphic convolution designs on FPGA for secure inference. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 30, 11, 1691--1704. https://doi.org/10.1109/TVLSI.2022.3197895

M. Koraei, O. Fatemi, and M. Jahre. 2019. DCMI. ACM Transactions on Architecture and Code Optimization 16, 4, 1--24. https://dl.acm.org/doi/10.1145/3352813

Index Terms

Memory Interface Design for 3D Stencil Kernels on a Massively Parallel Memory System
1. Networks
  1. Network protocols

Published In

ACM Transactions on Reconfigurable Technology and Systems Volume 8, Issue 4

October 2015

134 pages

ISSN: 1936-7406

EISSN: 1936-7414

DOI: 10.1145/2822909

Editor:
Steve Wilton
Department of Electrical and Computer Engineering / University of British Columbia / Kaiser 4112, 5500-2332 Main Mall / Vancouver, BC V6T 1Z4 Canada

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 September 2015

Accepted: 01 June 2015

Revised: 01 May 2015

Received: 01 November 2014

Published in TRETS Volume 8, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Science Foundation
