research-article

Managing DRAM latency divergence in irregular GPGPU applications

Authors: Niladrish Chatterjee, Gabriel H. Loh, Nuwan Jayasena, Rajeev Balasubramonian

SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Pages 128 - 139

https://doi.org/10.1109/SC.2014.16

Published: 16 November 2014

Abstract

Memory controllers in modern GPUs aggressively reorder requests for high bandwidth usage, often interleaving requests from different warps. This leads to high variance in the latency of different requests issued by the threads of a warp. Since a warp in a SIMT architecture can proceed only when all of its memory requests are returned by memory, such latency divergence causes significant slowdown when running irregular GPGPU applications. To solve this issue, we propose memory scheduling mechanisms that avoid inter-warp interference in the DRAM system to reduce the average memory stall latency experienced by warps. We further reduce latency divergence through mechanisms that coordinate scheduling decisions across multiple independent memory channels. Finally, we show that carefully orchestrating the memory scheduling policy can achieve low average latency for warps without compromising bandwidth utilization. Our combined scheme yields a 10.1% performance improvement for irregular GPGPU workloads relative to a throughput-optimized GPU memory controller.
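
The stall behaviour described in the abstract can be illustrated with a toy model. The sketch below is not the authors' scheduler or simulator. It assumes a single DRAM channel that services exactly one request per time unit, and the names `warp_timing`, `interleaved`, and `grouped` are hypothetical, chosen only for this illustration. It shows that a warp's stall time is set by its slowest request, so servicing each warp's requests together can lower the average stall without changing how busy the channel is.

```python
from statistics import mean


def warp_timing(service_order, service_time=1):
    """Toy single-channel model (an assumption, not the paper's simulator).

    service_order lists the warp id owning each request, in the order the
    memory controller services them. Each request takes service_time units.
    Returns (stall, divergence):
      stall[w]      = time at which warp w's last request returns
      divergence[w] = gap between warp w's first and last returned request
    """
    first, last = {}, {}
    clock = 0
    for warp_id in service_order:
        clock += service_time
        first.setdefault(warp_id, clock)  # remember the earliest return
        last[warp_id] = clock             # overwritten until the final return
    divergence = {w: last[w] - first[w] for w in last}
    return last, divergence


# Two warps, four requests each; both orders keep the channel equally busy.
interleaved = ["A", "B"] * 4         # requests from different warps interleaved
grouped = ["A"] * 4 + ["B"] * 4      # each warp's requests serviced together

for name, order in (("interleaved", interleaved), ("grouped", grouped)):
    stall, divergence = warp_timing(order)
    print(name, "stall:", stall, "divergence:", divergence,
          "mean stall:", mean(stall.values()))
```

In this toy setting both orders finish all eight requests at time 8, yet grouping lowers the mean warp stall from 7.5 to 6 time units and the per-warp divergence from 6 to 3. Real DRAM timing (row-buffer hits, bank conflicts, multiple channels) is deliberately left out.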

Published In

SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2014

1054 pages

ISBN: 9781479955008

General Chair: Trish Damkroger, Lawrence Livermore National Laboratory, Livermore, California

Program Chair: Jack Dongarra, University of Tennessee, Knoxville, Tennessee

Publisher

IEEE Press

Qualifiers

Research-article

Conference

SC '14

Sponsors: SIGHPC, SIGARCH, IEEE-CS

SC '14: International Conference for High Performance Computing, Networking, Storage and Analysis

November 16 - 21, 2014

New Orleans, Louisiana

Acceptance Rates

SC '14 Paper Acceptance Rate 83 of 394 submissions, 21%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics

Total Citations: 35
Total Downloads: 366
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 18 Aug 2024

Cited By

Mostofi S, Falahati H, Mahani N, Lotfi-Kamran P, Sarbazi-Azad H (2023). Snake: A Variable-length Chain-based Prefetching for GPUs. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 728-741. Online publication date: 28-Oct-2023.
https://dl.acm.org/doi/10.1145/3613424.3623782

Jeong J, Yoon M, Oh Y, Koo G (2023). Warped-MC: An Efficient Memory Controller Scheme for Massively Parallel Processors. Proceedings of the 52nd International Conference on Parallel Processing, pp. 546-555. Online publication date: 7-Aug-2023.
https://dl.acm.org/doi/10.1145/3605573.3605645

Zhao H, Zhang L, Zhang F, Thapliyal H, DeMara R, Partin-Vaisband I, Katkoori S (2023). RBGC: Repurpose the Buffer of Fixed Graphics Pipeline to Enhance GPU Cache. Proceedings of the Great Lakes Symposium on VLSI 2023, pp. 173-177. Online publication date: 5-Jun-2023.
https://dl.acm.org/doi/10.1145/3583781.3590305

Belayneh L, Ye H, Chen K, Blaauw D, Mudge T, Dreslinski R, Talati N, Kloeckner A, Moreira J (2022). Locality-Aware Optimizations for Improving Remote Memory Latency in Multi-GPU Systems. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pp. 304-316. Online publication date: 8-Oct-2022.
https://dl.acm.org/doi/10.1145/3559009.3569649

Zhang J, Swift M, Li J, Falsafi B, Ferdman M, Lu S, Wenisch T (2022). Software-defined address mapping: a case on 3D memory. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 70-83. Online publication date: 28-Feb-2022.
https://dl.acm.org/doi/10.1145/3503222.3507774

Zhao X, Jahre M, Eeckhout L, Larus J, Ceze L, Strauss K (2020). HSM. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 1371-1385. Online publication date: 9-Mar-2020.
https://dl.acm.org/doi/10.1145/3373376.3378457

Oh Y, Koo G, Annavaram M, Ro W, Manne S, Hunter H, Altman E (2019). Linebacker. Proceedings of the 46th International Symposium on Computer Architecture, pp. 183-196. Online publication date: 22-Jun-2019.
https://dl.acm.org/doi/10.1145/3307650.3322222

Li C, Ausavarungnirun R, Rossbach C, Zhang Y, Mutlu O, Guo Y, Yang J, Bahar I, Herlihy M, Witchel E, Lebeck A (2019). A Framework for Memory Oversubscription Management in Graphics Processing Units. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 49-63. Online publication date: 4-Apr-2019.
https://dl.acm.org/doi/10.1145/3297858.3304044

Ausavarungnirun R, Miller V, Landgraf J, Ghose S, Gandhi J, Jog A, Rossbach C, Mutlu O (2018). MASK. ACM SIGPLAN Notices 53:2, pp. 503-518. Online publication date: 19-Mar-2018.
https://dl.acm.org/doi/10.1145/3296957.3173169

Yazdanbakhsh A, Song C, Sacks J, Lotfi-Kamran P, Esmaeilzadeh H, Kim N, Evripidou S, Stenström P, O'Boyle M (2018). In-DRAM near-data approximate acceleration for GPUs. Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pp. 1-14. Online publication date: 1-Nov-2018.
https://dl.acm.org/doi/10.1145/3243176.3243188
