DOI: 10.1109/SC.2014.16

Managing DRAM latency divergence in irregular GPGPU applications

Published: 16 November 2014

Abstract

Memory controllers in modern GPUs aggressively reorder requests for high bandwidth usage, often interleaving requests from different warps. This leads to high variance in the latency of different requests issued by the threads of a warp. Since a warp in a SIMT architecture can proceed only when all of its memory requests are returned by memory, such latency divergence causes significant slowdown when running irregular GPGPU applications. To solve this issue, we propose memory scheduling mechanisms that avoid inter-warp interference in the DRAM system to reduce the average memory stall latency experienced by warps. We further reduce latency divergence through mechanisms that coordinate scheduling decisions across multiple independent memory channels. Finally, we show that carefully orchestrating the memory scheduling policy can achieve low average latency for warps without compromising bandwidth utilization. Our combined scheme yields a 10.1% performance improvement for irregular GPGPU workloads relative to a throughput-optimized GPU memory controller.
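
The effect described in the abstract can be illustrated with a toy model. The sketch below is not the paper's scheduler: it assumes a single memory channel that serves exactly one request per time unit and ignores banks, row buffers, and timing constraints. The function names (`warp_completion_times`, `interleaved`, `warp_grouped`) and the request counts are hypothetical, chosen only for illustration. Because a warp resumes only when its last request returns, its stall latency is the maximum latency over its requests, so the order in which the controller serves requests changes the average warp latency even when total service time stays the same.

```python
from itertools import zip_longest


def warp_completion_times(service_order):
    """Map each warp to the time at which its LAST request is served.

    Toy assumption: one channel, one request served per time unit.
    """
    done = {}
    for t, warp in enumerate(service_order, start=1):
        done[warp] = t  # later requests overwrite earlier ones, leaving the max
    return done


def interleaved(requests):
    """Serve requests round-robin across warps (inter-warp interleaving)."""
    rounds = zip_longest(*[[w] * n for w, n in requests.items()])
    return [w for rnd in rounds for w in rnd if w is not None]


def warp_grouped(requests):
    """Serve all requests of one warp before the next (fewest requests first)."""
    order = sorted(requests, key=lambda w: requests[w])
    return [w for w in order for _ in range(requests[w])]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical workload: two warps, three outstanding requests each.
requests = {"A": 3, "B": 3}

rr = warp_completion_times(interleaved(requests))    # order: A B A B A B
grp = warp_completion_times(warp_grouped(requests))  # order: A A A B B B

print("interleaved :", rr, "mean stall =", mean(rr.values()))
print("warp-grouped:", grp, "mean stall =", mean(grp.values()))
```

In this toy case both orders finish all six requests at time 6, so channel throughput is unchanged, but the mean warp stall drops from 5.5 to 4.5 time units when each warp's requests are served together. A real controller must also weigh row-buffer locality and bank parallelism, which this sketch leaves out.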

Published In

SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2014
1054 pages
ISBN:9781479955008
  • General Chair: Trish Damkroger
  • Program Chair: Jack Dongarra

Publisher

IEEE Press

Qualifiers

  • Research-article

Conference

SC '14
Acceptance Rates

SC '14 Paper Acceptance Rate 83 of 394 submissions, 21%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Cited By

  • (2023) Snake: A Variable-length Chain-based Prefetching for GPUs. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 728-741. DOI: 10.1145/3613424.3623782. Online publication date: 28-Oct-2023
  • (2023) Warped-MC: An Efficient Memory Controller Scheme for Massively Parallel Processors. Proceedings of the 52nd International Conference on Parallel Processing, pp. 546-555. DOI: 10.1145/3605573.3605645. Online publication date: 7-Aug-2023
  • (2023) RBGC: Repurpose the Buffer of Fixed Graphics Pipeline to Enhance GPU Cache. Proceedings of the Great Lakes Symposium on VLSI 2023, pp. 173-177. DOI: 10.1145/3583781.3590305. Online publication date: 5-Jun-2023
  • (2022) Locality-Aware Optimizations for Improving Remote Memory Latency in Multi-GPU Systems. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pp. 304-316. DOI: 10.1145/3559009.3569649. Online publication date: 8-Oct-2022
  • (2022) Software-defined address mapping: a case on 3D memory. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 70-83. DOI: 10.1145/3503222.3507774. Online publication date: 28-Feb-2022
  • (2020) HSM. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 1371-1385. DOI: 10.1145/3373376.3378457. Online publication date: 9-Mar-2020
  • (2019) Linebacker. Proceedings of the 46th International Symposium on Computer Architecture, pp. 183-196. DOI: 10.1145/3307650.3322222. Online publication date: 22-Jun-2019
  • (2019) A Framework for Memory Oversubscription Management in Graphics Processing Units. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 49-63. DOI: 10.1145/3297858.3304044. Online publication date: 4-Apr-2019
  • (2018) MASK. ACM SIGPLAN Notices 53:2, pp. 503-518. DOI: 10.1145/3296957.3173169. Online publication date: 19-Mar-2018
  • (2018) In-DRAM near-data approximate acceleration for GPUs. Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pp. 1-14. DOI: 10.1145/3243176.3243188. Online publication date: 1-Nov-2018
