Building Heterogeneous Unified Virtual Memories (UVMs) without the Overhead

Authors: Konstantinos Koukos, Erik Hagersten, and Stefanos Kaxiras

ACM Transactions on Architecture and Code Optimization (TACO), Volume 13, Issue 1

Article No.: 1, Pages 1 - 22

https://doi.org/10.1145/2889488

Published: 28 March 2016

Abstract

This work proposes a novel scheme to facilitate heterogeneous systems with unified virtual memory. Existing research proposals implement coherence protocols that enforce sequential consistency (SC) between central processing unit (CPU) cores and between devices. Such mechanisms introduce severe bottlenecks in the system; therefore, we adopt the heterogeneous-race-free (HRF) memory model. Using HRF simplifies both the coherence protocol and the graphics processing unit (GPU) memory management unit (MMU). Our protocol optimizes for CPU and GPU demands separately: the GPU part is simpler, while the CPU part is more elaborate and latency aware. We achieve an average 45% speedup and a 45% reduction in energy-delay product (20% in energy) over the corresponding SC implementation.
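
The three reported averages can be cross-checked against the standard definition of energy-delay product (EDP = energy × delay). The sketch below is only a back-of-the-envelope check, not the authors' methodology. It assumes that "45% speedup" means the SC baseline takes 1.45 times as long to run, and that "20% energy" means a 20% reduction in energy. The function name `edp_ratio` is hypothetical.

```python
def edp_ratio(speedup, energy_reduction):
    """Return new EDP divided by baseline EDP.

    Assumptions (not stated in the abstract):
      - speedup = baseline_time / new_time
      - energy_reduction is the fractional drop in energy
    """
    delay_ratio = 1.0 / speedup              # new delay relative to baseline
    energy_ratio = 1.0 - energy_reduction    # new energy relative to baseline
    return energy_ratio * delay_ratio        # EDP = energy * delay


ratio = edp_ratio(1.45, 0.20)
edp_reduction = 1.0 - ratio
print(f"EDP reduction: {edp_reduction:.1%}")  # prints "EDP reduction: 44.8%"
```

Under these assumptions the implied EDP reduction is about 45%, which agrees with the figure reported in the abstract.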

Index Terms

1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems

Published In

ACM Transactions on Architecture and Code Optimization Volume 13, Issue 1

April 2016

347 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2899032

Editor:
Koen De Bosschere
Ghent University

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 March 2016

Accepted: 01 November 2015

Revised: 01 October 2015

Received: 01 May 2015

Published in TACO Volume 13, Issue 1

Qualifiers

Research-article
Research
Refereed

Funding Sources

Swedish Research Council UPMARC Linnaeus Centre
European Commission FEDER funds
“Fundación Seneca-Agencia de Ciencia y Tecnología de la Región de Murcia”
the Spanish MINECO
EU Project LPGPU
the project “Jóvenes Líderes en Investigación”
