
Filtering Translation Bandwidth with Virtual Caching

Published: 19 March 2018

    Abstract

    Heterogeneous computing with GPUs integrated on the same chip as CPUs is ubiquitous, and to increase programmability many of these systems support virtual address accesses from GPU hardware. However, this entails address translation on every memory access. We observe that future GPUs and workloads show very high bandwidth demands (up to 4 accesses per cycle in some cases) for shared address translation hardware due to frequent private TLB misses. This greatly impacts performance (32% average performance degradation relative to an ideal MMU). To mitigate this overhead, we propose a software-agnostic, practical GPU virtual cache hierarchy. We use the virtual cache hierarchy as an effective address translation bandwidth filter. We observe that many requests that miss in private TLBs find corresponding valid data in the GPU cache hierarchy. With a GPU virtual cache hierarchy, these TLB misses can be filtered (i.e., they become virtual cache hits), significantly reducing bandwidth demands for the shared address translation hardware. In addition, accelerator-specific attributes of GPUs (e.g., a lower likelihood of synonyms) reduce the design complexity of virtual caches, making a whole virtual cache hierarchy (including a shared L2 cache) practical for GPUs. Our evaluation shows that the entire GPU virtual cache hierarchy effectively filters the high address translation bandwidth, achieving almost the same performance as an ideal MMU. We also evaluate L1-only virtual cache designs and show that using a whole virtual cache hierarchy obtains additional performance benefits (1.31× speedup on average).
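
The filtering effect described in the abstract can be illustrated with a toy model. The sketch below is not the paper's design or simulator. It assumes a single private TLB and a single cache, both fully associative with LRU replacement, and it assumes that in the virtual-cache configuration the TLB is consulted only after a cache miss. The page size, line size, and all names (`LRUSet`, `shared_translation_requests`) are hypothetical and chosen for illustration only.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes (illustrative)
LINE_SIZE = 64    # assumed cache line size in bytes (illustrative)


class LRUSet:
    """Tiny fully associative LRU structure keyed by tag (toy model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def access(self, tag):
        """Return True on a hit; on a miss, insert the tag and evict the LRU entry."""
        if tag in self.entries:
            self.entries.move_to_end(tag)
            return True
        self.entries[tag] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
        return False


def shared_translation_requests(addresses, tlb_entries, cache_lines, virtual_cache):
    """Count accesses that miss the private TLB and reach shared translation hardware."""
    tlb = LRUSet(tlb_entries)
    cache = LRUSet(cache_lines)
    requests = 0
    for vaddr in addresses:
        page, line = vaddr // PAGE_SIZE, vaddr // LINE_SIZE
        if virtual_cache:
            # Virtually tagged cache: look up by virtual address first.
            # A hit needs no translation at all, so it filters the TLB access.
            if cache.access(line):
                continue
            if not tlb.access(page):
                requests += 1
        else:
            # Physical cache: every access is translated before the cache lookup.
            if not tlb.access(page):
                requests += 1
            cache.access(line)
    return requests


if __name__ == "__main__":
    # Eight pages touched round-robin, one line per page, repeated ten times.
    # The working set thrashes a 4-entry TLB but fits easily in a 16-line cache.
    trace = [p * PAGE_SIZE for p in range(8)] * 10
    print(shared_translation_requests(trace, 4, 16, virtual_cache=False))  # 80
    print(shared_translation_requests(trace, 4, 16, virtual_cache=True))   # 8
```

In this contrived trace every access misses the small TLB when translation comes first, while the virtual cache configuration needs translation only for the eight cold misses. Real workloads will show a smaller and more variable gap; the point is only that data resident in the cache can satisfy accesses whose translations are no longer in the private TLB.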

    Published In

    ACM SIGPLAN Notices, Volume 53, Issue 2
    ASPLOS '18
    February 2018
    809 pages
    ISSN: 0362-1340
    EISSN: 1558-1160
    DOI: 10.1145/3296957

    ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems
    March 2018
    827 pages
    ISBN: 9781450349116
    DOI: 10.1145/3173162

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. TLB
    2. address translation
    3. heterogeneous computing
    4. virtual caching
    5. virtual memory

    Qualifiers

    • Research-article

    Funding Sources

    • University of Wisconsin Foundation (John P. Morgridge Professor)
    • National Science Foundation
    • William F. Vilas Trust Estate (Vilas Research Professor)
    • Wisconsin Alumni Research Foundation
